RTextTools: a machine learning library for text classification

RTextTools v1.1 Released

8/2/2011

A major upgrade of RTextTools has been released, including many optimizations, UI changes, and features based on feedback from the 2011 CAP Conference in Catania. Changes include a new low-memory algorithm (GLMNET), full user documentation, a simpler user interface, bundled datasets, better analytics for both virgin (unlabeled) and non-virgin (labeled) data classification, and simpler installation.

Give the latest release a spin and let us know what you think! Head over to the Install RTextTools page for installation instructions, and then read the quick-start guide, view the documentation or download the example scripts to get started.

Please note that previous users of RTextTools will need to remove the maxent library before installing, as the library has been largely rewritten and is now available via CRAN.

RTextTools Improvements Underway

7/12/2011

Since RTextTools' unveiling at the 2011 CAP Conference in Catania, the development team has been busy refining the package. The refinements include changes to simplify the API, improve analytics, decrease memory use, and increase functionality. We've added support for another low-memory algorithm (GLMNET) in addition to the two existing low-memory algorithms (SVM and MAXENT). Additionally, we've added n-fold cross-validation for all nine algorithms.
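
As a rough illustration of what n-fold cross-validation does, here is a minimal Python sketch. It is not RTextTools code: the function names (n_fold_indices, cross_validate, majority_baseline) are hypothetical, and a majority-class baseline stands in for a real classifier. The idea is that each document is held out exactly once, and accuracy is measured on the held-out fold.

```python
from collections import Counter


def n_fold_indices(n_items, n_folds):
    """Yield (train_indices, test_indices) once per fold."""
    if not 2 <= n_folds <= n_items:
        raise ValueError("n_folds must be between 2 and n_items")
    for fold in range(n_folds):
        # Fold f holds out every n_folds-th item starting at f,
        # so fold sizes differ by at most one.
        test = list(range(fold, n_items, n_folds))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


def majority_baseline(labels, train_idx, test_idx):
    """Stand-in classifier (an assumption): always predict the most
    common training label."""
    top_label = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return [top_label for _ in test_idx]


def cross_validate(labels, n_folds, classifier=majority_baseline):
    """Return the accuracy on each held-out fold."""
    accuracies = []
    for train_idx, test_idx in n_fold_indices(len(labels), n_folds):
        predicted = classifier(labels, train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return accuracies
```

Averaging the per-fold accuracies gives a single estimate that is less sensitive to one lucky or unlucky train/test split.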

Last, but certainly not least, we've fully documented RTextTools and are in the process of publishing the first paper about the package. Although RTextTools v1.1 is not yet released, you can get a sneak peek at the functionality by glancing at the documentation.

Next Steps: Drafting the R Help Files

6/26/2011

With RTextTools now released and feedback coming in, the development team is starting work on the help documentation for the library. Currently, you cannot access help files about the library or its functions from within R. However, we do offer a draft of a quick-start guide in PDF format under the Documentation section of the website.

Stay tuned for a release in the next few months with a slew of new features and refinements.

Binary Installation Now Available

6/18/2011

The biggest complaint we heard about the installation process was that Xcode (account required) and Rtools were required on Mac OS X and Windows, respectively. Today we released universal binaries (PPC/i386/x86_64) for Mac OS X 10.5+ as well as binaries (i386/x86_64) for Windows. This addition will significantly reduce the amount of time it takes to install RTextTools.

We're gearing up to unveil RTextTools next Friday in Catania. See you there!

RTextTools now 100% Java-free!

6/15/2011

When we first wrote RTextTools, we opted to use RWeka for boosting and bagging algorithms for lack of a better alternative. We've discovered that this leads to all sorts of ugly rJava installation issues across platforms and prevents our users from getting started quickly. Recently, we've stumbled upon two excellent non-Java alternatives: LogitBoost in the caTools package, and the bagging implementation in the ipred package.

Consequently, we've eliminated RWeka from the list of dependencies and significantly streamlined the installation process. Windows users only need to deal with the Rtools installation now; developer Loren Collingwood left a helpful comment regarding Rtools in a previous post.

Maximum Entropy Now Supported for Windows

6/10/2011

After several weeks of trying to find the source of a bug in the maximum entropy library when compiling on Windows, Dirk Eddelbuettel pointed me in the right direction to resolve the issue. Although it required a rewrite of the library using the new Rcpp API, maximum entropy now installs on Windows machines when Rtools is installed.

This matters because RTextTools now has two low-memory algorithms (support vector machines and maximum entropy) working cross-platform. This significantly raises accuracy for Windows users and simplifies the installation process.

Drafting the Documentation for RTextTools

6/7/2011

In preparation for The 4th Annual Conference of the Comparative Policy Agendas Project in Catania, Sicily, our development team has been busy drafting the documentation for RTextTools. In addition to standard documentation of functions, we want to provide quick-start guides, sample datasets, example scripts, and Amazon EC2 instructions to make it as easy as possible for researchers to get up and running.

A big part of developing the documentation is understanding what methods work best for which datasets. We welcome any feedback you may have regarding what worked for you, and what dataset you were operating on. You can find our contact information on the About the Project page.

Reduce Memory Use for Large Datasets

6/1/2011

One key limiting factor for automated text classification is memory consumption. As you accumulate more news articles, bills, and legal opinions, the term-document matrices used to represent the data grow quickly. RTextTools provides two algorithms, support vector machines and maximum entropy, that can handle large datasets with very little memory. Luckily, these two algorithms tend to be the most accurate as well. However, some applications require an ensemble of more than two algorithms to get an accurate scoring of topic codes.

First, you can try reducing the number of terms in your matrix. The create_matrix() function provides many features that can help remove noise from your dataset. The defaults remove stopwords, remove punctuation, convert words to lowercase, and strip whitespace, but there are other helpful tools as well. You can set a minimum word length (e.g. minWordLength=5), select the N most frequent terms from each document (e.g. selectFreqTerms=25), set a minimum word frequency per document (e.g. minDocFreq=3), and remove terms whose sparsity exceeds a threshold N (e.g. removeSparseTerms=0.9998).
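
To make the effect of these filters concrete, here is a minimal Python sketch of the same kind of pruning. It is a language-agnostic illustration and not how create_matrix() is implemented: the function names (prune_terms, vocabulary_size), the parameter names, and the tiny stopword list are all assumptions made for the example. It lowercases the text, drops punctuation and stopwords, enforces a minimum word length, and keeps only the N most frequent terms in each document.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}


def prune_terms(documents, min_word_length=5, top_n_per_doc=25):
    """Return one Counter of term counts per document after pruning."""
    pruned = []
    for text in documents:
        # Lowercase, then keep alphabetic runs only; this also discards
        # punctuation and surplus whitespace.
        tokens = re.findall(r"[a-z]+", text.lower())
        tokens = [
            t for t in tokens
            if t not in STOPWORDS and len(t) >= min_word_length
        ]
        counts = Counter(tokens)
        # Keep only the N most frequent terms of this document.
        pruned.append(Counter(dict(counts.most_common(top_n_per_doc))))
    return pruned


def vocabulary_size(doc_counts):
    """Number of distinct terms, i.e. the term dimension of the matrix."""
    return len(set().union(*doc_counts))
```

Comparing vocabulary_size() before and after tightening a filter shows how much smaller the term dimension of the matrix becomes, which is where the memory savings come from.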

These options can help you reduce the size of your document matrix, but they also can remove some information that may be valuable for the learning algorithms. If you just need the resources to run a huge dataset, and nothing above helps, you should look into setting up an Amazon EC2 instance with RStudio installed. We plan to create a simple way of doing this in the near future, but you'll have to brave the stormy waters for now. Be warned, this option is for experienced users only!

Preparing RTextTools Beta Release for Catania 2011

5/23/2011

Right now our development team is busy preparing a conference release of RTextTools for The 4th Annual Conference of the Comparative Policy Agendas Project at the University of Catania in Sicily. One of the key issues we've had thus far is memory consumption with very large datasets.

In the past week we've pushed out a slew of updates that allow the support vector machine and maximum entropy algorithms to run with low memory requirements, even on very large datasets. Unfortunately, not all the algorithms used in RTextTools support the changes we've made, so this leaves us with a two-algorithm ensemble for low-memory classification. However, SVM and maxent tend to be the most accurate algorithms in our tests, meaning that a large ensemble isn't necessary to get high consensus accuracy.
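
One way to think about consensus accuracy for a two-algorithm ensemble is sketched below in Python. This is an assumption about how such a measure can be computed, not the package's own code, and the function name consensus_summary is hypothetical. It reports coverage (the share of documents on which both classifiers agree) and the accuracy measured on just those agreed-upon documents.

```python
def consensus_summary(predictions_a, predictions_b, true_labels):
    """Return (coverage, consensus_accuracy) for two classifiers.

    coverage: fraction of documents where both classifiers agree.
    consensus_accuracy: fraction of those agreed documents that are correct.
    """
    if not (len(predictions_a) == len(predictions_b) == len(true_labels)):
        raise ValueError("all three sequences must have the same length")
    if not true_labels:
        raise ValueError("need at least one document")

    # Keep (prediction, truth) pairs only where the two classifiers agree.
    agreed = [
        (a, truth)
        for a, b, truth in zip(predictions_a, predictions_b, true_labels)
        if a == b
    ]
    coverage = len(agreed) / len(true_labels)
    accuracy = sum(a == truth for a, truth in agreed) / len(agreed) if agreed else 0.0
    return coverage, accuracy
```

Documents where the two classifiers disagree are the natural candidates for manual review, so higher coverage at a given accuracy means less hand-coding.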

RStudio and RTextTools: A Perfect Pairing

4/15/2011

The development team has spent the past six months creating the best possible experience for RTextTools users. A few months into development, we heard about a new IDE called RStudio, which has one of the cleanest interfaces to R we've seen. It integrates many R tools (graphing, file management, workspace management, tabbed source editor, and more) into a single, customizable interface.

Most of the development for RTextTools happens right in RStudio, as does much of the user testing. We've found it not only runs more smoothly, but also lets us easily import and view the datasets we'll be working with. And the best part? It's free, open-source, and cross-platform.

    Developer Blog

    Updates on RTextTools progress, tips, and examples.

    Note: RTextTools is no longer actively maintained -- the software may contain bugs that will not be fixed with newer versions of R.

    By Author

    All
    Loren Collingwood
    Timothy P. Jurka
