RTextTools: a machine learning library for text classification

RTextTools v1.3.5: Saving models, text labels, and a game plan for 2012

1/1/2012

RTextTools v1.3.5 addresses some key concerns that have been raised in recent months. Many of the algorithms used in RTextTools require that any new data presented to a trained classifier contain the same features as the original document-term matrix. Since this rarely (if ever) happens in the real world, I have added an originalMatrix parameter to the create_matrix() function that adjusts new document-term matrices to contain the same terms as the original training matrix. Although this is a rather quirky work-around, it enables users to save trained models and classify new data easily. Example scripts are available in the /inst/examples/ directory of the RTextTools source code.
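The idea behind originalMatrix can be pictured with a small base-R sketch. This is not the RTextTools implementation; align_terms() and the toy matrices are hypothetical names made up for illustration. The sketch assumes the fix amounts to dropping terms the trained model never saw and zero-filling terms that are missing from the new data.

```r
# Hypothetical helper (not part of RTextTools): reshape a new term-count
# matrix so that its columns match the vocabulary of the training matrix.
align_terms <- function(new_counts, train_terms) {
  # Start from an all-zero matrix with one column per training term.
  aligned <- matrix(0, nrow = nrow(new_counts), ncol = length(train_terms),
                    dimnames = list(rownames(new_counts), train_terms))
  # Copy over the counts for terms that appear in both vocabularies.
  shared <- intersect(colnames(new_counts), train_terms)
  aligned[, shared] <- new_counts[, shared, drop = FALSE]
  aligned
}

# Toy vocabulary from a training matrix (made-up terms).
train_terms <- c("budget", "court", "elect", "war")

# Toy counts for two new documents; "iraq" was never seen in training.
new_counts <- matrix(c(1, 0, 2,
                       0, 3, 1),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("doc1", "doc2"),
                                     c("court", "iraq", "war")))

aligned <- align_terms(new_counts, train_terms)
print(aligned)
```

The unseen term "iraq" is dropped, while "budget" and "elect" appear as all-zero columns, so the new matrix has exactly the features the trained classifier expects.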

Since the package's introduction at the 2011 Comparative Agendas Project Conference in Catania, Italy, the RTextTools team has refined the API and implemented a number of features, including n-gram analysis, text labels, comprehensive analytics, and a streamlined interface. Our plan for the year ahead includes a major overhaul of the nine algorithms to facilitate low-memory ensemble classification. However, this goal involves more than just the RTextTools team; it requires the R machine learning community to strive for efficient supervised learning algorithms. Many R packages do not use compressed sparse matrices, which limits their usefulness for large-N datasets. We therefore aim to promote efficient practices among package developers and to write several implementations of our own to push the capabilities of statistical computing in R.

Thank you for all your feedback and support as we look forward to another productive year in 2012!

RTextTools v1.3.2 Released

12/19/2011

RTextTools was updated to version 1.3.2 today, adding support for n-gram token analysis, a faster maximum entropy algorithm, and numerous bug fixes. The source code has been synced with the Google Code repository, so please feel free to check out a copy and add your own features!

With the core feature set of RTextTools finalized, the next major release (v1.4.0) will focus on optimizing existing code and refining the API for the package. Furthermore, my goal is to add compressed sparse matrix support for all nine algorithms to reduce memory consumption; currently maximum entropy, support vector machines, and glmnet support compressed sparse matrices. 
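To see why sparse storage matters for document-term matrices, here is a rough base-R sketch. It uses the plain coordinate (triplet) layout rather than a true compressed format, and the matrix size and 1% density are arbitrary assumptions chosen for illustration.

```r
set.seed(1)
n_docs <- 1000
n_terms <- 1000
n_nonzero <- 10000  # about 1% of the cells

# Dense storage keeps every cell, including all of the zeros.
dense <- matrix(0, nrow = n_docs, ncol = n_terms)
cells <- sample(n_docs * n_terms, n_nonzero)
dense[cells] <- 1

# Triplet storage keeps only the row index, column index, and value
# of each non-zero cell.
pos <- which(dense != 0, arr.ind = TRUE)
triplets <- list(i = pos[, "row"], j = pos[, "col"], v = dense[pos])

dense_bytes <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(triplets))
cat("dense:", dense_bytes, "bytes; triplets:", sparse_bytes, "bytes\n")
```

Real document-term matrices are usually far sparser than this toy example, so the gap tends to be even larger in practice. An algorithm that converts its input to a dense matrix gives up that saving.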

The Problems with Pairing R + Java

9/3/2011

A core focus of the RTextTools project has been to make the package as accessible and user-friendly as possible. In its early iterations, the package contained dependencies such as RWeka, openNLP, and Snowball which, at least for our developers, did not present any challenges. However, as soon as we distributed the package to our beta testers, problems began cropping up all over the place: Java was not installed, the incorrect architecture of Java was installed, users were running out of heap space, or getting cryptic warning messages during runtime. The decision was made early on to make RTextTools a 100% Java-free installation, no matter what had to be done.

This decision has presented considerable challenges, because many natural language processing tools on CRAN are available exclusively in packages that require rJava: Porter stemmers require package Snowball, the only decent maximum entropy classifier requires openNLP, and n-gram tokenizing requires RWeka. Although there has been some success finding alternatives, such as using Rstem instead of Snowball, other features had to be written from the ground up in C++ to replace their counterparts (see package maxent). Even Rstem had its issues as it was only available on Omegahat, but luckily Duncan Temple Lang was willing to submit it to CRAN.

So why, you might ask, am I so adamantly opposed to mixing R and Java when it makes the developer's life easier? To be clear, I have nothing against Java as a programming language and I've used it extensively to develop Android apps and web apps. However, the integration of R and Java introduces a whole level of complexity that, in my opinion, should be absent from a statistical language such as R.

We started experiencing problems when RWeka and openNLP were first bundled with RTextTools during the alpha stages. These issues were generally installation problems: Java was not installed on the machine, the wrong architecture of Java was installed (32-bit vs 64-bit), or, in the case of Linux users, the JRE was installed but not the SDK. But the problems didn't stop there. We began hearing of "java.lang.OutOfMemoryError: Java heap space" errors from users who were running AdaBoost from the RWeka package on massive training matrices. It turns out that Java defaults to 512MB of heap space, and these users were quickly exceeding the default settings. Additionally, some of the packages, RWeka in particular, were displaying cryptic error messages when a component was not properly installed.
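For readers who do run into the heap space error, the usual workaround in rJava-based setups is to raise the JVM limit before the Java-backed package is loaded. A minimal sketch, assuming a 2 GB limit suits the machine (the value is only an example):

```r
# Request a 2 GB maximum heap for the JVM. The value is an arbitrary example.
# This must run before any rJava-based package (such as RWeka) is loaded,
# because the JVM reads the setting only once, when it starts.
options(java.parameters = "-Xmx2g")

# Confirm that the option is set in this R session.
heap_setting <- getOption("java.parameters")
print(heap_setting)
```

If the JVM is already running in the session, the option has no effect until R is restarted, which is exactly the kind of technicality that trips up users.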

Any software engineer can power through these problems in minutes, but when distributing an R package to thousands of users with varying levels of experience, these problems turn into frustration and wasted man-hours. It finally dawned on me that if I was expecting R users to figure out all these Java technicalities, they might as well use the Weka and openNLP packages directly in Eclipse.

Most readers will think I'm going to absurd lengths to avoid Java, and they are correct. In the end, the point of R, in my opinion, is to do away with all the technicalities of installation, managing memory, and interpreting error messages, and focus on applying the package's functionality to your project. Introducing Java to the equation impedes this goal, which is why I urge R developers to avoid using rJava whenever possible, even if it means taking the harder route.

Getting Started with Latent Dirichlet Allocation using RTextTools + topicmodels

8/30/2011

RTextTools bundles a host of functions for performing supervised learning on your data, but what about other methods like latent Dirichlet allocation? With some help from the topicmodels package, we can get started with LDA in just five steps. The code shown in each step can be executed within R.

Step 1: Install RTextTools + topicmodels
We begin by installing and loading RTextTools and the topicmodels package into our R workspace.

install.packages(c("RTextTools","topicmodels"))
library(RTextTools)
library(topicmodels)

Step 2: Load the Data
In this example, we will be using the bundled NYTimes dataset compiled by Amber E. Boydstun. This dataset contains headlines from front-page NYTimes articles. We will take a random sample of 1000 articles for the purposes of this tutorial.

data(NYTimes)
data <- NYTimes[sample(nrow(NYTimes),size=1000,replace=FALSE),]


Step 3: Create a DocumentTermMatrix
Using the create_matrix() function in RTextTools, we'll create a DocumentTermMatrix for use in the LDA() function from package topicmodels. Our text data consists of the Title and Subject columns of the NYTimes data. We will be removing numbers, stemming words, and weighting the DocumentTermMatrix by term frequency.

matrix <- create_matrix(cbind(as.vector(data$Title),as.vector(data$Subject)), language="english", removeNumbers=TRUE, stemWords=TRUE, weighting=weightTf)
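To make the term-frequency weighting concrete, here is a toy base-R sketch of what a document-term matrix holds. It is not how create_matrix() works internally; the two headlines are made up, and the sketch skips stemming and number removal.

```r
# Toy headlines standing in for the Title text (hypothetical data).
docs <- c(doc1 = "court rules on election", doc2 = "election war and war")

# Split each document into lowercase words.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# The vocabulary becomes the columns of the matrix.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
counts <- vapply(tokens,
                 function(words) tabulate(match(words, vocab),
                                          nbins = length(vocab)),
                 integer(length(vocab)))

# Rows are documents, columns are terms, cells are raw term frequencies.
dtm <- t(counts)
colnames(dtm) <- vocab
print(dtm)
```

Each row is a document and each cell is the number of times that term occurs in it, which is the kind of count data LDA expects as input.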

Step 4: Perform Latent Dirichlet Allocation
First we want to determine the number of topics in our data. In the case of the NYTimes dataset, the data have already been classified as a training set for supervised learning algorithms. Therefore, we can use the unique() function to determine the number of unique topic categories (k) in our data. Next, we use our matrix and this k value to generate the LDA model.

k <- length(unique(data$Topic.Code))
lda <- LDA(matrix, k)

Step 5: View the Results
Last, we can view the results by most likely term per topic, or most likely topic per document.

terms(lda)
Topic 1  "campaign"   Topic 2  "kill"       Topic 3  "elect"      Topic 4  "china"      Topic 5  "govern"     Topic 6  "fight"
Topic 7  "leader"     Topic 8  "york"       Topic 9  "isra"       Topic 10 "win"        Topic 11 "report"     Topic 12 "plan"
Topic 13 "republican" Topic 14 "aid"        Topic 15 "set"        Topic 16 "clinton"    Topic 17 "nation"     Topic 18 "hous"
Topic 19 "iraq"       Topic 20 "bush"       Topic 21 "citi"       Topic 22 "rais"       Topic 23 "overview"   Topic 24 "money"
Topic 25 "basebal"    Topic 26 "court"      Topic 27 "war"

topics(lda)
Output too long to display here. Try it out for yourself to see what it looks like!
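Conceptually, the most likely topic per document is a row-wise argmax over the model's document-topic probabilities. A toy sketch with made-up numbers (gamma here is a hypothetical matrix, not output from the model fitted above):

```r
# Hypothetical document-topic probabilities for three documents and
# three topics (each row sums to 1).
gamma <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6,
                  0.2, 0.5, 0.3),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("doc1", "doc2", "doc3"), NULL))

# Most likely topic per document: the column with the largest probability.
most_likely_topic <- apply(gamma, 1, which.max)
print(most_likely_topic)
```

The result is one topic number per document, which can then be cross-tabulated against the hand-coded Topic.Code column to see how the unsupervised topics line up with the supervised labels.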

RTextTools v1.3 Released + Rstem Now Available on CRAN

8/22/2011

RTextTools v1.3 was released on August 21, and the package binaries are now available on CRAN. This update fixes a major bug with the stemmers, and it is highly recommended you upgrade to the latest version. Other changes include optimization of existing functions and improvements to the documentation.

Additionally, Duncan Temple Lang has graciously released Rstem on CRAN, meaning that the RTextTools package is now fully installable using the install.packages("RTextTools") command within R 2.13+. The repository at install.rtexttools.com will continue to work through the end of September.

RTextTools v1.2 Available on CRAN + useR! 2011 Kaleidoscope Session

8/16/2011

RTextTools v1.2 was released today and we're pleased to announce that the package is finally available on CRAN. Additionally, this update brings minor changes to the API, improvements to the GLMNET algorithm, and more comprehensive documentation. Get started by following our installation instructions!

Additionally, Loren Collingwood will be giving a Kaleidoscope Session Talk today at the useR! 2011 conference in Coventry, UK. Loren is one of the lead developers on the RTextTools project and a Ph.D. candidate at the University of Washington in Seattle.

Thank you to R-bloggers and the machine learning subreddit for all the publicity and feedback we received for the v1.1 launch!

RTextTools v1.1 Released

8/2/2011

A major upgrade of RTextTools has been released, including many optimizations, UI changes, and features based on feedback from the 2011 CAP Conference in Catania. Changes include the addition of a new low-memory algorithm GLMNET, full user documentation, simplification of the user interface, bundled datasets, better analytics for both virgin and non-virgin data classification, and simpler installation.

Give the latest release a spin and let us know what you think! Head over to the Install RTextTools page for installation instructions, and then read the quick-start guide, view the documentation or download the example scripts to get started.

Please note that previous users of RTextTools will need to remove the maxent library prior to installation, as the library has been largely re-written and is now available via CRAN.

RTextTools Improvements Underway

7/12/2011

Since the unveiling of RTextTools at the 2011 CAP Conference in Catania, the development team has been busy working on refinements to the package. These include a number of changes to simplify the API, improve analytics, decrease memory use, and increase functionality. We've added support for another low-memory algorithm (GLMNET) in addition to the two existing low-memory algorithms (SVM and MAXENT). Additionally, we've added n-fold cross validation for all nine algorithms.

Last, but certainly not least, we've fully documented RTextTools and are in the process of publishing the first paper about the package. Although RTextTools v1.1 is not yet released, you can get a sneak-peek of the functionality by glancing at the documentation.

Next Steps: Drafting the R Help Files

6/26/2011

With RTextTools now released and the feedback rolling in, the development team is getting the ball rolling on the help documentation for the library. Currently, you cannot access help files about the library or its functions from within R. However, we do offer a draft of a quick start guide in PDF format under the Documentation section of the website.

Stay tuned for a release in the next few months with a slew of new features and refinements.

Binary Installation Now Available

6/18/2011

The biggest complaint we received about the installation process was that it required Xcode (account required) on Mac OS X and Rtools on Windows. Today we released universal binaries (PPC/i386/x86_64) for Mac OS X 10.5+ as well as binaries (i386/x86_64) for Windows. This addition will significantly reduce the amount of time it takes to install RTextTools.

We're gearing up to unveil RTextTools next Friday in Catania. See you there!

    Developer Blog

    Updates on RTextTools progress, tips, and examples.

    Note: RTextTools is no longer actively maintained -- the software may contain bugs that will not be fixed with newer versions of R.
