RTextTools: a machine learning library for text classification

Classifying Breast Cancer as Benign or Malignant Using RTextTools

2/11/2012

RTextTools has largely been used for topic classification in the social sciences. However, recent discussions with researchers at various universities have demonstrated that the package can be applied to a host of problems in the natural sciences as well.

One such application is using text classification to identify breast cancer masses as benign or malignant. Using the Wisconsin Diagnostic Breast Cancer Dataset from UC Irvine, we wrote a script that trains eight classifiers on characteristics such as clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. When run on the data, the classifiers achieved up to 96% recall accuracy with a randomly sampled training set of 200 patients and a test set of 400 patients.

The source code is available below, and the dataset is automatically downloaded from UC Irvine's servers. If you've found RTextTools useful in your research, we'd love to hear about it!
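As a rough illustration of the evaluation described above, here is a minimal base R sketch of a random 200/400 split and a recall calculation for one class. The helper names (split_patients, recall), the pool size of 600, and the toy labels are all assumptions made for this example; they are not part of RTextTools and do not reproduce the real dataset or its results.

```r
# Minimal sketch (base R only): random train/test split and per-class recall.
# split_patients() and recall() are hypothetical helpers, not RTextTools functions.

split_patients <- function(n_total, n_train, n_test) {
  stopifnot(n_train + n_test <= n_total)
  idx <- sample(n_total, n_train + n_test)      # sample row indices without replacement
  list(train = idx[seq_len(n_train)],
       test  = idx[n_train + seq_len(n_test)])
}

# Recall = correctly predicted positives / all actual positives.
recall <- function(truth, predicted, positive) {
  actual_pos <- truth == positive
  sum(predicted[actual_pos] == positive) / sum(actual_pos)
}

set.seed(1)
parts <- split_patients(n_total = 600, n_train = 200, n_test = 400)  # 600 is a toy pool size

# Toy labels, made up for illustration only.
truth     <- c("malignant", "malignant", "benign", "benign", "malignant")
predicted <- c("malignant", "benign",    "benign", "benign", "malignant")
malignant_recall <- recall(truth, predicted, positive = "malignant")
```

In this toy case, two of the three malignant cases are recovered, so malignant_recall is 2/3.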


RTextTools Short Course Materials

2/10/2012


Attached are some of the materials from the recent short course at UNC. For confidentiality reasons, we are unable to share all of the materials, but this is enough to get someone started: 1. Lecture; 2. Intro to R; 3. NY Times; 4. Congressional Bills. We hope this proves helpful.


Successful Two Day Workshop at UNC-Chapel Hill

2/8/2012


This week the Odum Institute at UNC held a two-day short course on text classification with RTextTools. The workshop, led by Loren Collingwood, covered the basics of content analysis, supervised learning and text classification, an introduction to R, and how to use RTextTools. Participants brought in their own data on the second day, which the instructor helped them classify. Based on feedback, the course was a success. Do not hesitate to contact us if your university, department, or company is interested in such a course.


RTextTools v1.3.5: Saving models, text labels, and a game plan for 2012

1/1/2012


RTextTools v1.3.5 addresses some key concerns that have been raised in recent months. Many of the algorithms used in RTextTools require that any new data presented to a trained classifier contain the same features as the original document-term matrix. Since this rarely (if ever) happens in the real world, I have added an originalMatrix parameter to the create_matrix() function that adjusts new document-term matrices to contain the same terms as the original training matrix. Although this is a rather quirky work-around, it enables users to save trained models and classify new data easily. Example scripts are available in the /inst/examples/ directory of the RTextTools source code.
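To make the idea concrete, here is a minimal base R sketch of aligning a new document-term matrix to the training vocabulary: terms unseen during training are dropped, and training terms missing from the new data are added as zero columns. The helper align_terms() and the toy matrices are assumptions for illustration; this is not the code behind the originalMatrix parameter.

```r
# Minimal sketch (base R only). align_terms() is a hypothetical helper,
# not the RTextTools implementation of create_matrix(originalMatrix=...).
align_terms <- function(new_dtm, train_terms) {
  # Start from an all-zero matrix that has exactly the training columns.
  aligned <- matrix(0, nrow = nrow(new_dtm), ncol = length(train_terms),
                    dimnames = list(rownames(new_dtm), train_terms))
  # Copy counts for terms present in both; unseen terms are dropped.
  shared <- intersect(colnames(new_dtm), train_terms)
  aligned[, shared] <- new_dtm[, shared, drop = FALSE]
  aligned
}

train_terms <- c("budget", "court", "war")

# Toy new data; matrix() fills column by column, so each line below is one term.
new_dtm <- matrix(c(1, 0,    # "war"
                    2, 1,    # "senate" (not in the training vocabulary)
                    0, 3),   # "budget"
                  nrow = 2,
                  dimnames = list(c("doc1", "doc2"), c("war", "senate", "budget")))

aligned <- align_terms(new_dtm, train_terms)
```

The result has the columns budget, court, war in training order: "senate" is gone and "court" is all zeros.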

Since its introduction at the 2011 Comparative Agendas Project Conference in Catania, Italy, the RTextTools team has refined the API and implemented a number of features, including n-gram analysis, text labels, comprehensive analytics, and a streamlined interface. Our plan for the year ahead includes a major overhaul of the nine algorithms to facilitate low-memory ensemble classification. However, this goal involves more than just the RTextTools team; it requires the R machine learning community to strive for efficient supervised learning algorithms. Many R packages do not use compressed sparse matrices and are therefore limited in their applications to large-N datasets. We aim to promote efficient practices among package developers and to write several implementations of our own to push the capabilities of statistical computing in R.

Thank you for all your feedback and support as we look forward to another productive year in 2012!


RTextTools v1.3.2 Released

12/19/2011


RTextTools was updated to version 1.3.2 today, adding support for n-gram token analysis, a faster maximum entropy algorithm, and numerous bug fixes. The source code has been synced with the Google Code repository, so please feel free to check out a copy and add your own features!
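For readers unfamiliar with n-gram tokens, here is a minimal base R sketch that turns a token sequence into word n-grams. The function ngrams() and the sample headline are hypothetical and written only for this illustration; they do not show how RTextTools implements the feature.

```r
# Minimal sketch (base R only): build word n-grams from a token vector.
# ngrams() is a hypothetical helper, not an RTextTools function.
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1)], collapse = " "),
         character(1))
}

# Made-up headline, lowercased and split on single spaces.
tokens  <- strsplit(tolower("Senate passes budget bill"), " ", fixed = TRUE)[[1]]
bigrams <- ngrams(tokens, 2)
```

Here bigrams holds "senate passes", "passes budget", and "budget bill".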

With the core feature set of RTextTools finalized, the next major release (v1.4.0) will focus on optimizing existing code and refining the API for the package. Furthermore, my goal is to add compressed sparse matrix support for all nine algorithms to reduce memory consumption; currently maximum entropy, support vector machines, and glmnet support compressed sparse matrices.
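The memory argument can be seen in a small experiment. The sketch below assumes the Matrix package (a recommended package that ships with standard R installations) purely for illustration; the post does not say which sparse matrix classes RTextTools itself relies on, and the matrix dimensions are made up.

```r
# Sketch: memory footprint of a mostly-zero document-term matrix,
# stored densely versus in compressed sparse column form.
library(Matrix)  # assumption: used here only to illustrate the idea

set.seed(42)
n_docs  <- 1000
n_terms <- 2000
nonzero <- 5000   # only a tiny fraction of the 2,000,000 cells are filled

i <- sample(n_docs,  nonzero, replace = TRUE)   # row (document) indices
j <- sample(n_terms, nonzero, replace = TRUE)   # column (term) indices

# Duplicate (i, j) pairs are summed, which mimics repeated term counts.
sparse_dtm <- sparseMatrix(i = i, j = j, x = rep(1, nonzero),
                           dims = c(n_docs, n_terms))
dense_dtm  <- as.matrix(sparse_dtm)

sparse_bytes <- as.numeric(object.size(sparse_dtm))
dense_bytes  <- as.numeric(object.size(dense_dtm))
```

The dense copy stores every one of the two million cells as a double, while the sparse copy stores only the nonzero entries plus index vectors, so sparse_bytes comes out far smaller than dense_bytes.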


The Problems with Pairing R + Java

9/3/2011


A core focus of the RTextTools project has been to make the package as accessible and user-friendly as possible. In its early iterations, the package contained dependencies such as RWeka, openNLP, and Snowball which, at least for our developers, did not present any challenges. However, as soon as we distributed the package to our beta testers, problems began cropping up all over the place: Java was not installed, the wrong architecture of Java was installed, or users were running out of heap space or getting cryptic warning messages at runtime. The decision was made early on to make RTextTools a 100% Java-free installation, no matter what had to be done.

This decision has presented considerable challenges, because many natural language processing tools on CRAN are available exclusively in packages that require rJava: Porter stemmers require package Snowball, the only decent maximum entropy classifier requires openNLP, and n-gram tokenizing requires RWeka. Although there has been some success finding alternatives, such as using Rstem instead of Snowball, other features had to be written from the ground up in C++ to replace their counterparts (see package maxent). Even Rstem had its issues as it was only available on Omegahat, but luckily Duncan Temple Lang was willing to submit it to CRAN.

So why, you might ask, am I so adamantly opposed to mixing R and Java when it makes the developer's life easier? To be clear, I have nothing against Java as a programming language and I've used it extensively to develop Android apps and web apps. However, the integration of R and Java introduces a whole level of complexity that, in my opinion, should be absent from a statistical language such as R.

We started experiencing problems when RWeka and openNLP were first bundled with RTextTools during the alpha stages. These issues were generally installation problems: Java was not installed on the machine, the wrong architecture of Java was installed (32-bit vs 64-bit), or, in the case of Linux users, the JRE was installed but not the SDK. But the problems didn't stop there. We began hearing of "java.lang.OutOfMemoryError: Java heap space" errors from users who were running AdaBoost from the RWeka package on massive training matrices. It turns out that Java defaults to 512MB of heap space, and these users were quickly exceeding the default settings. Additionally, some of the packages, RWeka in particular, were displaying cryptic error messages when a component was not properly installed.
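For readers who still depend on rJava-based packages, the usual workaround for the heap space error is to raise the JVM heap limit before the JVM starts. The snippet below is a sketch of that workaround; the 2 GB value is an arbitrary illustration and not a recommendation from this post.

```r
# Sketch: raise the JVM heap limit for rJava-based packages such as RWeka.
# This must run BEFORE the JVM is started, i.e. before library(rJava) or
# before loading any package that depends on it in the current R session.
# "-Xmx2g" (2 GB) is an illustrative value only.
options(java.parameters = "-Xmx2g")

# Confirm the option is set for this session.
heap_setting <- getOption("java.parameters")
```

This is exactly the kind of technicality the rest of this post argues R users should not have to manage.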

Any software engineer can power through these problems in minutes, but when distributing an R package to thousands of users with varying levels of experience, these problems turn into frustration and wasted man-hours. It finally dawned on me that if I was expecting R users to figure out all these Java technicalities, they might as well use the Weka and openNLP packages directly in Eclipse.

Most readers will think I'm going to absurd lengths to avoid Java, and you are correct. In the end, the point of R, in my opinion, is to do away with all the technicalities of installation, managing memory, and interpreting error messages, and focus on applying the package's functionality to your project. Introducing Java to the equation impedes this goal, which is why I urge R developers to avoid using rJava whenever possible, even if it means taking the harder route.


Getting Started with Latent Dirichlet Allocation using RTextTools + topicmodels

8/30/2011


RTextTools bundles a host of functions for performing supervised learning on your data, but what about other methods like latent Dirichlet allocation? With some help from the topicmodels package, we can get started with LDA in just five steps. The code lines in each step can be executed within R.

Step 1: Install RTextTools + topicmodels
We begin by installing and loading RTextTools and the topicmodels package into our R workspace.

install.packages(c("RTextTools","topicmodels"))
library(RTextTools)
library(topicmodels)

Step 2: Load the Data
In this example, we will be using the bundled NYTimes dataset compiled by Amber E. Boydstun. This dataset contains headlines from front-page NYTimes articles. We will take a random sample of 1000 articles for the purposes of this tutorial.

data(NYTimes)
data <- NYTimes[sample(nrow(NYTimes), size=1000, replace=FALSE),]

Step 3: Create a DocumentTermMatrix
Using the create_matrix() function in RTextTools, we'll create a DocumentTermMatrix for use in the LDA() function from package topicmodels. Our text data consists of the Title and Subject columns of the NYTimes data. We will be removing numbers, stemming words, and weighting the DocumentTermMatrix by term frequency.

matrix <- create_matrix(cbind(as.vector(data$Title), as.vector(data$Subject)),
                        language="english", removeNumbers=TRUE,
                        stemWords=TRUE, weighting=weightTf)

Step 4: Perform Latent Dirichlet Allocation
First we want to determine the number of topics in our data. In the case of the NYTimes dataset, the data have already been classified as a training set for supervised learning algorithms. Therefore, we can use the unique() function to determine the number of unique topic categories (k) in our data. Next, we use our matrix and this k value to generate the LDA model.

k <- length(unique(data$Topic.Code))
lda <- LDA(matrix, k)

Step 5: View the Results
Last, we can view the results by most likely term per topic, or most likely topic per document.

terms(lda)
Topic 1  "campaign"     Topic 2  "kill"       Topic 3  "elect"
Topic 4  "china"        Topic 5  "govern"     Topic 6  "fight"
Topic 7  "leader"       Topic 8  "york"       Topic 9  "isra"
Topic 10 "win"          Topic 11 "report"     Topic 12 "plan"
Topic 13 "republican"   Topic 14 "aid"        Topic 15 "set"
Topic 16 "clinton"      Topic 17 "nation"     Topic 18 "hous"
Topic 19 "iraq"         Topic 20 "bush"       Topic 21 "citi"
Topic 22 "rais"         Topic 23 "overview"   Topic 24 "money"
Topic 25 "basebal"      Topic 26 "court"      Topic 27 "war"

topics(lda)
Output too long to display here. Try it out for yourself to see what it looks like!
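Conceptually, "most likely term per topic" and "most likely topic per document" are both argmax operations over probability matrices. The base R sketch below shows the idea on two tiny made-up matrices; the names term_probs and doc_probs and all the numbers are assumptions for illustration, not output from the model above.

```r
# Toy illustration (base R only). All probabilities are made up.

# Rows are topics, columns are terms: P(term | topic).
term_probs <- matrix(c(0.7, 0.2, 0.1,
                       0.1, 0.3, 0.6),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("Topic 1", "Topic 2"),
                                     c("elect", "court", "war")))

# Rows are documents, columns are topics: P(topic | document).
doc_probs <- matrix(c(0.9, 0.1,
                      0.4, 0.6,
                      0.2, 0.8),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("doc1", "doc2", "doc3"),
                                    c("Topic 1", "Topic 2")))

# Most likely term for each topic: column name of the row-wise maximum.
top_terms <- colnames(term_probs)[apply(term_probs, 1, which.max)]
names(top_terms) <- rownames(term_probs)

# Most likely topic (as an index) for each document.
top_topics <- apply(doc_probs, 1, which.max)
```

With these toy values, Topic 1 maps to "elect" and Topic 2 to "war", while doc1 is assigned topic 1 and doc2 and doc3 are assigned topic 2.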


RTextTools v1.3 Released + Rstem Now Available on CRAN

8/22/2011


RTextTools v1.3 was released on August 21, and the package binaries are now available on CRAN. This update fixes a major bug with the stemmers, and it is highly recommended you upgrade to the latest version. Other changes include optimization of existing functions and improvements to the documentation.

Additionally, Duncan Temple Lang has graciously released Rstem on CRAN, meaning that the RTextTools package is now fully installable using the install.packages("RTextTools") command within R 2.13+. The repository at install.rtexttools.com will continue to work through the end of September.


RTextTools v1.2 Available on CRAN + useR! 2011 Kaleidoscope Session

8/16/2011


RTextTools v1.2 was released today and we're pleased to announce that the package is finally available on CRAN. Additionally, this update brings minor changes to the API, improvements to the GLMNET algorithm, and more comprehensive documentation. Get started by following our installation instructions!

Loren Collingwood will also be giving a Kaleidoscope Session talk today at the useR! 2011 conference in Coventry, UK. Loren is one of the lead developers on the RTextTools project and a Ph.D. candidate at the University of Washington in Seattle.

Thank you to R-bloggers and the machine learning subreddit for all the publicity and feedback we received for the v1.1 launch!


Amazon Machine Image Created With RTextTools Pre-installed

8/9/2011


We recently created an AMI for Amazon's EC2 cloud computing service. Users with AWS accounts can access the public AMI by searching for ami-817eb8e8. The AMI is based on Drew Conway's excellent AMI, but with R 2.13 loaded and RTextTools and maxent installed.

