RTextTools bundles a host of functions for performing supervised learning on your data, but what about other methods like latent Dirichlet allocation? With some help from the topicmodels package, we can get started with LDA in just five steps. The code lines shown in each step can be executed within R.

Step 1: Install RTextTools + topicmodels
We begin by installing and loading RTextTools and the topicmodels package into our R workspace.
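
The commands for this step might look like the following minimal sketch. It assumes that both packages can be installed from your configured CRAN mirror; package availability changes over time, so treat that as an assumption rather than a guarantee.

```r
# Install both packages (only needed once per R installation).
# Assumes they are available from the configured CRAN mirror.
install.packages(c("RTextTools", "topicmodels"))

# Attach them to the current session.
library(RTextTools)
library(topicmodels)
```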


Step 2: Load the Data
In this example, we will be using the bundled NYTimes dataset compiled by Amber E. Boydstun. This dataset contains headlines from front-page NYTimes articles. We will take a random sample of 1000 articles for the purposes of this tutorial.

data(NYTimes)  # load the bundled dataset into the workspace
data <- NYTimes[sample(1:3100,size=1000,replace=FALSE),]
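
Because sample() draws different rows on every run, the topics you obtain later will differ from run to run. If you want a repeatable sample, a variant along these lines may help. The seed value and the `rows` variable are illustrative choices, not part of the original tutorial.

```r
library(RTextTools)

data(NYTimes)    # load the bundled dataset

set.seed(1234)   # arbitrary seed, chosen only for illustration

# Draw 1000 distinct row indices from however many rows the dataset has.
rows <- sample(seq_len(nrow(NYTimes)), size = 1000, replace = FALSE)
data <- NYTimes[rows, ]
```

Using nrow(NYTimes) instead of a hard-coded upper bound keeps the sample valid if the bundled dataset changes size.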

Step 3: Create a DocumentTermMatrix
Using the create_matrix() function in RTextTools, we'll create a DocumentTermMatrix for use in the LDA() function from package topicmodels. Our text data consists of the Title and Subject columns of the NYTimes data. We will be removing numbers, stemming words, and weighting the DocumentTermMatrix by term frequency.

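# tm::weightTf weights by raw term frequency; the tm:: prefix works even when tm is not attached
matrix <- create_matrix(cbind(as.vector(data$Title), as.vector(data$Subject)), language="english", removeNumbers=TRUE, stemWords=TRUE, weighting=tm::weightTf)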

Step 4: Perform Latent Dirichlet Allocation
First we want to determine the number of topics in our data. In the case of the NYTimes dataset, the data have already been classified as a training set for supervised learning algorithms. Therefore, we can use the unique() function to determine the number of unique topic categories (k) in our data. Next, we use our matrix and this k value to generate the LDA model.

k <- length(unique(data$Topic.Code))
lda <- LDA(matrix, k)
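
Two practical caveats apply to the lines above. LDA() uses random initialisation, so repeated runs can give different topics. It also stops with an error if any document has no terms left after preprocessing, which can happen with very short headlines. The following is a hedged variant of the two lines above, assuming the `matrix` and `data` objects from the earlier steps exist and that the slam package (a dependency of topicmodels) is installed. The names `keep` and `dtm` and the seed value are illustrative.

```r
library(topicmodels)

# Assumes `matrix` (Step 3) and `data` (Step 2) already exist in the session.
# Keep only documents that still contain at least one term after preprocessing.
keep <- slam::row_sums(matrix) > 0
dtm  <- matrix[keep, ]

# Count the topic categories among the documents that were kept.
k <- length(unique(data$Topic.Code[keep]))

# The seed value is arbitrary; fixing it makes the default VEM fit repeatable.
lda <- LDA(dtm, k, control = list(seed = 1234))
```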

Step 5: View the Results
Last, we can view the results by most likely term per topic, or most likely topic per document.
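
Both views are available through accessor functions in topicmodels. A minimal sketch, assuming `lda` is the fitted model from Step 4 (the names `top_terms` and `doc_topics` are illustrative):

```r
library(topicmodels)

# Assumes `lda` is the fitted model from Step 4.
top_terms  <- terms(lda)    # most likely term for each topic
doc_topics <- topics(lda)   # most likely topic for each document

top_terms
head(doc_topics)            # show the first few documents only
```

Passing a second argument, for example terms(lda, 5), returns the five most likely terms for each topic instead of just one.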

Topic 1  "campaign"    Topic 2  "kill"        Topic 3  "elect"       Topic 4  "china"       Topic 5  "govern"      Topic 6  "fight"
Topic 7  "leader"      Topic 8  "york"        Topic 9  "isra"        Topic 10 "win"         Topic 11 "report"      Topic 12 "plan"
Topic 13 "republican"  Topic 14 "aid"         Topic 15 "set"         Topic 16 "clinton"     Topic 17 "nation"      Topic 18 "hous"
Topic 19 "iraq"        Topic 20 "bush"        Topic 21 "citi"        Topic 22 "rais"        Topic 23 "overview"    Topic 24 "money"
Topic 25 "basebal"     Topic 26 "court"       Topic 27 "war"

The list of most likely topics per document is too long to display here. Try it out for yourself to see what it looks like!

RTextTools v1.3 was released on August 21, and the package binaries are now available on CRAN. This update fixes a major bug with the stemmers, and we highly recommend upgrading to the latest version. Other changes include optimization of existing functions and improvements to the documentation.

Additionally, Duncan Temple Lang has graciously released Rstem on CRAN, meaning that the RTextTools package is now fully installable using the install.packages("RTextTools") command within R 2.13+. The repository at install.rtexttools.com will continue to work through the end of September.

RTextTools v1.2 was released today and we're pleased to announce that the package is finally available on CRAN. Additionally, this update brings minor changes to the API, improvements to the GLMNET algorithm, and more comprehensive documentation. Get started by following our installation instructions!

Additionally, Loren Collingwood will be giving a Kaleidoscope Session Talk today at the useR! 2011 conference in Coventry, UK. Loren is one of the lead developers on the RTextTools project and a Ph.D. candidate at the University of Washington in Seattle.

Thank you to R-bloggers and the machine learning subreddit for all the publicity and feedback we received for the v1.1 launch!

We recently created an AMI for Amazon's EC2 cloud computing service. Users with AWS accounts can access the public AMI by searching for ami-817eb8e8. The AMI is based on Drew Conway's excellent AMI, but with R 2.13 loaded and RTextTools and maxent installed.

A major upgrade of RTextTools has been released, including many optimizations, UI changes, and features based on feedback from the 2011 CAP Conference in Catania. Changes include the addition of a new low-memory algorithm GLMNET, full user documentation, simplification of the user interface, bundled datasets, better analytics for both virgin and non-virgin data classification, and simpler installation.

Give the latest release a spin and let us know what you think! Head over to the Install RTextTools page for installation instructions, and then read the quick-start guide, view the documentation, or download the example scripts to get started.

Please note that previous users of RTextTools will need to remove the maxent library prior to installation, as the library has been largely re-written and is now available via CRAN.