RTextTools: a machine learning library for text classification

Getting Started with Latent Dirichlet Allocation using RTextTools + topicmodels

8/30/2011

RTextTools bundles a host of functions for performing supervised learning on your data, but what about unsupervised methods like latent Dirichlet allocation (LDA)? With some help from the topicmodels package, we can get started with LDA in just five steps. The code shown under each step can be run directly in R.

Step 1: Install RTextTools + topicmodels
We begin by installing and loading RTextTools and the topicmodels package into our R workspace.

install.packages(c("RTextTools","topicmodels"))
library(RTextTools)
library(topicmodels)

Step 2: Load the Data
In this example, we will be using the bundled NYTimes dataset compiled by Amber E. Boydstun. This dataset contains headlines from front-page NYTimes articles. We will take a random sample of 1000 articles for the purposes of this tutorial.

data(NYTimes)
data <- NYTimes[sample(1:3100,size=1000,replace=FALSE),]
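If you want the same 1000 headlines every time you rerun the tutorial, one option is to fix the random seed before sampling. The sketch below is a minimal illustration in base R, not part of the original script: headlines is a made-up stand-in for the NYTimes data frame (so it runs without RTextTools), and the seed value 42 is arbitrary.

```r
# Hypothetical stand-in for the NYTimes data frame, so the sketch runs on its own
headlines <- data.frame(Title = paste("headline", 1:3100), stringsAsFactors = FALSE)

# Fixing the seed makes the "random" sample repeatable across runs
set.seed(42)
rows <- sample(nrow(headlines), size = 1000, replace = FALSE)
sampled <- headlines[rows, , drop = FALSE]

# Resetting the same seed reproduces exactly the same row indices
set.seed(42)
rows_again <- sample(nrow(headlines), size = 1000, replace = FALSE)
print(identical(rows, rows_again))
```

With the real data, the same idea would be a set.seed() call followed by NYTimes[sample(nrow(NYTimes), size=1000), ], which also avoids hard-coding the number of rows.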


Step 3: Create a DocumentTermMatrix
Using the create_matrix() function in RTextTools, we'll create a DocumentTermMatrix for use in the LDA() function from package topicmodels. Our text data consists of the Title and Subject columns of the NYTimes data. We will be removing numbers, stemming words, and weighting the DocumentTermMatrix by term frequency.

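# weightTf comes from the tm package; the tm:: prefix avoids "object 'weightTf' not found" when tm is not attached
matrix <- create_matrix(cbind(as.vector(data$Title),as.vector(data$Subject)), language="english", removeNumbers=TRUE, stemWords=TRUE, weighting=tm::weightTf)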
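To make "weighting by term frequency" concrete, here is a rough base R sketch of what a term-frequency document-term matrix contains: one row per document, one column per term, and raw counts in the cells. The three example headlines are invented, and the sketch skips the stemming and other preprocessing that create_matrix() performs, so treat it as an illustration of the data structure only.

```r
# Three made-up headlines standing in for real documents
docs <- c("senate passes budget plan",
          "budget talks stall in senate",
          "team wins title")

# Lowercase and split on anything that is not a letter
tokens <- strsplit(tolower(docs), "[^a-z]+")

# The vocabulary is the sorted set of distinct terms
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
dtm <- t(vapply(tokens,
                function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
                integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

print(dtm)
```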

Step 4: Perform Latent Dirichlet Allocation
First we want to determine the number of topics in our data. In the case of the NYTimes dataset, the headlines have already been labeled with topic codes so that they can serve as a training set for supervised learning algorithms. We can therefore use the unique() function to count the distinct topic codes in our sample and use that count as k. Next, we pass our matrix and this k value to LDA() to fit the model.

k <- length(unique(data$Topic.Code))
lda <- LDA(matrix, k)
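Several readers report that LDA() stops with "Each row of the input matrix needs to contain at least one non-zero entry". One likely cause is a document that has no terms left after preprocessing, and a common workaround is to drop such rows before fitting. The sketch below shows the idea on a small, made-up dense matrix in base R; counts and its values are hypothetical.

```r
# Hypothetical term counts; doc2 has no terms left after preprocessing
counts <- matrix(
  c(2, 0, 1,
    0, 0, 0,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("budget", "senat", "war"))
)

# Keep only documents with at least one non-zero entry
keep <- rowSums(counts) > 0
counts_nonempty <- counts[keep, , drop = FALSE]

print(rownames(counts_nonempty))
```

For the sparse DocumentTermMatrix built in Step 3, slam::row_sums(matrix) > 0 should give the same kind of mask without converting to a dense matrix (this assumes the slam package is installed). Applying the same mask to data keeps the documents and the matrix rows aligned.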

Step 5: View the Results
Last, we can view the results by most likely term per topic, or most likely topic per document.

terms(lda)
Topic 1  "campaign"   Topic 2  "kill"       Topic 3  "elect"      Topic 4  "china"      Topic 5  "govern"     Topic 6  "fight"
Topic 7  "leader"     Topic 8  "york"       Topic 9  "isra"       Topic 10 "win"        Topic 11 "report"     Topic 12 "plan"
Topic 13 "republican" Topic 14 "aid"        Topic 15 "set"        Topic 16 "clinton"    Topic 17 "nation"     Topic 18 "hous"
Topic 19 "iraq"       Topic 20 "bush"       Topic 21 "citi"       Topic 22 "rais"       Topic 23 "overview"   Topic 24 "money"
Topic 25 "basebal"    Topic 26 "court"      Topic 27 "war"

topics(lda)
Output too long to display here. Try it out for yourself to see what it looks like!
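If you want more than the single top term per topic, or need to trace topic assignments back to named documents, the topicmodels functions terms(lda, 5) and posterior(lda) are the places to look. The base R sketch below only illustrates the underlying idea of picking the largest probabilities. Both matrices are invented: they are merely shaped like posterior(lda)$topics and posterior(lda)$terms, and the document names are hypothetical.

```r
# Hypothetical document-topic probabilities (rows = documents, columns = topics)
doc_topics <- matrix(
  c(0.70, 0.20, 0.10,
    0.10, 0.10, 0.80),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("headline_a", "headline_b"), c("1", "2", "3"))
)

# Hypothetical topic-term probabilities (rows = topics, columns = terms)
topic_terms <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.10, 0.10, 0.60, 0.20,
    0.05, 0.15, 0.10, 0.70),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"), c("campaign", "elect", "war", "court"))
)

# Most likely topic per document; the row names carry the document names along
most_likely_topic <- apply(doc_topics, 1, which.max)

# Two most probable terms per topic (one column per topic)
top_terms <- apply(topic_terms, 1,
                   function(p) names(sort(p, decreasing = TRUE))[1:2])

print(most_likely_topic)
print(top_terms)
```

Because the row names survive the calculation, the result can be sorted or filtered by document name rather than by row number.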
21 Comments
Bob Muenchen
10/17/2011 12:13:11 am

Thanks for the helpful example. I've only used SVD and had been wanting to try LDA.

Cheers,
Bob Muenchen

Eric Brown
2/9/2012 11:46:43 pm

I tried to run this code, but when I got to the LDA command:

lda <- LDA(matrix, k)

Error in LDA(matrix, k) :
Each row of the input matrix needs to contain at least one non-zero entry

Brian
3/27/2012 01:13:15 am

I'm having the same problem. Any thoughts?

Timothy P. Jurka
3/27/2012 04:47:48 am

Hi Brian,

The topicmodels package has changed since this demo was written. I will need to go back and see what's going on.

Best,
Tim

amelia
3/5/2013 08:32:31 am

Here is a way to remove those empty rows (though this may not actually address the underlying issue):

http://stackoverflow.com/questions/13944252/remove-empty-documents-from-documenttermmatrix-in-r-topicmodels

Tim Jurka
2/10/2012 10:09:47 am

Hi Eric, thanks for pointing this out. It looks like package topicmodels has been updated since our script was originally written. I'll look into how to fix it.

Ranga
4/21/2012 08:42:36 am

Works perfectly for me. Thanks a lot to Tim and all the others on the development team.

Sarah
8/2/2012 04:02:06 am

Hi - I've started using this tool and have a metadata question. The DTM preserves the names of the articles I put in, but in the LDA process the articles simply become row numbers. Do you know how to reference back to find out which article is which row number? I need to sort the data within the topics and can't do it without their names...

Dan
9/27/2012 07:20:11 pm

First of all, great work, guys, and a big thank you!

I'd just like to know if there is any news on the issue reported by Eric?

Error in LDA(matrix, k) :
Each row of the input matrix needs to contain at least one non-zero entry

Thanks a lot!

Timothy P. Jurka
12/1/2012 05:40:21 pm

Hi Dan,

The problem should be fixed now.

Best,
Tim

Yulia
3/28/2015 01:24:06 pm

Hi Tim,
I tried to run your code but there was an error when you try to create a matrix:

Error in create_matrix(cbind(as.vector(data$Title), as.vector(data$Subject)), :
object 'weightTf' not found

Any ideas?
Thanks,
Yulia

Antony Stevens
10/8/2012 06:07:13 am

Many thanks for your helpful example.

Big Mike
12/14/2012 07:14:46 am

Are the topics as returned by 'terms(lda)' represented by just the most frequent term? Is there a way to find all the terms in the topic?

Big Mike
12/14/2012 07:17:41 am

Sorry, that was an "RTFM" question. Never mind.

Ben
12/16/2012 10:31:39 am

I am now getting the "Each row of the input matrix needs to contain at least one non-zero entry" error. Did the fix work for the rest of you guys?

Frederik Hjorth
4/8/2013 05:37:25 pm

This looks very cool, but it seems the RTextTools package has been taken off CRAN, so I can't load it. Do you know what the issue is?

Tom
7/19/2013 02:31:01 am

Great tutorial thanks. My one question is whether the weightings of words are case sensitive? For example, would 'Hello' have one weight and 'hello' another? Or does the weighting calculation just see these as the same word?

Isaac
9/27/2013 11:58:59 am

The code and explanation were spot on. Thanks!

jenn
10/27/2013 03:21:12 pm

Why do you still need to run LDA if the topics are already classified?

Tom R
3/5/2015 03:48:30 am

Thanks a lot for posting this. I was unaware of RTextTools, it makes for much easier creation of TDM matrices than qdap or tm (which is what I was using). This worked perfectly for me.

Himabindu Boddupalli
6/16/2017 08:32:27 am

Where will I get the NYTimes dataset? I have found an NYT dataset but its composition is different.


Note: RTextTools is no longer actively maintained; the software may contain bugs that will not be fixed with newer versions of R.