RTextTools has largely been used for topic classification in the social sciences. However, recent discussions with researchers at various universities have demonstrated that the package can be applied to a host of problems in the natural sciences as well.

One such application is using text classification to identify breast cancer masses as benign or malignant. Using the original Wisconsin Breast Cancer Database from UC Irvine, we wrote a script that trains eight classifiers on characteristics such as clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. When run on the data, the classifiers achieved up to 96% recall on a randomly sampled training set of 200 patients and a test set of 400 patients.

The source code is available below, and the dataset is automatically downloaded from UC Irvine's servers. If you've found RTextTools useful in your research, we'd love to hear about it!

2/12/2012 13:19:51

Error: could not find function "create_matrix"

2/12/2012 15:20:16

This error indicates that the package is not installed on your system. The command within R is install.packages("RTextTools"). Please make sure you have R 2.14+.
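As a rough check, R reports "could not find function" both when a package is missing and when it is installed but not attached with library(). The sketch below uses only base R; `package_status` is a hypothetical helper name, not part of RTextTools.

```r
# Hypothetical helper: report whether a package is installed and attached.
package_status <- function(pkg) {
  list(
    # TRUE if the package can be found and loaded in this R installation
    installed = requireNamespace(pkg, quietly = TRUE),
    # TRUE if library(pkg) has already been called in this session
    attached = paste0("package:", pkg) %in% search()
  )
}

status <- package_status("RTextTools")
if (!status$installed) {
  message("RTextTools is not installed; try install.packages(\"RTextTools\")")
} else if (!status$attached) {
  message("RTextTools is installed but not attached; try library(RTextTools)")
}
```

If the package turns out to be installed but not attached, calling library(RTextTools) before create_matrix should be enough.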

3/21/2012 17:14:14

I get the same error, i.e. 'Error: could not find function "create_matrix"', but I most definitely do have the RTextTools package installed and I am running R 2.14.2.

I have tried running under both the 32 and 64 bit versions of R.

I also get this error. Might this be the problem?

"Loading required package: Rstem
Error: package ‘Rstem’ could not be loaded
In addition: Warning message:
In library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc) :
there is no package called ‘Rstem’"

3/21/2012 17:23:01

Hi Scott,

Rstem is a dependency of RTextTools, so if it isn't installed RTextTools won't work. It appears that the package was recently removed from CRAN ( http://cran.r-project.org/web/packages/Rstem/index.html ).

You can install Rstem by running the following command in R:

install.packages("Rstem", repos = "http://www.omegahat.org/R", type="source")

Then install RTextTools using install.packages("RTextTools")

3/21/2012 17:30:42

Wow !!! You took about 3 min to respond to my query and you are probably in the middle of the night there (I'm based in Melbourne, Australia).

Tried to install Rstem and got this error. Does it mean I need to use the 32 bit version of R?

* installing *source* package 'Rstem' ...
** libs

*** arch - i386
ERROR: compilation failed for package 'Rstem'
* removing 'C:/Users/sm/Documents/R/win-library/2.14/Rstem'

3/21/2012 17:44:19

Hi Scott,

Only 5:43pm here in California! :)

It's possible you don't have Rtools ( http://www.murdoch-sutherland.com/Rtools/ ) installed on your machine and therefore can't compile the package from source. If you run the install command without the "type" parameter it should install correctly:

install.packages("Rstem", repos = "http://www.omegahat.org/R")


Scott MacLean
3/21/2012 18:03:36

OK that seems to have worked:

"package ‘Rstem’ successfully unpacked and MD5 sums checked"

But after re-installing RTextTools and topicmodels I am still getting this error, even when running R in its 32 bit version (on a 64 bit machine).

"Loading required package: Rstem
Error: package ‘Rstem’ is not installed for 'arch=i386'"

I have a 32 bit machine as well, so I'll try setting up everything there instead.

3/21/2012 18:13:16

That indicates that Rstem was not installed under 32-bit R. Make sure to install all the packages in the same version of R.

Scott MacLean
3/21/2012 18:32:08

I've switched to my 32 bit machine, uninstalled the R 2.10.something that was on there, installed R 2.14.2, then ran the command to install Rstem without the 'type' parameter, and everything seemed fine: 'Rstem' successfully unpacked and MD5 sums checked.

Then I loaded the RTextTools and topicmodels packages.

But then when I run library(RTextTools) I still get the error "package 'Rstem' is not installed for 'arch=i386'".

This is weird. It's a 386 machine. I give up :-(

Won't bother you again, but thanks so much for your help and responses.

3/21/2012 21:06:31

Aaagggghhhh ... finally got it fixed.

But only by ignoring the way R usually installs new packages, and instead just copying the Rstem files manually into the directory that R appears to want them in.

So now I have the 'create_matrix' command being accepted. Now all I have to do is make it work :-)

2/16/2012 15:20:56

How does someone get that data in order to score a new case? For example, where do you get the specific factors mentioned ("uniformity of cell shape/size")? From a biopsy? If so, then what's the point of a model, as the biopsy tells you whether it's cancer or not? If not, then where can you get those factors? Surely not from an ultrasound. Thanks.

2/25/2012 18:17:15

Features are automatically computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Once trained on a subset that a physician has already diagnosed, a computer can classify new cell nuclei as malignant or benign.

Michele Filannino
3/12/2012 10:42:14


Thanks for sharing the script. I imported the csv file in Weka and I found that a RandomForest classifier with 10 iterations performs at 95.7% (F-Score).

It would probably be better to try the tool on a different data set. Regardless, thank you very much for sharing your tests.


3/12/2012 11:00:47

Michele, thank you for sharing! Clearly, there are many ways to approach this problem. I'll look into how the randomForest classifier performs by itself in RTextTools.

Michele Filannino
3/12/2012 11:05:53

Voilà! :)
Slot "algorithm_summary":
2 0.99 0.96 0.97
4 0.93 0.98 0.95

The results seem to be consistent.

Caragea Amy
4/3/2012 14:12:57

Hi Timothy,

Thanks for sharing this script.

I am not quite sure of the reason for "ADD TEXTUAL DESCRIPTORS FOR EACH MASS CHARACTERISTIC FOR THE DOCUMENT-TERM MATRIX" in your script. Would the classification result be affected if the textual descriptors were not added? Thanks.

4/3/2012 14:29:23

This script uses RTextTools, a text classification tool, to classify the cancer masses. It therefore creates a matrix with one row per observation, holding the frequency of each value in that observation. However, a value of 1 for clump thickness is different from a value of 1 for bare nuclei. Unless we label the values, the matrix only holds the frequencies of the numbers 1-10, with no indication of which characteristic each one corresponds to.

We could just as easily add textual descriptors such as the letters A-I, but I used descriptive labels for the purpose of this example.
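To make the labelling idea concrete, here is a minimal base R sketch. It is not the original script: the two patients, their scores, and the label spellings are made up for illustration, and it builds the token strings by hand rather than calling create_matrix.

```r
# Hypothetical two-patient example; the scores are invented for
# illustration and are not rows from the real dataset.
patients <- data.frame(
  clump_thickness = c(1, 5),
  bare_nuclei     = c(1, 10)
)

# Without labels, each patient is just a bag of numbers, so the two 1s
# of the first patient collapse into the same token.
unlabelled <- apply(patients, 1, function(row) paste(row, collapse = " "))

# With labels, every token carries the name of its characteristic.
label_row <- function(row) paste(paste0(names(patients), row), collapse = " ")
labelled <- apply(patients, 1, label_row)

# Token frequencies for the first patient, before and after labelling.
counts_unlabelled <- table(strsplit(unlabelled[1], " ")[[1]])
counts_labelled   <- table(strsplit(labelled[1], " ")[[1]])
```

For the first patient, the unlabelled string yields a single token "1" counted twice, while the labelled string yields two distinct tokens counted once each, which is what lets a classifier tell the characteristics apart.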


7/5/2012 17:30:35

Hi Timothy,

For the part "ADD TEXTUAL DESCRIPTORS FOR EACH MASS CHARACTERISTIC FOR THE DOCUMENT-TERM MATRIX", do you mean to add some artificial text description to test the text classification function?


7/5/2012 17:34:07

Hi Ying,

Correct. This allows the algorithms to distinguish the characteristics when processing the term-document matrix.


12/6/2012 00:35:46

Hi Timothy,

I am trying to use your code but I keep getting the following error when running this command:

matrix <- create_matrix(training_data, language="english", removeNumbers=FALSE, stemWords=FALSE, removePunctuation=FALSE, weighting=weightTfIdf)

Error in append(control, list(tokenize = scan_tokenizer), after = 4) :
object 'scan_tokenizer' not found

I am using R 2.15.2 64bit on Windows 7 64bit.

Any help would be greatly appreciated.


12/6/2012 11:19:33

This error indicates that the "tm" package, a dependency of RTextTools, was not properly installed. Try running install.packages("tm") and run the script again.

12/12/2012 04:12:07

Hi Timothy,

You were right. It turned out that the tm package was not properly installed. A simple re-install (> install.packages("tm", dependencies=TRUE)) did the trick.

Thanks again, and apologies for the delayed reply ;)

P.S Excellent work! Keep it up.

3/1/2013 09:33:32

This code is different from the code that appeared on r-bloggers. That code has a function called create_corpus, which isn't in this code.

6/12/2013 08:55:39


Is there a way to get the confusion matrix for each algorithm? It seems the library doesn't support that.

6/12/2013 11:04:15

It seems I have to do it one by one; the package doesn't provide a direct way to do that.

table(container@testing_codes, results$SVM_LABEL) would give the confusion table for SVM.
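The same table() call can be looped over several prediction columns. The sketch below is base R only and uses made-up vectors in place of container@testing_codes and the results data frame; the column name OTHER_LABEL is hypothetical, and the codes 2 and 4 follow the dataset's benign/malignant coding.

```r
# Hypothetical true codes and per-algorithm predictions
# (2 = benign, 4 = malignant); in the thread these would come from
# container@testing_codes and columns such as results$SVM_LABEL.
true_codes <- c(2, 2, 4, 4, 2, 4)
predictions <- data.frame(
  SVM_LABEL   = c(2, 2, 4, 2, 2, 4),
  OTHER_LABEL = c(2, 4, 4, 4, 2, 4)  # hypothetical second algorithm
)

# Build one confusion table per prediction column. Fixing the factor
# levels keeps both classes in the table even if one is never predicted.
confusion <- lapply(predictions, function(pred) {
  table(actual    = factor(true_codes, levels = c(2, 4)),
        predicted = factor(pred, levels = c(2, 4)))
})

print(confusion$SVM_LABEL)
```

Each element of `confusion` is an ordinary table with actual codes in rows and predicted codes in columns, so off-diagonal cells are the misclassifications.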

11/27/2013 16:56:31

I got the following error when I was installing.

Error : package 'class' was built before R 3.0.0: please re-install it
ERROR: lazy loading failed for package 'e1071'

Alex P
1/22/2014 04:22:44

Can you please help with the following error?
(the code below comes from create_container() documentation)

> library(RTextTools)
> data(NYTimes)
> data <- NYTimes[sample(1:3100,size=100,replace=FALSE),]
> matrix <- create_matrix(cbind(data["Title"],data["Subject"]), language="english",removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
> container <- create_container(matrix,data$Topic.Code,trainSize=1:75, testSize=76:100,virgin=FALSE)
Error in x$nrow : $ operator is invalid for atomic vectors

2/1/2014 14:15:53

I was curious why logistic regression wouldn't be used for this data? Are the features such as clump size, shape, etc not already scalar values?
Perhaps it is my misunderstanding of the numerical data. If the numbers represent classes (type 1, type 2, shape 4, etc), then I would understand why logistic may not work.

Tim Jurka
2/1/2014 18:08:05

Logistic regression is used as one of the algorithms in this example. Maxent, short for maximum entropy, is a multinomial logistic regression classifier. Multinomial logistic regression is performed by doing K-1 binomial logits (where K is the number of labels). If you have a binary dependent variable, this means you'll be doing one (K-1) logistic regression. This post was simply to demonstrate RTextTools as a supervised learning toolkit that can help you explore different algorithms.
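As a small illustration of the K-1 point, the sketch below fits the single binomial logit that a two-label problem requires. It uses base R's glm rather than the maxent implementation in RTextTools, and the scores and labels are invented for the example.

```r
# Hypothetical data: one 1-10 score per patient and a binary label
# (1 = malignant, 0 = benign). The values are invented for illustration.
score     <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
malignant <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)

# With K labels, multinomial logistic regression needs K - 1 binomial
# logits; with K = 2 that is a single ordinary logistic regression.
K <- length(unique(malignant))
n_models <- K - 1

# Fit that one logit with base R (not the RTextTools maxent classifier).
fit <- glm(malignant ~ score, family = binomial)

# Predicted probability of malignancy for a low and a high score.
prob <- predict(fit, newdata = data.frame(score = c(2, 9)), type = "response")
```

Here `n_models` is 1, and `prob` holds the fitted probabilities for the two new scores.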

7/20/2015 12:57:40

Hi Tim, thanks for creating this incredibly useful package. I'm still learning and would like to know if there's a way to extract feature weights from the various algorithms. In other words, in the breast cancer example above, is there a way to figure out how important size is relative to thickness or shape in determining whether a tumor is malignant or benign?

