RTextTools: a machine learning library for text classification

Classifying Breast Cancer as Benign or Malignant Using RTextTools

2/11/2012

RTextTools has largely been used for topic classification in the social sciences. However, recent discussions with researchers at various universities have demonstrated that the package can be applied to a host of problems in the natural sciences as well.

One such application is using text classification to identify breast cancer masses as benign or malignant. Using the Wisconsin Diagnostic Breast Cancer Dataset from UC Irvine, we wrote a script that trains eight classifiers on characteristics such as clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. When run on the data, the classifiers achieved up to 96% recall, using a randomly sampled training set of 200 patients and a test set of 400 patients.
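For readers who want to try the sampling step on their own, here is a minimal base R sketch of a random 200/400 split. It is not the original script: the `patients` data frame, its single `id` column, the row count of 600, and the seed are all placeholders invented for illustration.

```r
# Hypothetical stand-in for the downloaded data: 600 rows with an ID column.
set.seed(2012)                        # assumed seed, only for reproducibility
patients <- data.frame(id = seq_len(600))

# Shuffle the row indices once, then cut them into two disjoint groups.
shuffled  <- sample(nrow(patients))
train_idx <- shuffled[1:200]          # 200 patients for training
test_idx  <- shuffled[201:600]        # 400 patients for testing

train_set <- patients[train_idx, , drop = FALSE]
test_set  <- patients[test_idx, , drop = FALSE]
```

Shuffling once and slicing guarantees that no patient ends up in both sets, which two independent calls to `sample()` would not.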

The source code is available below, and the dataset is automatically downloaded from UC Irvine's servers. If you've found RTextTools useful in your research, we'd love to hear about it!

30 Comments
aaa001
2/12/2012 05:19:51 am

Error: could not find function "create_matrix"

Timothy P. Jurka
2/12/2012 07:20:16 am

This error indicates that the package is not installed on your system. The command within R is install.packages("RTextTools"). Please make sure you have R 2.14+.

Scott MacLean
3/21/2012 10:14:14 am

I get the same error, i.e. 'Error: could not find function "create_matrix"', but I most definitely do have the RTextTools package installed and I am running R 2.14.2.

I have tried running under both the 32- and 64-bit versions of R.

I also get this error. Might this be the problem?

"Loading required package: Rstem
Error: package ‘Rstem’ could not be loaded
In addition: Warning message:
In library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc = lib.loc) :
there is no package called ‘Rstem’"

Timothy P. Jurka
3/21/2012 10:23:01 am

Hi Scott,

Rstem is a dependency of RTextTools, so if it isn't installed RTextTools won't work. It appears that the package was recently removed from CRAN ( http://cran.r-project.org/web/packages/Rstem/index.html ).

You can install Rstem by running the following command in R:

install.packages("Rstem", repos = "http://www.omegahat.org/R", type="source")

Then install RTextTools using install.packages("RTextTools")

Scott MacLean
3/21/2012 10:30:42 am

Wow!!! You took about 3 minutes to respond to my query, and it is probably the middle of the night there (I'm based in Melbourne, Australia).

Tried to install Rstem and got this error. Does it mean I need to use the 32-bit version of R?

* installing *source* package 'Rstem' ...
** libs

*** arch - i386
ERROR: compilation failed for package 'Rstem'
* removing 'C:/Users/sm/Documents/R/win-library/2.14/Rstem'

Timothy P. Jurka
3/21/2012 10:44:19 am

Hi Scott,

Only 5:43pm here in California! :)

It's possible you don't have Rtools ( http://www.murdoch-sutherland.com/Rtools/ ) installed on your machine and therefore can't compile the package from source. If you run the install command without the "type" parameter it should install correctly:

install.packages("Rstem", repos = "http://www.omegahat.org/R")

Best,
Tim

Scott MacLean
3/21/2012 11:03:36 am

OK that seems to have worked:

"package ‘Rstem’ successfully unpacked and MD5 sums checked"

But after re-installing RTextTools and topicmodels I am still getting this error, even when running R in its 32 bit version (on a 64 bit machine).

"Loading required package: Rstem
Error: package ‘Rstem’ is not installed for 'arch=i386'"

I have a 32 bit machine as well, so I'll try setting up everything there instead.

Timothy P. Jurka
3/21/2012 11:13:16 am

That indicates that Rstem was not installed under 32-bit R. Make sure to install all the packages under the same build of R (32-bit or 64-bit) that you use to run them.

Scott MacLean
3/21/2012 11:32:08 am

I've switched to my 32 bit machine, uninstalled R2.10.something that was on there, installed R 2.14.2, then run the command to install Rstem without the 'type' parameter and everything seemed fine - 'Rstem' successfully unpacked and MD5 sums checked.

Then I loaded the RTextTools and topicmodels packages.

But then when I run library(RTextTools) I still get the error "Rstem is not installed for 'arch=i386'".

This is weird. It's a 386 machine. I give up :-(

Won't bother you again, but thanks so much for your help and responses.

Scott
3/21/2012 02:06:31 pm

Aaagggghhhh ... finally got it fixed.

But only by ignoring the way R usually installs new packages and instead copying the Rstem files manually into the directory that R appears to want them in.

So now I have the 'create_matrix' command being accepted. Now all I have to do is make it work :-)

Copula
2/16/2012 07:20:56 am

How does someone get that data in order to score new data? For example, where do you get the specific factors mentioned ("uniformity of cell shape/size")? From a biopsy? If so, then what's the point of a model, as the biopsy tells you whether it's cancer or not? If not, then where can you get those factors? Surely not from an ultrasound. Thanks.

Timothy P. Jurka
2/25/2012 10:17:15 am

Features are automatically computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. Once trained on a subset that a physician has already diagnosed, a computer can classify new cell nuclei as malignant or benign.

Michele Filannino
3/12/2012 03:42:14 am

Hi,

Thanks for sharing the script. I imported the CSV file into Weka and found that a RandomForest classifier with 10 iterations achieves an F-score of 95.7%.

It is probably better to try the tool on a different data set. Regardless, thank you very much for sharing your tests.

Bye,
michele.

Timothy P. Jurka
3/12/2012 04:00:47 am

Michele, thank you for sharing! Clearly, there are many ways to approach this problem. I'll look into how the randomForest classifier performs by itself in RTextTools.

Michele Filannino
3/12/2012 04:05:53 am

Voilà! :)
---------------------------------------------------------------
Slot "algorithm_summary":
CLASS  FORESTS_PRECISION  FORESTS_RECALL  FORESTS_FSCORE
2      0.99               0.96            0.97
4      0.93               0.98            0.95
---------------------------------------------------------------

They seem to be consistent.

Caragea Amy
4/3/2012 07:12:57 am

Hi Timothy,

Thanks for sharing this script.

I am not quite sure of the reason for the "ADD TEXTUAL DESCRIPTORS FOR EACH MASS CHARACTERISTIC FOR THE DOCUMENT-TERM MATRIX" step in your script. Would omitting the textual descriptors affect the classification result? Thanks.

Timothy P. Jurka
4/3/2012 07:29:23 am

This script uses RTextTools, which is a text classification tool, to classify the cancer masses. Therefore, it creates a matrix with one row per observation, recording how often each value occurs in that observation. However, a value of 1 for clump thickness is different from a value of 1 for bare nuclei. Unless we label the values, we only have the frequencies of the numbers 1-10 in the matrix, without knowing what they correspond to.

We could just as easily add textual descriptors such as the letters A-I, but I used descriptive labels for the purpose of this example.
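To make the labeling idea concrete, here is a small base R sketch. It is an illustration only, not the code from the script: the `masses` data frame, its two shortened column names, and the `label_row` helper are hypothetical.

```r
# Hypothetical toy rows; the column names are shortened for illustration.
masses <- data.frame(
  clump_thickness = c(1, 5),
  bare_nuclei     = c(1, 10)
)

# Without labels, each row collapses to bare numbers, so the two 1s in the
# first row become the same token.
unlabeled <- apply(masses, 1, paste, collapse = " ")

# Prefix every value with its column name so each token is tied to a feature.
label_row <- function(row) paste(paste0(names(masses), row), collapse = " ")
labeled <- apply(masses, 1, label_row)

print(unlabeled)
print(labeled)
```

In the labeled version, a 1 for clump thickness and a 1 for bare nuclei become two different terms, so they would occupy separate columns in a document-term matrix.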

Best,
Tim

Ying
7/5/2012 10:30:35 am

Hi Timothy,

For the part "ADD TEXTUAL DESCRIPTORS FOR EACH MASS CHARACTERISTIC FOR THE DOCUMENT-TERM MATRIX", do you mean to add some artificial text description to test the text classification function?

Thanks,
Ying

Timothy P. Jurka
7/5/2012 10:34:07 am

Hi Ying,

Correct. This allows the algorithms to distinguish the characteristics when processing the document-term matrix.

Best,
Tim

OQ
12/5/2012 04:35:46 pm

Hi Timothy,

I am trying to use your code but I keep getting the following error when running this command:

matrix <- create_matrix(training_data, language="english", removeNumbers=FALSE, stemWords=FALSE, removePunctuation=FALSE, weighting=weightTfIdf)

Error in append(control, list(tokenize = scan_tokenizer), after = 4) :
object 'scan_tokenizer' not found

I am using R 2.15.2 64bit on Windows 7 64bit.

Any help would be greatly appreciated.

Thanks,
OQ

Timothy P. Jurka
12/6/2012 03:19:33 am

This error indicates that the "tm" package, a dependency of RTextTools, was not properly installed. Try running install.packages("tm") and then run the script again.

OQ
12/11/2012 08:12:07 pm

Hi Timothy,

You were right. It turned out that the tm package was not properly installed. A simple re-install (> install.packages("tm", dependencies=TRUE)) did the trick.

Thanks again, and apologies for the delayed reply ;)

P.S Excellent work! Keep it up.

MySchizoBuddy
3/1/2013 01:33:32 am

This code is different from the code that appeared on R-bloggers. That code calls a function named create_corpus, which isn't in this code.

dmag
6/12/2013 01:55:39 am

Hi,

Is there a way to get the confusion matrix for each algorithm? It seems the library doesn't support that.

dmag
6/12/2013 04:04:15 am

It seems I have to do it one algorithm at a time, since the package doesn't provide a direct way to do that.

table(container@testing_codes, results$SVM_LABEL) would give the confusion table for SVM.
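The same table() idea can be tried without the package. The sketch below uses made-up label vectors in place of container@testing_codes and results$SVM_LABEL; the `actual` and `predicted` values are hypothetical and only reuse the class codes 2 and 4 that appear in the summary posted earlier in this thread.

```r
# Hypothetical true codes and predicted labels for eight test cases.
actual    <- factor(c(2, 2, 2, 4, 4, 4, 2, 4), levels = c(2, 4))
predicted <- factor(c(2, 2, 4, 4, 4, 2, 2, 4), levels = c(2, 4))

# Rows are the true codes, columns are the predicted labels.
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Per-class precision and recall read straight off the table.
precision <- diag(confusion) / colSums(confusion)
recall    <- diag(confusion) / rowSums(confusion)
print(round(rbind(precision, recall), 2))
```

Fixing the factor levels up front keeps the table square even if a classifier never predicts one of the classes.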

okugami
11/27/2013 08:56:31 am

I got the following error while installing.

Error : package 'class' was built before R 3.0.0: please re-install it
ERROR: lazy loading failed for package 'e1071'

Alex P
1/21/2014 08:22:44 pm

Hello
Can you please help with the following error?
(the code below comes from create_container() documentation)

> library(RTextTools)
> data(NYTimes)
> data <- NYTimes[sample(1:3100,size=100,replace=FALSE),]
> matrix <- create_matrix(cbind(data["Title"],data["Subject"]), language="english",removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
> container <- create_container(matrix,data$Topic.Code,trainSize=1:75, testSize=76:100,virgin=FALSE)
Error in x$nrow : $ operator is invalid for atomic vectors

Jeff
2/1/2014 06:15:53 am

I was curious why logistic regression wasn't used for this data. Are features such as clump size and shape not already scalar values?
Perhaps I am misunderstanding the numerical data. If the numbers represent classes (type 1, type 2, shape 4, etc.), then I would understand why logistic regression may not work.

Tim Jurka
2/1/2014 10:08:05 am

Logistic regression is used as one of the algorithms in this example. Maxent, short for maximum entropy, is a multinomial logistic regression classifier. Multinomial logistic regression is performed by doing K-1 binomial logits (where K is the number of labels). If you have a binary dependent variable, K is 2, so you'll be doing a single (K-1 = 1) logistic regression. This post was simply to demonstrate RTextTools as a supervised learning toolkit that can help you explore different algorithms.
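The two-label case can be checked numerically in base R. This is only a sketch of the underlying arithmetic, not of how Maxent is implemented: the `scores` values are hypothetical linear scores for a single observation.

```r
# Hypothetical linear scores for one observation under each label.
scores <- c(benign = 0.3, malignant = 1.7)

# Multinomial form: softmax over all K = 2 labels.
softmax <- function(z) exp(z) / sum(exp(z))
p_multinomial <- softmax(scores)[["malignant"]]

# Binomial form: one logit on the difference between the two scores.
p_binomial <- plogis(scores[["malignant"]] - scores[["benign"]])

print(c(multinomial = p_multinomial, binomial = p_binomial))
```

Both expressions give the same probability, because with two labels only the difference between the scores matters.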

Sonya
7/20/2015 05:57:40 am

Hi Tim, thanks for creating this incredibly useful package. I'm still learning and would like to know if there's a way to extract feature weights from the various algorithms. In other words, in the breast cancer example above, is there a way to figure out how important size is relative to thickness or shape in determining whether a tumor is malignant or benign?

    Developer Blog

    Updates on RTextTools progress, tips, and examples.

    Note: RTextTools is no longer actively maintained -- the software may contain bugs that will not be fixed with newer versions of R.
