RTextTools: a machine learning library for text classification

Reduce Memory Use for Large Datasets

6/1/2011

One key limiting factor for automated text classification is memory consumption. As you accumulate more news articles, bills, and legal opinions, the term-document matrices used to represent the data grow quickly. RTextTools provides two algorithms, support vector machines and maximum entropy, that can handle large datasets with very little memory. Luckily, these two algorithms tend to be the most accurate as well. However, some applications require an ensemble of more than two algorithms to get an accurate scoring of topic codes.

First, you can try reducing the number of terms in your matrix. The create_matrix() function provides many options that help remove noise from your dataset. The defaults remove stopwords, remove punctuation, convert words to lowercase, and strip whitespace, but there are other helpful tools as well. You can set a minimum word length (e.g. minWordLength=5), select the N most frequent terms from each document (e.g. selectFreqTerms=25), set a minimum word frequency per document (e.g. minDocFreq=3), and remove terms whose sparsity exceeds N, keeping only terms with a sparse factor of less than N (e.g. removeSparseTerms=0.9998).
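To see how much these options shrink a matrix, a minimal sketch along the following lines may help. It assumes RTextTools is installed and that your version of create_matrix() accepts the arguments named above. The three sample documents are made up for illustration, and selectFreqTerms is left out because it may not be available in every release.

```r
library(RTextTools)

# Hypothetical sample documents, invented for this illustration.
docs <- c(
  "The committee approved the appropriations bill after a lengthy debate.",
  "The court issued a legal opinion on the appropriations dispute.",
  "News articles covered the debate and the court opinion in detail."
)

# Matrix built with the default preprocessing options.
full <- create_matrix(docs)

# Matrix built with stricter filters: longer minimum word length and
# removal of very sparse terms (argument names taken from the post above).
small <- create_matrix(docs,
                       minWordLength = 5,
                       removeSparseTerms = 0.9998)

# Compare dimensions (documents x terms) and in-memory size.
print(dim(full))
print(dim(small))
print(object.size(full), units = "Kb")
print(object.size(small), units = "Kb")
```

On a real corpus, comparing the two sets of dimensions before training is a quick way to judge whether a given filter saves enough memory to be worth the information it discards.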

These options can help you reduce the size of your document matrix, but they can also remove information that may be valuable to the learning algorithms. If you just need the resources to process a huge dataset, and nothing above helps, you should look into setting up an Amazon EC2 instance with RStudio installed. We plan to create a simple way of doing this in the near future, but you'll have to brave the stormy waters for now. Be warned, this option is for experienced users only!
    ​Note: RTextTools is no longer actively maintained -- the software may contain bugs that will not be fixed with newer versions of R.

    By Author

    All
    Loren Collingwood
    Timothy P. Jurka

    By Date

    February 2012
    January 2012
    December 2011
    September 2011
    August 2011
    July 2011
    June 2011
    May 2011
    April 2011
    March 2011

    RSS Feed

Powered by Create your own unique website with customizable templates.