A core focus of the RTextTools project has been to make the package as accessible and user-friendly as possible. In its early iterations, the package depended on RWeka, openNLP, and Snowball, which, at least for our developers, presented no challenges. As soon as we distributed the package to our beta testers, however, problems began cropping up all over the place: Java was not installed, the wrong architecture of Java was installed, users were running out of heap space, or they were getting cryptic warning messages at runtime. We decided early on to make RTextTools a 100% Java-free installation, no matter what had to be done.
This decision has presented considerable challenges, because many natural language processing tools on CRAN are available exclusively in packages that require rJava: Porter stemming requires the Snowball package, the only decent maximum entropy classifier lives in openNLP, and n-gram tokenization requires RWeka. We had some success finding alternatives, such as using Rstem in place of Snowball, but other features had to be written from the ground up in C++ to replace their counterparts (see the maxent package). Even Rstem had its issues, as it was available only on Omegahat, but luckily Duncan Temple Lang was willing to submit it to CRAN.
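For anyone attempting the same substitutions, here is a minimal sketch of the Java-free route. The functions shown (Rstem's wordStem, maxent's maxent and predict) are real, but treat the exact argument names as assumptions that may vary across versions, and note that dtm, codes, and dtm_test below are placeholders:

    ## Porter stemming via Rstem (C code, no JVM) instead of Snowball.
    library(Rstem)
    words <- c("running", "stemming", "classified")
    stems <- wordStem(words, language = "english")

    ## And maxent's C++ maximum entropy classifier in place of openNLP's,
    ## trained on a document-term matrix 'dtm' with a label vector 'codes'
    ## (both hypothetical objects here):
    # library(maxent)
    # model <- maxent(dtm, codes)
    # results <- predict(model, dtm_test)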
So why, you might ask, am I so adamantly opposed to mixing R and Java when it makes the developer's life easier? To be clear, I have nothing against Java as a programming language, and I've used it extensively to develop Android and web apps. However, integrating R and Java introduces a whole new layer of complexity that, in my opinion, should be absent from a statistical language such as R.
We started experiencing problems as soon as RWeka and openNLP were bundled with RTextTools during the alpha stages. These were generally installation problems: Java was not installed on the machine, the wrong architecture of Java was installed (32-bit vs. 64-bit), or, in the case of Linux users, the JRE was installed but not the SDK. The problems didn't stop there. We began hearing of "java.lang.OutOfMemoryError: Java heap space" errors from users who were running AdaBoost from the RWeka package on massive training matrices. It turns out that Java defaults to 512MB of heap space, and these users were quickly exceeding that limit. On top of that, some of the packages, RWeka in particular, displayed cryptic error messages when a component was not properly installed.
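For what it's worth, the standard workaround for the heap errors is to raise the JVM's memory ceiling before any rJava-dependent package loads; a minimal sketch, where training_data is a hypothetical data frame:

    ## rJava reads java.parameters only when the JVM first starts, so set
    ## this *before* loading any rJava-dependent package (RWeka, openNLP, ...).
    options(java.parameters = "-Xmx2g")  # raise the heap beyond the default

    library(RWeka)
    ## Weka's boosting implementation, the one behind the reports above;
    ## 'training_data' is a placeholder with a 'class' column:
    # fit <- AdaBoostM1(class ~ ., data = training_data)

Of course, expecting every user to know this incantation is exactly the kind of burden we wanted to eliminate.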
Any software engineer can power through these problems in minutes, but when you distribute an R package to thousands of users with varying levels of experience, they turn into frustration and wasted man-hours. It finally dawned on me that if I were expecting R users to sort out all these Java technicalities, they might as well skip R entirely and use Weka and openNLP directly in Eclipse.
Most readers will think I'm going to absurd lengths to avoid Java, and you are correct. In the end, the point of R, in my opinion, is to do away with the technicalities of installation, memory management, and error interpretation, and to let you focus on applying a package's functionality to your project. Introducing Java into the equation impedes that goal, which is why I urge R developers to avoid rJava whenever possible, even if it means taking the harder route.