Text analytics is interesting but challenging. I started with a simple goal: create a word cloud in R, using the datasets from the Kaggle competition “Sentiment Analysis on Movie Reviews”. But I ran into a problem at every step.
First, I got an error while loading the .tsv files. The details are here. I resolved that issue and finally loaded the required text mining library, “tm”. Below is the code to load the training dataset.
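A minimal sketch of such a load, using only base R. The tiny temporary file stands in for the Kaggle training file, its rows are made up, and the `quote = ""` setting is an assumption (it keeps quote characters inside phrases as literal text), not necessarily the fix described above.

```r
# Stand-in for the Kaggle training file: a small tab-separated file written to
# a temporary path so the sketch runs on its own. The rows are illustrative.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "PhraseId\tSentenceId\tPhrase\tSentiment",
  "1\t1\tA series of escapades\t1",
  "2\t1\tA series\t2",
  "3\t2\tThis quiet film is never dull\t3"
), tsv_path)

# read.delim() reads tab-separated data with a header row.
# quote = "" (assumption) disables quote handling inside fields;
# stringsAsFactors = FALSE keeps Phrase as plain character values.
movies <- read.delim(tsv_path, quote = "", stringsAsFactors = FALSE)

str(movies)
```

With the real data, `tsv_path` would simply point at the downloaded training file.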
Next, I learned that I have to create a Corpus first, because “The main structure for managing documents in tm is a so-called Corpus, representing a collection of text documents”. So I used the code below and got an error.
movies_corpus <- Corpus((movies$Phrase))
The error is as below:
Error: inherits(x, "Source") is not TRUE
I was not yet clear about what a Corpus is, and now I had an error on top of that. Some investigation was clearly needed!
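One way to start that investigation is to ask R directly whether the object being passed is a Source, and to capture the error instead of letting it stop the script. This is a small sketch on a made-up character vector (`phrases` is a hypothetical stand-in for movies$Phrase) and assumes the tm package is installed.

```r
library(tm)

# Hypothetical stand-in for movies$Phrase
phrases <- c("A series of escapades", "A series", "escapades")

# The check named in the error message, run by hand
inherits(phrases, "Source")   # FALSE: a character vector is not a Source

# Capture the error rather than aborting, so the message can be inspected
result <- tryCatch(
  Corpus(phrases),
  error = function(e) e
)

inherits(result, "error")
conditionMessage(result)
```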
What is a Corpus?
A VCorpus is a “volatile” corpus: it is held entirely in memory and is gone once the R object containing it is destroyed.
The syntax for creating such a corpus is as below:
VCorpus(x, readerControl)
x:
a Source object which abstracts the input location. tm provides a set of predefined sources; getSources() lists the available ones, and users can also create their own. VectorSource is the source for a character vector, where each element is treated as one document.
readerControl:
a list with the named components reader and language. tm also provides a set of predefined readers, and getReaders() lists those currently available. Each source has a default reader, which can be overridden.
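The two arguments can be tried out on a small scale. The sketch below assumes the tm package is installed and uses a made-up character vector; spelling out `reader = readPlain` and `language = "en"` is for illustration only, since a VectorSource already comes with a default reader.

```r
library(tm)

# Names of the predefined sources and readers shipped with tm
getSources()
getReaders()

# Made-up phrases used only for illustration
phrases <- c("A series of escapades", "A series", "escapades")

# x: a Source object wrapping the character vector
src <- VectorSource(phrases)

# readerControl: a list naming the reader and the language explicitly
corpus <- VCorpus(
  src,
  readerControl = list(reader = readPlain, language = "en")
)

length(corpus)   # one document per element of the vector
```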
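Now, coming back to my error: it says inherits(x, "Source") is not TRUE, so the problem is with the Source argument. Corpus() expects a Source object, and I was passing a plain character vector. Since my values are characters, let me wrap them in VectorSource and try the code below: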
movies_corpus <- Corpus(VectorSource(movies$Phrase))
movies_corpus
It worked!
So, the above code created a SimpleCorpus of 156060 documents.
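The same fix can be checked on a small made-up vector before running it on the full dataset. This sketch assumes the tm package is installed; the exact class reported by `class()` can depend on the installed tm version, so it is printed rather than assumed.

```r
library(tm)

# Made-up phrases standing in for movies$Phrase
phrases <- c("A series of escapades", "A series", "escapades")

# Wrapping the vector in VectorSource() satisfies the Source check
small_corpus <- Corpus(VectorSource(phrases))

class(small_corpus)          # corpus class; may vary with the tm version
length(small_corpus)         # number of documents, one per phrase
content(small_corpus[[1]])   # text of the first document
```

On the real data, `length(movies_corpus)` should match the number of rows in the movies data frame.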
There is a lot more information about the tm package here.
Thank You!