TextAnalytics: Create WordCloud in R

Text Analytics is such an interesting topic, and I was trying to work out where to start.
The first thing that caught my attention in blogs and tutorials about Text Analytics was the “WordCloud”. Immediately I thought: I want one.
So I started with the Kaggle competition “Sentiment Analysis on Movie Reviews”. I got stuck at the very first step, loading the data files in “.tsv” format.
I eventually fixed the errors and loaded the files. The post regarding the error and the solution is here.
Now, let’s start our Text Analytics adventure and create a WordCloud.

Problem:

Create our first WordCloud using R

Solution:

For text mining, one useful library is “tm”. To stem text we will use the library “SnowballC”, and to create the word cloud we will use the library “wordcloud”.

Now, we will go deeper. But first, let’s install all these libraries.

install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")

Now, let’s divide the whole process into the below steps:

1. Load Data Files:

The dataset is comprised of tab-separated files with phrases from the Rotten Tomatoes dataset.

Each phrase has a PhraseId.

Each sentence has a SentenceId.

  • train.tsv contains the phrases and their associated sentiment labels.
  • test.tsv contains just phrases.

For this post, we will load the training dataset alone and will create a word cloud.

There are 156060 observations and 4 variables. The reviews are stored in the “Phrase” variable. In this post, we will work with this variable alone.
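As a rough sketch of the loading step, the snippet below writes a tiny tab-separated file and reads it back with read.delim(). The three rows are made up for illustration and are not actual Kaggle rows; the column names PhraseId, SentenceId, Phrase and Sentiment are my assumption about the file header. Passing quote = "" is one way to stop stray quote characters inside the reviews from confusing the parser.

```r
# Write a tiny stand-in for train.tsv so the example runs without the download.
# The rows below are invented for illustration only.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "PhraseId\tSentenceId\tPhrase\tSentiment",
  "1\t1\tA fun ride , some of which drags .\t1",
  "2\t1\tA fun ride\t2",
  "3\t2\tIt's a \"must see\" film\t3"
), tsv_path)

# quote = "" disables quote handling, so quotes inside a phrase are kept as text
train <- read.delim(tsv_path, quote = "", stringsAsFactors = FALSE)

str(train)        # number of observations and variables
train$Phrase[1]   # the text column we will work with
```

For the real data, the same call with "train.tsv" in place of tsv_path should be all that changes.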

2. Data Preprocessing:
In this step, we will do a series of operations to simplify the text. The library tm has several functions that help to clean up the text and create a list of individual words and the frequency with which those words were used.

2.1. Create a Corpus
If we read the documentation of the “tm” library, it says “The main structure for managing documents in tm is a so-called Corpus, representing a collection of text documents.”
If you are looking for more information, you can read here.
We need to use the function VectorSource, and the reasons are explained in my previous post. So let’s create a corpus first.
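A minimal sketch of this step is below. The phrases vector is a toy stand-in for the “Phrase” column, and I am using VCorpus() here, which is an assumption on my part; Corpus() can be used in the same way.

```r
library(tm)

# Toy stand-in for train$Phrase (assumed to be a character vector)
phrases <- c("A fun ride , some of which drags .",
             "Good story , bad acting .")

# VectorSource treats each element of the vector as one document
corpus <- VCorpus(VectorSource(phrases))

length(corpus)              # number of documents in the corpus
as.character(corpus[[1]])   # text of the first document
```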

Now, let’s take a look at the first review, so that after each step we can see what the transformation has changed.

2.2. Convert all text To Lowercase

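The corpus is then run through a series of tm_map() calls. The getTransformations() function lists the predefined mappings that can be used with tm_map().

“tolower” will convert the text to lower case. Otherwise, the wordcloud might show capitalized words separately. Note that tolower is a base R function rather than one of tm’s predefined transformations, so in recent versions of tm it has to be wrapped in content_transformer(). In the first review, the first letter “A” is converted to lowercase “a”.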
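Here is a small sketch of the lowercase step on a one-sentence toy corpus (the sentence is made up, not the real first review).

```r
library(tm)

# Predefined transformations that tm_map() accepts directly
getTransformations()

# Toy one-document corpus
corpus <- VCorpus(VectorSource("A fun ride , some of which drags ."))

# tolower is a base R function, so it is wrapped in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))

lowered <- as.character(corpus[[1]])
lowered
```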

2.3. Remove Punctuation

Since we are dealing with text here, we can remove all the punctuation by passing removePunctuation to tm_map(). It removes punctuation marks (periods, commas, hyphens, apostrophes, etc.). In the first review, the comma (,) before “some” and the full stop (.) at the end are removed.
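A quick sketch on a toy sentence shows the effect. Notice that the spaces around the removed marks stay behind; we deal with those in a later step.

```r
library(tm)

# Toy one-document corpus, already lowercased
corpus <- VCorpus(VectorSource("a fun ride , some of which drags ."))

# removePunctuation is one of tm's predefined transformations
corpus <- tm_map(corpus, removePunctuation)

no_punct <- as.character(corpus[[1]])
no_punct
```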

2.4. Remove Stop Words

Stopwords are very common words in the English language such as I, me, my, etc. We can see the full list of stopwords using stopwords("english").

We need to pass removeWords and stopwords("english") to tm_map() to remove these most commonly used words, and below is the result.
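The sketch below removes the stopwords from a made-up sentence. The list returned by stopwords("english") is all lowercase, so this step works best after the text has been converted to lowercase.

```r
library(tm)

# A peek at the stopword list
head(stopwords("english"))

# Toy one-document corpus, already lowercased
corpus <- VCorpus(VectorSource("i watched my favourite movie with the kids"))

# Extra arguments to tm_map() are passed on to removeWords
corpus <- tm_map(corpus, removeWords, stopwords("english"))

cleaned <- as.character(corpus[[1]])
cleaned

# Split on whitespace to see which words survived
remaining <- strsplit(trimws(gsub("\\s+", " ", cleaned)), " ")[[1]]
remaining
```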

2.5. Remove Numbers:

removeNumbers removes digits, i.e. “1” but not the word “one”, in case there are any numbers in the text, since numbers are not that useful in a wordcloud. Our first movie review has no numbers in this example.
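Since the first review has no numbers, here is a sketch with a made-up sentence that does.

```r
library(tm)

# Toy one-document corpus containing digits
corpus <- VCorpus(VectorSource("the 2 leads made 3 sequels"))

# removeNumbers is one of tm's predefined transformations
corpus <- tm_map(corpus, removeNumbers)

no_digits <- as.character(corpus[[1]])
no_digits
```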

2.6. Remove the White Space:

Removing words and punctuation leaves extra white space behind, which we can collapse by passing stripWhitespace to tm_map().
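A sketch of this step on a toy string with the kind of gaps left behind by the earlier steps:

```r
library(tm)

# Toy one-document corpus with runs of spaces
corpus <- VCorpus(VectorSource("fun ride   some    drags"))

# stripWhitespace collapses each run of white space into a single space
corpus <- tm_map(corpus, stripWhitespace)

squeezed <- as.character(corpus[[1]])
squeezed
```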

2.7. stemDocument

stemDocument is supposed to convert all the words to their stem, i.e. if there are two words like “walking” and “walked”, stemDocument is supposed to convert both to the root word “walk”. The “tm” library has a predefined transformation “stemDocument”, and the “SnowballC” library provides the stemmer behind it. But honestly, I do not understand this completely and I am not happy with the result. I need to dig more into this stemDocument concept. But for the time being, let’s create our first word cloud.
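To make the idea concrete, here is a small sketch using the “walk” example from above, first with wordStem() from SnowballC on single words and then with stemDocument on a toy corpus.

```r
library(tm)
library(SnowballC)

# Stemming individual words directly with SnowballC
stems <- wordStem(c("walking", "walked", "walks"), language = "english")
stems

# The same idea applied to a toy corpus through tm
corpus <- VCorpus(VectorSource("walking walked walks"))
corpus <- tm_map(corpus, stemDocument)

stemmed <- as.character(corpus[[1]])
stemmed
```

Keep in mind that a stem is not guaranteed to be a dictionary word, which may be part of why the output on real reviews looks odd.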

3. Create Document Term Matrix:

A document-term matrix is a table containing the frequency of the words. With DocumentTermMatrix(), row names are documents and column names are words; TermDocumentMatrix() gives the transpose, with words as rows and documents as columns. Either function from the “tm” package can be used as follows:
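A sketch with a three-document toy corpus is below. The names tdm, m and freq are my own choices for illustration.

```r
library(tm)

# Toy corpus; in the post this would be the cleaned review corpus
corpus <- VCorpus(VectorSource(c("good movie", "good story", "bad movie")))

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus)

# Total count of each term across all documents, most frequent first
m <- as.matrix(tdm)
freq <- sort(rowSums(m), decreasing = TRUE)

head(freq)
summary(freq)
```

On a corpus as large as the full training set, as.matrix() builds a dense matrix and can need a lot of memory, so it may be worth trying this on a subset first.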

A matrix is created with all the words and the number of times each of them is used in the documents. If we check the summary of the “freq” variable, we can see that each word is used at least once and the maximum frequency is 7787!

4. Create a WordCloud:

Finally, the most awaited step: creating a word cloud with the most frequently used words. The function wordcloud() from the library “wordcloud” is used here.
If we pass just the arguments words and freq to the function wordcloud(), a black and white word cloud is created as below:
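A minimal sketch of that call is below. The frequencies are made-up numbers standing in for the real “freq” vector; they are all 3 or more because wordcloud() by default only plots words with a frequency of at least 3.

```r
library(wordcloud)

# Made-up frequencies standing in for the real freq vector
freq <- c(film = 40, movie = 32, story = 18, good = 12, funny = 9, plot = 6)

# Word placement is random, so fix the seed to get the same picture each run
set.seed(42)

# Only words and freq are passed, so the cloud is drawn in black
wordcloud(words = names(freq), freq = freq)
```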

Now, let’s add a few more arguments and create a colorful word cloud:
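Here is a sketch of one such call. The frequencies are again made up, and the palette and the threshold values are just choices I picked for illustration, not recommended settings.

```r
library(wordcloud)
library(RColorBrewer)

# Made-up frequencies standing in for the real freq vector
freq <- c(film = 40, movie = 32, story = 18, good = 12, funny = 9, plot = 6)

# Eight colours from the "Dark2" palette
pal <- brewer.pal(8, "Dark2")

set.seed(42)
wordcloud(words = names(freq), freq = freq,
          min.freq = 5,          # skip words used fewer than 5 times
          max.words = 100,       # plot at most 100 words
          random.order = FALSE,  # plot words in decreasing frequency
          rot.per = 0.3,         # proportion of words rotated by 90 degrees
          colors = pal)          # colours from least to most frequent
```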

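Some of the important arguments for the function wordcloud() are as below:

  • words: the words to be plotted.
  • freq: the frequency of each word.
  • min.freq: words with a frequency below this value are not plotted.
  • max.words: the maximum number of words to be plotted.
  • random.order: if FALSE, words are plotted in decreasing frequency.
  • rot.per: the proportion of words rotated by 90 degrees.
  • colors: the colours to use, from the least to the most frequent words.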
