Problem:
I was trying my hand at the Kaggle competition “Sentiment Analysis on Movie Reviews” and got stuck at the very first step. The data files are in “.tsv” format, which is tab-delimited. I tried “read.table” to load the files and got the error below:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 40 elements
The error says line 1 did not have 40 elements. Well, that is true!
Line 1 is the header and has only four fields: “PhraseId”, “SentenceId”, “Phrase” and “Sentiment”.
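The same kind of error can be reproduced on a tiny file. The sketch below writes a made-up four-line stand-in for train.tsv to a temporary file (the rows are illustrative, not the actual Kaggle data) and calls read.table() with its default arguments:

```r
# Hypothetical stand-in for train.tsv: a header plus three tab-separated rows.
sample_lines <- c(
  "PhraseId\tSentenceId\tPhrase\tSentiment",
  "1\t1\tA series of escapades\t1",
  "2\t1\tA series\t2",
  "3\t2\tgood for the goose\t3"
)
path <- tempfile(fileext = ".tsv")
writeLines(sample_lines, path)

# With default arguments read.table() splits on any whitespace, so the
# words inside each phrase are treated as separate fields.
msg <- tryCatch(
  {
    read.table(path)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

On this sample the message should complain about line 1 in the same way, only with a smaller element count than 40, because the made-up phrases are shorter.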
Solution:
Usually, if I am stuck in R, the first thing I do is read the help file for the command in question. So I ran the command below:
?read.table
The help file says read.table()
“reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file”.
Well, that’s what I thought, and that is the reason I was trying to use this function. I skipped a few lines, kept reading, and found another important piece of information:
“read.delim and read.delim2 are for reading delimited files, defaulting to the TAB character for the delimiter”.
That is useful! I used the below statement and it worked like magic!
train <- read.delim("train.tsv",stringsAsFactors = FALSE)
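To see what read.delim() does differently, here is a small sketch on a made-up sample file (the rows are hypothetical, not the real competition data). It also spells out the read.table() call that read.delim() corresponds to, based on the defaults listed in the help file:

```r
# Hypothetical stand-in for train.tsv.
sample_lines <- c(
  "PhraseId\tSentenceId\tPhrase\tSentiment",
  "1\t1\tA series of escapades\t1",
  "2\t1\tA series\t2",
  "3\t2\tgood for the goose\t3"
)
path <- tempfile(fileext = ".tsv")
writeLines(sample_lines, path)

# read.delim() assumes a header line and a tab separator.
train <- read.delim(path, stringsAsFactors = FALSE)
str(train)

# The same read with read.table(), with read.delim()'s defaults written out.
same <- read.table(path, header = TRUE, sep = "\t", quote = "\"",
                   dec = ".", fill = TRUE, comment.char = "",
                   stringsAsFactors = FALSE)
print(identical(train, same))
```

So read.delim() is not a different parser, just read.table() with defaults that suit tab-delimited files.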
Still, something was bothering me. What was the error about, and why was it expecting 40 elements in the first line? I re-read the help file for read.table() and the answer was right there.
It says
“The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary”.
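The column counting can be inspected directly with count.fields(), which reports how many fields each line has for a given separator. A minimal sketch on a made-up sample file (hypothetical rows, not the Kaggle data):

```r
# Hypothetical stand-in for train.tsv.
sample_lines <- c(
  "PhraseId\tSentenceId\tPhrase\tSentiment",
  "1\t1\tA series of escapades\t1",
  "2\t1\tA series\t2",
  "3\t2\tgood for the goose\t3"
)
path <- tempfile(fileext = ".tsv")
writeLines(sample_lines, path)

# Fields per line when splitting on whitespace (read.table's default sep = "").
ws_counts <- count.fields(path, sep = "")
# Fields per line when splitting on tabs only.
tab_counts <- count.fields(path, sep = "\t")

print(ws_counts)
print(tab_counts)
```

With whitespace splitting the counts vary from line to line, and the header has fewer fields than the longest of the first five lines. With tab splitting every line has four fields. In the real file, the longest of the first five lines presumably breaks into 40 whitespace-separated pieces, which would explain the number in the error.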
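By default read.table() splits fields on any whitespace (sep = ""), so every word of a phrase counts as a separate field; that is presumably where the 40 comes from. So I set the separator to a tab and passed the argument fill = TRUE to read.table(), as below.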
train_data <- read.table("train.tsv", stringsAsFactors = FALSE, sep = "\t", fill = TRUE)
The above command worked and created the data frame train_data, but it produced the warning message below:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string
Now, how to get rid of this warning message? I searched the help file for anything about quoted strings and found this:
“To disable quoting altogether, use quote = "".”
So I tried the read.table() command below, and it worked without any error or warning. Yay!
train_data <- read.table("train.tsv", stringsAsFactors = FALSE, sep = "\t", fill = TRUE, quote = "")
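The effect of quote = "" can be seen on a small made-up file in which one phrase contains an apostrophe (the rows are hypothetical, not the real data). One addition that is not in the commands above: the sketch passes header = TRUE so that the first line becomes the column names instead of a data row.

```r
# Hypothetical stand-in for train.tsv; one phrase contains an apostrophe.
sample_lines <- c(
  "PhraseId\tSentenceId\tPhrase\tSentiment",
  "1\t1\tA series of escapades\t1",
  "2\t1\tA series\t2",
  "3\t2\tgood for the goose\t3",
  "4\t2\tgood\t3",
  "5\t3\tit doesn't work\t1",
  "6\t3\tlast phrase\t2"
)
path <- tempfile(fileext = ".tsv")
writeLines(sample_lines, path)

# Default quoting: the apostrophe opens a quoted string that never closes.
# Collect the warnings instead of printing them.
warnings_seen <- character(0)
noisy <- withCallingHandlers(
  read.table(path, header = TRUE, stringsAsFactors = FALSE,
             sep = "\t", fill = TRUE),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Quoting disabled: apostrophes are ordinary characters.
clean <- read.table(path, header = TRUE, stringsAsFactors = FALSE,
                    sep = "\t", fill = TRUE, quote = "")

print(warnings_seen)
print(nrow(noisy))
print(nrow(clean))
```

The point to note is that the warning is not harmless: with default quoting, everything after the unmatched apostrophe is swallowed into a single field, so rows can silently go missing. It is worth comparing nrow() against the number of lines in the file after any read that produced this warning.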
Thank you for reading!