The First Basic Random Forest Algorithm in R

Random Forest is a popular Machine Learning algorithm which can be used for both Regression and Classification tasks. The idea behind the algorithm is explained by its name itself.
Forest – It builds a forest made up of many decision trees.
Random – Each decision tree is built from a random sample of the data, and the variables considered at each split are also chosen at random. So each subset can differ in size, rows and variables, and the subsets may or may not overlap with each other.

In Classification problems, each decision tree votes for a class, and the class with the most votes is chosen as the prediction.

In Regression problems, the prediction is the average of the outputs from all of the trees.
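As a toy illustration of these two rules (the tree predictions below are made-up numbers, not the internals of the randomForest package):

# Classification: each tree votes for a class, the majority wins
tree_votes <- c("Survived", "Died", "Survived", "Survived", "Died")
names(which.max(table(tree_votes)))   # "Survived"

# Regression: the prediction is the average of the tree outputs
tree_outputs <- c(21.5, 23.0, 22.1, 20.8, 22.4)
mean(tree_outputs)                    # 21.96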


Applications of Random Forest Algorithm:
Banking: Random Forest is widely used in the banking sector to distinguish loyal customers from fraudulent ones.
Stock Market: In the stock market, the Random Forest algorithm can be used to identify stock behaviour and to predict the expected loss or gain on purchasing a particular stock.
E-commerce: In e-commerce, Random Forest can be used to recommend products to customers based on similar kinds of searches.
Now, let’s start with a simple example of the Random Forest Algorithm.

Problem:
Create a Random Forest Model in R
Solution:
To create our first Random Forest model in R, we need to install the "randomForest" package.
install.packages("randomForest")
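Once installed, the package has to be loaded into the R session:

# Load the randomForest package for model building
library(randomForest)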
Let’s follow the below steps to build our first Model:

1. Pick a Dataset:
For this example, we will use the same Titanic Dataset from Kaggle that we used earlier for the Logistic Regression example. Below is a brief description of the Data:

Variable    Definition                                    Values
survival    Survival                                      0 = No, 1 = Yes
pclass      Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex                                           Male / Female
Age         Age in years
sibsp       No. of siblings/spouses aboard the Titanic
parch       No. of parents/children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton
2. Load Data:

First, we will load the "Train.csv" and "Test.csv" data files using the read.csv() function. While loading the files, we will use na.strings = "" so that the missing values in the data are read in as NA. We will then bind these 2 datasets into one for further Data Cleaning.
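A minimal sketch of this step, assuming the Kaggle files are saved as "Train.csv" and "Test.csv" in the working directory (the Kaggle test file has no Survived column, so one is added before binding):

# Read the two Kaggle files; empty strings are read in as NA
train <- read.csv("Train.csv", na.strings = "")
test  <- read.csv("Test.csv",  na.strings = "")

# The test file has no Survived column, so add it as NA before binding
test$Survived <- NA
titanic <- rbind(train, test)

str(titanic)   # quick look at the combined data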


3. Data Cleaning:

To keep this example simple, we are not going to add any new features. We will do the following cleanups instead, sketched below:

  • Remove a few variables that may not be beneficial for our analysis.
  • Check for NA values and replace them with meaningful values.
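A rough sketch of both cleanups, assuming the combined data frame titanic from the previous step; the exact columns dropped and the imputation rules are illustrative choices, not the only reasonable ones:

# Drop variables that are unlikely to help the model
titanic$PassengerId <- NULL
titanic$Name        <- NULL
titanic$Ticket      <- NULL
titanic$Cabin       <- NULL

# Count the NA values in each column
colSums(is.na(titanic))

# Replace NAs with something meaningful: median Age and Fare,
# and the most common port of embarkation
titanic$Age[is.na(titanic$Age)]           <- median(titanic$Age,  na.rm = TRUE)
titanic$Fare[is.na(titanic$Fare)]         <- median(titanic$Fare, na.rm = TRUE)
titanic$Embarked[is.na(titanic$Embarked)] <- "S"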

4. Create the Model: 

To create our first Random Forest model, we need to divide the dataset into Train and Test sets as we did earlier. We will use the randomForest() function for this example.
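A sketch of this step, assuming the cleaned, combined data frame titanic from the previous sketch (the object names and the set.seed value are illustrative):

# Split the combined data back into the original Train (Survived known)
# and Test (Survived is NA) portions
train_clean <- titanic[!is.na(titanic$Survived), ]
test_clean  <- titanic[ is.na(titanic$Survived), ]

set.seed(123)   # for reproducible results
model <- randomForest(as.factor(Survived) ~ ., data = train_clean,
                      ntree = 500, importance = TRUE, proximity = TRUE)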


If we follow the above steps and try to create our first Random Forest model, we may get an "NAs introduced by coercion" error.

This error occurs if any variable in the dataset has class 'character'. Please check here for the solution.
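One common fix is to convert the character columns to factors before fitting; a minimal sketch (this may differ from the solution behind the link above):

# Convert every character column in the combined data to a factor,
# then redo the Train/Test split and the randomForest() call from step 4
char_cols <- sapply(titanic, is.character)
titanic[char_cols] <- lapply(titanic[char_cols], as.factor)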

5. Understand the Model:

Once the Random Forest model is built without any error, we can analyze our current Model.

Meaning of the parameters used above in the Model:
importance: should the importance of the predictors be assessed?
proximity: should the proximity measure among the rows be calculated?
ntree: the number of trees to grow

So, the output shows that it is a classification model and that the number of variables tried at each split is 2. If we check the plot, the error gradually stabilizes after about 100 trees.
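A short sketch of how that summary and plot can be produced, assuming the fitted object is called model as in the sketch above:

print(model)   # model type, ntree, mtry, OOB error and confusion matrix
plot(model)    # error rates versus the number of trees grown
legend("topright", colnames(model$err.rate), col = 1:3, lty = 1:3)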

If we check the Variable Importance Plots, there are two types of importance measures.

The accuracy measure (MeanDecreaseAccuracy) shows how much worse the model performs without a variable, so a large decrease in accuracy is expected for a highly predictive variable.

The Gini measure (MeanDecreaseGini) reflects how much each variable contributes to the purity of the nodes at the ends of the trees; a variable with a high score is highly important.

In both of the above plots, the Sex of the passenger appears to be highly important in deciding whether the passenger survived or not.
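These plots and the underlying numbers can be obtained with, for example (again assuming the fitted object is called model):

importance(model)   # importance table: MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(model)   # the two variable importance plots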
 
6. Predict Survival for the Test Dataset:
Now, let's use the above model to predict survival in the Test dataset. To predict survival, we need to use the predict() function.
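A sketch of this step, using the model and test_clean objects from the earlier sketches:

# Predict survival for the passengers in the Test dataset
test_pred <- predict(model, newdata = test_clean)

# Tally the predicted classes
table(test_pred)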

As per this model, 133 passengers are predicted to have survived and 285 to have died.

Thank You! 
