Learning Logistic Regression in R

Logistic Regression is a technique to predict a Categorical Variable Outcome based on one or more input variable(s). Logistic Regression is a classification algorithm. For binary categorical outcomes like 0/1 or TRUE/FALSE or YES/NO values, we can use the Binomial Logistic Regression Model. There is a concept of Multinomial Logistic Regression Model which we may use to classify Films as “Horror”, “Drama” and “Romantic”. We can learn about it sometime later. Let’s first concentrate on our Binomial Model.
Problem:
Create a Logistic Regression Model Example in R
Solution:
Below are the steps that we will follow to create our first Logistic Regression Model.
1. Pick a DataSet:
For this example, we are going to use the Kaggle Titanic Datasets. There are two Datasets “Train.csv” and “Test.csv”. We are going to build a Logistic Regression Model using the Training Set. Then we will use the Model to predict Survival Probability for each passenger in the Test Dataset. 
Below is a brief description of the Data:
Variable
Definition
Values
survival
Survival
0= No, 1 = Yes
pclass
Ticket class
1= 1st, 2= 2nd, 3= 3rd
sex
Sex
Male/Female
Age
Age in years
 
sibsp
No of siblings/spouses aboard the Titanic
 
parch
No of parents/children aboard the Titanic
 
ticket
Ticket number
 
fare
Passenger fare
 
cabin
Cabin number
 
embarked
Port of Embarkation
C = Cherbourg,
Q = Queenstown,
S = Southampton
2. Load Data:
First, we will Load the “Train.csv” and “Test.csv” Datafiles using the function read.csv(). While loading the Datafiles, we will use na.strings = “” so that the missing values in the data are loaded as NA.We will bind these 2 Datasets into one for further Data Cleaning.



3. Data Cleaning:

To keep this example simple, we are not going to add any new features. We will do the following cleanups instead:

  • Remove a few variables which may not be beneficial for our analysis.
  • Check for the NA values and replace the NA values with some meaningful data.

4. Create a Model:

To create our first Logistic Regression Model, we need to divide the dataset into Train and Test as we had earlier. We will use glm() (Generalized Linear Model) and Family= Binomial for this example.

5. Understand The Model:

Let’s look at the Coefficients of the above Model and try to understand the meaning of the predictor variables.

  • It looks like the variables “Pclass”,”Sex” and “Age” are highly significant in building the above Model. Predictor “SibSp” is also significant. But, “Parch” and “Embarked” has less or no significance at all in this Model. 
  • The Estimate column for all the variables is Negative. That means, for example, if all other variables being equal, the MALE passenger is less likely to have survived.
  • The P-value is lowest for the Age variable and it signifies a strong association with the probability of being survived.
  • AIC Value is the measure of the quality of the Model. The preferred model is the one with Minimum AIC Value.
6. Predict Survival for Test Data:

Now, let’s use the above Model to Predict Survival in the Test DataSet. To predict survival, we need to use the Predict(). 

If we check the summary of our prediction, it looks like the Minimum value is 0 and the Max value is 1. That means we are calculating the probability of survival of a passenger and the value is between 0 and 1. In this example, 156 passengers did survive and 262 passengers did not.

The Prediction value and the efficiency of the Model can be improved by adding some more features and trying adding or removing some variables. I have added my detailed analysis for Titanic Dataset in Github.

Thank You!

0

5 comments

  1. website hosting services

    Hey are using WordPress for your site platform? I’m new to
    the blog world but I’m trying to get started and create my own.
    Do you require any html coding knowledge to make your own blog?

    Any help would be really appreciated!

Leave a Reply

Your email address will not be published. Required fields are marked *