ggplot2:: Jitter plot in R using Titanic Dataset

A Jitter Plot is almost the same as the scatter plot that I discussed in my last post. It took me some time to figure out how it is different. So, here is my understanding explained with an example.
As we add more and more data points to a scatter plot, points start landing on top of one another and the pattern becomes hard to read. This is known as overplotting.
For example, if there are 100 data points at (5,5), they appear as a single point at (5,5). When many points overlap, it is difficult to get a sense of their density. In this kind of situation, adding some "jitter" nudges the overlapping points apart and makes the density of points visible.
Jittering is the act of adding random noise to data in order to prevent overplotting in statistical graphs. It adds a small amount of random variation to the location of each point and thus it is a useful way of handling overplotting.
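To make the idea concrete, here is a minimal base R sketch of the (5,5) example above. The noise amount of 0.2 and the use of uniform noise are my own illustrative choices, not something fixed by the definition of jittering.

```r
# 100 identical points at (5, 5): on a plot they collapse into one dot.
set.seed(42)            # only so the sketch is repeatable
n <- 100
x <- rep(5, n)
y <- rep(5, n)

# Jitter: add a small uniform random offset in [-amount, +amount].
amount <- 0.2           # illustrative value, pick what suits your data
x_jit <- x + runif(n, min = -amount, max = amount)
y_jit <- y + runif(n, min = -amount, max = amount)

# Number of distinct locations before and after jittering.
n_before <- nrow(unique(data.frame(x, y)))
n_after  <- nrow(unique(data.frame(x_jit, y_jit)))

# Side-by-side comparison of the raw and jittered points.
op <- par(mfrow = c(1, 2))
plot(x, y, main = "Raw", xlim = c(4.5, 5.5), ylim = c(4.5, 5.5))
plot(x_jit, y_jit, main = "Jittered", xlim = c(4.5, 5.5), ylim = c(4.5, 5.5))
par(op)
```

Base R also ships a jitter() function that does much the same thing; the manual version is shown here only to make the mechanics visible.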
Problem:
Create a jitter plot using R and the Titanic dataset.
Solution:
We will use the ggplot2 library and the Titanic dataset to create our jitter plot. The data is first loaded and cleaned, and the code for the same is posted here.
Now, let's have a look at our cleaned Titanic dataset.

The jitter geom is a convenient shortcut for geom_point(position = "jitter").

Let's plot the Age and Fare attributes to look for a relationship, the way we did in our scatter plot example. The only difference is that we are going to use geom_jitter() instead of geom_point().
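A minimal sketch of that plot follows. The real cleaned dataset is not reproduced here, so the small titanic data frame below is a hypothetical stand-in: the three high-fare rows follow the ages and sexes described later in this post, and the other rows are made up. Putting Age on the x axis and Fare on the y axis is also my assumption.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned Titanic data (not the real file).
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 35, 36),
  Fare = c(7.25, 71.28, 7.93, 512, 512, 512),
  Sex  = c("male", "female", "female", "female", "male", "male")
)

# geom_jitter() with no arguments uses the default (small) amount of jitter.
# It is shorthand for geom_point(position = "jitter").
p <- ggplot(titanic, aes(x = Age, y = Fare)) +
  geom_jitter()

print(p)
```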

The graph looks almost exactly the same as the scatter plot. Now let's add some color to the data points according to the Sex of the passengers.
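Coloring by Sex only needs one more aesthetic mapping. As before, this is a sketch on a hypothetical stand-in data frame rather than the real cleaned dataset, and the axis choice is my assumption.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned Titanic data (not the real file).
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 35, 36),
  Fare = c(7.25, 71.28, 7.93, 512, 512, 512),
  Sex  = c("male", "female", "female", "female", "male", "male")
)

# Map Sex to the colour aesthetic so each group gets its own color.
p <- ggplot(titanic, aes(x = Age, y = Fare, colour = Sex)) +
  geom_jitter()

print(p)
```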

Still the same as the scatter plots! So how is it different?

I started asking some questions of the data. How many passengers paid a fare greater than $500? Are they male or female? The graph above shows only 2 passengers in that category, and both of them are male. I tried to validate that against the data.
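A validation query of this kind could look like the sketch below. It runs on a hypothetical stand-in data frame, so the counts it returns are true of the stand-in by construction; on the real cleaned dataset you would run the same subset() call on your own data frame.

```r
# Hypothetical stand-in for the cleaned Titanic data (not the real file).
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 35, 36),
  Fare = c(7.25, 71.28, 7.93, 512, 512, 512),
  Sex  = c("male", "female", "female", "female", "male", "male")
)

# Keep only the passengers whose fare exceeds 500.
high_fare <- subset(titanic, Fare > 500)
print(high_fare)

# Count those passengers by sex.
by_sex <- table(high_fare$Sex)
print(by_sex)
```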

My query shows there are 3 passengers who paid a Fare greater than $500. One of the passengers is female and the other two are male. The female passenger is age 35, and the two male passengers are age 35 and 36 respectively. So, where is the third data point, the female passenger, in the above graph, and how can we display it?

Since we are plotting Age vs Fare, the three data points, written as (Fare, Age), are (512,35), (512,35) and (512,36). So, there are 2 points at (512,35), one male and one female, and they overlap exactly. Even changing the color according to the Sex of the passenger did not reveal the hidden point, because one point is drawn directly on top of the other.
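One way to find such hidden points without looking at the plot is to count how many rows share each (Age, Fare) pair. This is a sketch on a hypothetical stand-in data frame; the helper column n is introduced only for the counting.

```r
# Hypothetical stand-in for the cleaned Titanic data (not the real file).
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 35, 36),
  Fare = c(7.25, 71.28, 7.93, 512, 512, 512),
  Sex  = c("male", "female", "female", "female", "male", "male")
)

# Add a column of ones and sum it per (Age, Fare) combination.
counts <- aggregate(n ~ Age + Fare, data = transform(titanic, n = 1), FUN = sum)

# Any combination with n > 1 is drawn as a single dot on an unjittered plot.
hidden <- subset(counts, n > 1)
print(hidden)
```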

Now, let’s use geom_jitter() with position argument.
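A sketch of such a call is below, again on a hypothetical stand-in data frame. The width of 0.5 and height of 5 are illustrative values I picked so the overlapping points separate clearly; they are not taken from the original plot, and you should tune them to your own axes.

```r
library(ggplot2)

# Hypothetical stand-in for the cleaned Titanic data (not the real file).
titanic <- data.frame(
  Age  = c(22, 38, 26, 35, 35, 36),
  Fare = c(7.25, 71.28, 7.93, 512, 512, 512),
  Sex  = c("male", "female", "female", "female", "male", "male")
)

# Illustrative jitter amounts (assumptions, not values from the original post).
jitter_width  <- 0.5   # horizontal: up to 0.5 years either side of Age
jitter_height <- 5     # vertical: up to 5 fare units either side of Fare

p <- ggplot(titanic, aes(x = Age, y = Fare, colour = Sex)) +
  geom_jitter(position = position_jitter(width = jitter_width,
                                         height = jitter_height))

print(p)
```

If you need the same picture on every run, position_jitter() also accepts a seed argument.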

In the above plot, we are able to see all three data points, 2 male and 1 female, as expected. The above piece of code generates a slightly different plot on each run, since the jitter is added randomly each time. The width and height arguments set the amount of horizontal and vertical jitter, respectively. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here.

Although jittering can be a useful tool, it actually means adding additional noise to the data.

So, from a data visualization perspective, this additional variation can be misleading and can lead to misinterpretation of data.

Thank You!
