Let us take a sample dataset and clean the data with the concepts that we learned in Dplyr and Tidyr package.
Problem:
Use Dplyr and Tidyr package to clean the NYC 2013 Flights dataset.
Solution:
For this assignment, we can use the data sets related to flights that departed from NYC in 2013.
Below is the output dataset:
0 ![]() |
Flights |
Below is the code that I have used to clean the dataset:
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
library(nycflights13) | |
library(tidyr) | |
library(dplyr) | |
# Check the Dataset flights | |
View(flights) | |
#----------------------------------------------------------------- | |
# Use Unite() to merge month, day and year columns into single column called fly_date | |
clean_flights <- unite(flights,fly_date, month,day,year, sep = "-") | |
#----------------------------------------------------------------- | |
# Use arrange() to order data by origin and destination | |
clean_flights <- arrange(clean_flights,origin, dest) | |
#----------------------------------------------------------------- | |
# Use mutate() to remove the columns which are not required | |
clean_flights <- mutate(clean_flights,dep_delay = NULL, | |
arr_delay= NULL, | |
air_time = NULL, | |
time_hour = NULL) | |
# Use mutate() to add a new column "Total_Time" | |
clean_flights <- mutate(clean_flights,total_time = hour* 60 + minute) | |
# Use mutate() to add a new column "Speed" | |
clean_flights <- mutate(clean_flights, speed = distance/total_time * 60) | |
#----------------------------------------------------------------- | |
# Use filter() to select flights from EWR to SFO which took less than 10 hours | |
clean_flights <- filter(clean_flights,origin == "EWR", dest == "SFO", hour < 10) | |
# Order Data in ascending order of Total_time | |
clean_flights <- arrange(clean_flights,total_time) | |
View(clean_flights) |
Below is the output dataset:
![]() |
Clean_Flights |