Data wrangling - an applied example with corn hybrid and nitrogen rate effect on plant height
Initial notes
Hey there.
The exercise below is a bit more complex than the first post.
I envision two different ways of learning with this post:
You follow it from start to end, running the code on your R console and checking what is happening yourself, or
You may read sections 1 through 3 to understand the data set, and then jump right into whatever wrangling function you are trying to learn.
Either way, the dataset used and all the code are available below, so you should be able to run it on your own.
1) Script description
This script was developed to demonstrate an agriculturally applied data wrangling problem and solution.
Now, what is data wrangling?
From wikipedia:
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
Thus, be it manually with excel or even using pivot tables, you are likely doing data wrangling already.
Let me show you a reproducible, better, and faster alternative to doing it all in excel.
The specific objectives of this script are to:
Import a dataset from github and understand its current structure.
Re-format the dataset in order to have it in the proper format for plotting and analysis. For this, we will be using multiple data wrangling functions from the dplyr and tidyr packages, which are part of the tidyverse family of packages.
Learn to use the pipe operator (
%>%
) in a sequence of data wrangling verbs.
2) Setup
Here we load all necessary packages to be used during this exercise.
The packages below need to be installed the first time if you haven’t yet.
To install a package, use the function install.packages("packagename")
.
2.1) Packages description
Package tidyverse actually contains a group of packages that follow the same syntax standard. It includes the packages dplyr and tidyr, which are what we will be using the most for this exercise.
Package RCurl is needed in order to import the dataset that we will work with from my personal github account.
# Setting global chunk options
knitr::opts_chunk$set(echo = TRUE, tibble.width=Inf)
# Loading necessary packages
library(tidyverse)
library(RCurl)
2.2) Data import
Let’s import the dataset.
This dataset is saved on my personal github account, so we are going to use the getURL()
function to save its path into the df_path object.
After that point, we just use the read_csv()
function as we would normally.
Here, we are saving the data into an object called corn.
# Saving the URL path into the object df_path
df_path<- getURL("https://raw.githubusercontent.com/leombastos/datasets/master/corn.csv")
# Reading the data file and saving it to an object called corn
corn<- read_csv(df_path)
## Rows: 16 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Hybrid
## dbl (8): Rep, Nrate.lbac, Height.cm_V3, Height.cm_V7, Height.cm_V12, Height....
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset has been imported.
3) Data exploration
Now, let’s explore it a bit to understand its current structure.
# Printing the first 6 rows
corn %>%
head()
## # A tibble: 6 × 9
## Rep Hybrid Nrate.lbac Height.cm_V3 Height.cm_V7 Height.cm_V12 Height.cm_V16
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 A 50 7.6 35.7 85 184
## 2 1 B 50 7.4 24.8 84.7 173.
## 3 1 A 100 5.1 31.1 91.4 206.
## 4 1 B 100 7.4 17.2 75.8 180.
## 5 2 A 50 7 38.7 85.9 197
## 6 2 B 50 7.8 22.2 83.6 174.
## # … with 2 more variables: Height.cm_VT <dbl>, Height.cm_R4 <dbl>
This dataset includes information on a field experiment with corn.
The experimental design was randomized complete block with four replicates. The treatment design was a 2 hybrid x 2 N rate factorial.
This dataset contains 9 columns. They are:
- Rep: the replicate number (1 through 4).
- Hybrid: the corn hybrids evaluated (A or B).
- Nrate.lbac: the N application rates evaluated, in lbs/ac (50 and 100).
- Height.cm_STAGE: six columns containing the average height of 15 plants measured in each plot, in cm. STAGE is the crop growth stage of the crop at time of measurement (V3, V7, V12, V16, VT, R4).
Let’s take a look at how R interpreted the variable types:
corn %>%
glimpse()
## Rows: 16
## Columns: 9
## $ Rep <dbl> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4
## $ Hybrid <chr> "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "…
## $ Nrate.lbac <dbl> 50, 50, 100, 100, 50, 50, 100, 100, 50, 50, 100, 100, 50…
## $ Height.cm_V3 <dbl> 7.6, 7.4, 5.1, 7.4, 7.0, 7.8, 11.5, 6.1, 7.8, 6.4, 11.4,…
## $ Height.cm_V7 <dbl> 35.7, 24.8, 31.1, 17.2, 38.7, 22.2, 31.7, 26.3, 45.1, 22…
## $ Height.cm_V12 <dbl> 85.0, 84.7, 91.4, 75.8, 85.9, 83.6, 100.1, 96.8, 89.3, 8…
## $ Height.cm_V16 <dbl> 184.0, 173.3, 205.5, 179.6, 197.0, 174.3, 216.2, 186.9, …
## $ Height.cm_VT <dbl> 209.5, 195.2, 218.0, 192.3, 213.3, 190.6, 224.8, 195.8, …
## $ Height.cm_R4 <dbl> 189.6, 185.4, 222.4, 189.8, 208.2, 192.3, 219.3, 185.4, …
We see that the variable Rep is of type double, and Hybrid is of type character.
I want to change their type to factor. For that, let’s use the function factor()
.
# Changing variable type to factor
corn$Rep<-factor(corn$Rep)
corn$Hybrid<- factor(corn$Hybrid)
# Checking variable type again
corn %>%
glimpse()
## Rows: 16
## Columns: 9
## $ Rep <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4
## $ Hybrid <fct> A, B, A, B, A, B, A, B, A, B, A, B, A, B, A, B
## $ Nrate.lbac <dbl> 50, 50, 100, 100, 50, 50, 100, 100, 50, 50, 100, 100, 50…
## $ Height.cm_V3 <dbl> 7.6, 7.4, 5.1, 7.4, 7.0, 7.8, 11.5, 6.1, 7.8, 6.4, 11.4,…
## $ Height.cm_V7 <dbl> 35.7, 24.8, 31.1, 17.2, 38.7, 22.2, 31.7, 26.3, 45.1, 22…
## $ Height.cm_V12 <dbl> 85.0, 84.7, 91.4, 75.8, 85.9, 83.6, 100.1, 96.8, 89.3, 8…
## $ Height.cm_V16 <dbl> 184.0, 173.3, 205.5, 179.6, 197.0, 174.3, 216.2, 186.9, …
## $ Height.cm_VT <dbl> 209.5, 195.2, 218.0, 192.3, 213.3, 190.6, 224.8, 195.8, …
## $ Height.cm_R4 <dbl> 189.6, 185.4, 222.4, 189.8, 208.2, 192.3, 219.3, 185.4, …
We see that now Rep and Hybrid are of type factor. Great!
Now, let’s explore some of the wrangling functions.
4) Data wrangling
names(datasetname)
to have a list of the column names. This is useful when specifying column names throughout different wrangling functions below.4.1) filter()
(works on rows)
Let’s assume I would like to filter this dataset in a way that I only had data containing hybrid A and N rate of 50 lbs N/ac.
The wrangling function for that is filter()
.
Let’s apply it and save to an object called corn2.
# Filtering Hybrid=="A" and Nrate.lbac==50
corn2<-filter(.data = corn,
Hybrid=="A", Nrate.lbac==50)
# Checking corn2
corn2 %>%
head()
## # A tibble: 4 × 9
## Rep Hybrid Nrate.lbac Height.cm_V3 Height.cm_V7 Height.cm_V12 Height.cm_V16
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 A 50 7.6 35.7 85 184
## 2 2 A 50 7 38.7 85.9 197
## 3 3 A 50 7.8 45.1 89.3 188.
## 4 4 A 50 7.3 31.1 99.5 208.
## # … with 2 more variables: Height.cm_VT <dbl>, Height.cm_R4 <dbl>
Nice, now the object corn2 contains only data related to hybrid A and N rate of 50 lbs/ac.
4.2) select()
(works on columns)
Now suppose I want to select only columns containing information on hybrid and height at R4.
The wrangling function for that is select()
.
Let’s apply it and save to an object called corn3.
# Selecting only columns Hybrid and Height.cm_R4
corn3<- dplyr::select(.data = corn,
Hybrid, Height.cm_R4)
# Checking corn3
corn3 %>%
head()
## # A tibble: 6 × 2
## Hybrid Height.cm_R4
## <fct> <dbl>
## 1 A 190.
## 2 B 185.
## 3 A 222.
## 4 B 190.
## 5 A 208.
## 6 B 192.
There we have it!
4.3) mutate()
(creating new variables)
Let’s say we want to publish this work on an international journal that only accepts metric system units.
The unit of the N rate column Nrate.lbac is lbs/ac, part of the imperial system.
Let’s create a new column called Nrate.kgha, which will contain the N rate information in the metric system unit of kg/ha.
The wrangling function for that is mutate()
.
Let’s apply it and save to an object called corn4.
# Creating the variable Nrate.kgha by transforming Nrate.lbac
corn4<- mutate(.data = corn,
Nrate.kgha=round(Nrate.lbac*0.453/0.404,0))
# Checking corn4
corn4 %>%
select(Rep, Hybrid,
Nrate.lbac, Nrate.kgha) %>%
head()
## # A tibble: 6 × 4
## Rep Hybrid Nrate.lbac Nrate.kgha
## <fct> <fct> <dbl> <dbl>
## 1 1 A 50 56
## 2 1 B 50 56
## 3 1 A 100 112
## 4 1 B 100 112
## 5 2 A 50 56
## 6 2 B 50 56
The last column in corn4 is now Nrate.kgha, which contains the N fertilizer rate information in the metric system unit kg/ha. Great!
4.4) unite()
(join the values of two columns into one)
Now, I want to “linearize” the treatment design.
In other words, I would like to have one column that contains a different entry for each combination of Hybrid and Nrate.kgha.
The wrangling function for that is unite()
.
Let’s apply it and save to an object called corn5.
# Uniting Hybrid and Nrate.kgha in one variable called Hybrid_Nrate.kgha
corn5<- unite(data = corn4,
Hybrid_Nrate.kgha, Hybrid, Nrate.kgha,
remove = FALSE)
# Checking corn5
corn5 %>%
head()
## # A tibble: 6 × 11
## Rep Hybrid_Nrate.kgha Hybrid Nrate.lbac Height.cm_V3 Height.cm_V7
## <fct> <chr> <fct> <dbl> <dbl> <dbl>
## 1 1 A_56 A 50 7.6 35.7
## 2 1 B_56 B 50 7.4 24.8
## 3 1 A_112 A 100 5.1 31.1
## 4 1 B_112 B 100 7.4 17.2
## 5 2 A_56 A 50 7 38.7
## 6 2 B_56 B 50 7.8 22.2
## # … with 5 more variables: Height.cm_V12 <dbl>, Height.cm_V16 <dbl>,
## # Height.cm_VT <dbl>, Height.cm_R4 <dbl>, Nrate.kgha <dbl>
Now we have a new column called Hybrid_Nrate.kgha, which contains info on both Hybrid and N rate. Nice!
4.5) gather()
(to combine values of multiple columns)
Now, let me ask you this:
If we wanted to create a plot with crop stage on the x axis and crop height on the y axis, how would we do it? Do we already have the right columns for that?
Stop for a second here and try to figure out where we have the crop stage information as of now.
head(corn5)
to inspect the variable names and the first six entries.I’ll add a few empty lines here so you don’t see the answer right away.
Answer: the crop stage information is in the height column headings (e.g. Height.cm_V3).
How do we remove it from there and create a variable with it?
Because we have many height columns, it would be easier for us to first gather all of them into one column only.
The wrangling function for that is gather()
.
Let’s apply it and save to an object called corn6.
# Gathering all of the height columns into two new columns containing the column name (Variable) and column values (Height.cm)
corn6<-gather(data = corn5,
Variable, Height.cm, #name of the two new columns to be created.
-Rep,-Hybrid_Nrate.kgha,-Hybrid, # all variables that we
-Nrate.lbac, -Nrate.kgha) # DON'T want to gather (thus the -)
# Checking corn6 (selecting specific columns for viz purposes)
corn6 %>%
select(Hybrid, Nrate.kgha,
Variable, Height.cm) %>%
head()
## # A tibble: 6 × 4
## Hybrid Nrate.kgha Variable Height.cm
## <fct> <dbl> <chr> <dbl>
## 1 A 56 Height.cm_V3 7.6
## 2 B 56 Height.cm_V3 7.4
## 3 A 112 Height.cm_V3 5.1
## 4 B 112 Height.cm_V3 7.4
## 5 A 56 Height.cm_V3 7
## 6 B 56 Height.cm_V3 7.8
OK, so now we have column Variable, which contains values in the format “Height.cm_STAGE”.
Almost there.
4.6) separate()
(to separate a column into two or more)
Since the term “Height.cm” is just repeated across all values in Variable, we can look at it as “uselessinfo_STAGE”.
Let’s use this information to split uselessinfo from STAGE.
The wrangling function for that is separate()
.
Let’s apply it and save to an object called corn7.
# Separating the values in column Variable
# into two new columns (Var and Stage), based on the separator "_"
corn7<-separate(data =corn6,
Variable,
into=c("Var", "Stage"), #name of two new columns
sep="_", remove = FALSE)
# Checking corn7 (selecting specific columns for viz purposes)
corn7 %>%
select(Hybrid, Nrate.kgha,
Variable, Var, Stage, Height.cm) %>%
head()
## # A tibble: 6 × 6
## Hybrid Nrate.kgha Variable Var Stage Height.cm
## <fct> <dbl> <chr> <chr> <chr> <dbl>
## 1 A 56 Height.cm_V3 Height.cm V3 7.6
## 2 B 56 Height.cm_V3 Height.cm V3 7.4
## 3 A 112 Height.cm_V3 Height.cm V3 5.1
## 4 B 112 Height.cm_V3 Height.cm V3 7.4
## 5 A 56 Height.cm_V3 Height.cm V3 7
## 6 B 56 Height.cm_V3 Height.cm V3 7.8
Now we have a column called Stage, which contains only the crop stage information!
4.7) group_by()
and summarise()
Now, suppose I want to calculate summary statistics (like mean and standard deviation) for all combinations between Hybrid and Nrate.kgha.
If I simply use mean(corn7$Height.cm)
, it gives the overall mean averaged over all cells.
However, we can use a function to create the groupings that we want before calling the mean()
function.
The wrangling function for that is group_by()
.
Let’s apply it and save to an object called corn.means.
corn.means<- group_by(.data = corn7,
Hybrid, Nrate.kgha) #variables to group
# Checking corn.means
corn.means
## # A tibble: 96 × 9
## # Groups: Hybrid, Nrate.kgha [4]
## Rep Hybrid_Nrate.kgha Hybrid Nrate.lbac Nrate.kgha Variable Var Stage
## <fct> <chr> <fct> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 A_56 A 50 56 Height.cm_V3 Heig… V3
## 2 1 B_56 B 50 56 Height.cm_V3 Heig… V3
## 3 1 A_112 A 100 112 Height.cm_V3 Heig… V3
## 4 1 B_112 B 100 112 Height.cm_V3 Heig… V3
## 5 2 A_56 A 50 56 Height.cm_V3 Heig… V3
## 6 2 B_56 B 50 56 Height.cm_V3 Heig… V3
## 7 2 A_112 A 100 112 Height.cm_V3 Heig… V3
## 8 2 B_112 B 100 112 Height.cm_V3 Heig… V3
## 9 3 A_56 A 50 56 Height.cm_V3 Heig… V3
## 10 3 B_56 B 50 56 Height.cm_V3 Heig… V3
## # … with 86 more rows, and 1 more variable: Height.cm <dbl>
Hmm, it looks just like corn7.
However, on the background, it changed slightly in its structure.
Let’s check it.
# Class type of corn7
class(corn7)
## [1] "tbl_df" "tbl" "data.frame"
# Class type of corn.means
class(corn.means)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
We can see how corn.means has an extra class type named “grouped_df”.
What that means is that corn.means has been grouped as specified above.
However, since we have not yet performed any action on the grouping, the dataset remained unchanged.
Now, we just need to ask for variables to be calculated based on the grouping.
The wrangling function for that is summarise()
.
Let’s apply it to calculate n (number of observations being summarized at each grouping level), mean, and standard deviation (sd), and save to an object called corn.means2.
corn.means2<- summarise(corn.means,
n=length(Height.cm),
mean=mean(Height.cm),
sd=sd(Height.cm))
## `summarise()` has grouped output by 'Hybrid'. You can override using the
## `.groups` argument.
# Checking corn.means2
corn.means2
## # A tibble: 4 × 5
## # Groups: Hybrid [2]
## Hybrid Nrate.kgha n mean sd
## <fct> <dbl> <int> <dbl> <dbl>
## 1 A 56 24 123. 83.3
## 2 A 112 24 132. 92.3
## 3 B 56 24 112. 79.0
## 4 B 112 24 113. 81.0
We have successfully calculated summary variables for each level of the grouping that was specified (Hybrid and Nrate.kgha). Nice!
You may ask “why did you calculate n?”.
I normally calculate n when I’m using the group_by()/summarise()
combo as a sanity check.
It helps me assess whether or not the correct number of observations went into the summarized variables.
Sometimes you may incorrectly specify something, and if you don’t have a way to check, you may never realize it.
In this case, n=24 because it has averaged over Rep (4 levels) x Stage (6 levels) = 24 observations.
n=length()
in your group_by()/summarise()
combos as a sanity check.5) Bringing it all together with pipes (%>%
)
All right, that was a lot of wrangling functions!
Did you realize how many intermediate datasets we created, one at each wrangling step?
It was a total of 9 intermediate objects from step 4.1 through 4.7.
If only there was a way we could combine all that into a single operation.
Well, there is!
And it is done by using the pipe operator %>%
.
The pipe operator basically passes what is on its left side as the input for what is on its right side.
Because of this behavior, when using the pipe, we only specify the dataset used before the first pipe, and DO NOT need to specify the data =
argument of the consecutive functions in a pipe stream.
Check the example below, where I created a pipe flow that includes all steps from 4.3 through 4.7, and saved it to an object called corn.means.piped.
Note how I only specified the dataset corn in the first row, and how the data =
argument for the consecutive functions has been omitted.
# Redoing steps 4.3 through 4.7 using the pipe operator
corn.means.piped<- corn %>%
mutate(Nrate.kgha=round(Nrate.lbac*0.453/0.404,0)) %>%
unite(Hybrid_Nrate.kgha, Hybrid, Nrate.kgha, remove = FALSE) %>%
gather(Variable, Height.cm,
-Rep,-Hybrid_Nrate.kgha,-Hybrid,
-Nrate.lbac, -Nrate.kgha) %>%
separate(Variable,
into=c("Var", "Stage"), sep="_",
remove = FALSE) %>%
group_by(Hybrid, Nrate.kgha) %>%
summarise(n=length(Height.cm),
mean=mean(Height.cm),
sd=sd(Height.cm))
## `summarise()` has grouped output by 'Hybrid'. You can override using the
## `.groups` argument.
# Checking corn.means.piped
corn.means.piped %>% head()
## # A tibble: 4 × 5
## # Groups: Hybrid [2]
## Hybrid Nrate.kgha n mean sd
## <fct> <dbl> <int> <dbl> <dbl>
## 1 A 56 24 123. 83.3
## 2 A 112 24 132. 92.3
## 3 B 56 24 112. 79.0
## 4 B 112 24 113. 81.0
Nice!
Since the object corn.means.piped was created basically with the same functions as object corn.means2, they should have the exact same content.
We can check that by using the function identical()
.
If both datasets are 100% the same, the function returns TRUE
.
If even only one cell is different, it returns FALSE
.
Let’s check it ourselves.
# Checking corn.means.piped
identical(corn.means2, corn.means.piped)
## [1] TRUE
It returned TRUE
, which means that both objects have the exact same content.
Now we have gotten our dataset wrangled to meet our analysis and plotting needs (to be included in a future post!).
That’s it for today!
Now you can start using R for your data wrangling projects and benefit from its reproducibility, reliabilitiy, and speed.
6) Summary
This script demonstrated how to:
- Import a dataset from github
- Use multiple data wrangling functions including
filter()
to subset rows
select()
to subset columns
mutate()
to create new variables
unite()
to join the values of two columns
gather()
to combine values of multiple columns
separate()
to separate a column name into two or more columns
group_by()
to create groupings based on specific columns
summarise()
to extract summary variables from grouped data
- Use the pipe operator (
%>%
) to create a continuous wrangling stream
Before we depart, I just wanted to point to a great data wrangling resource called the RStudio Data Wrangling Cheatsheet.
This cheat sheet includes quick demonstrations of how to use not only the functions we learned today, but also many others. Check it out!
I hope you have enjoyed this post!
Please let me know in the comments below if you have any questions about this script, any suggestions to improve it, and any suggestions for future posts.
Happy wrangling!