While attending RStudio::conf 2020, a couple of people asked me for a very lightweight introduction to the Tidyverse, a collection of some of the most useful R packages, all designed to work well together. I thought I would try writing the most concise introduction possible for people who are in a hurry or only want a little taste of what is out there.
First install the Tidyverse:
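A minimal sketch of the install step, assuming you are installing from CRAN with the standard install.packages() function (it downloads packages, so it needs an internet connection):

```r
# Install the tidyverse meta-package and its component packages from CRAN
install.packages("tidyverse")
```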
Then load the Tidyverse:
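Loading takes a single call to library(). This sketch assumes the tidyverse package is already installed:

```r
# Attach the core Tidyverse packages (dplyr, tibble, ggplot2, and others)
library(tidyverse)
```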
If you have used R before you probably know how to create sequences of
integers with the colon (:) operator:
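For instance, a sequence of the integers from 1 to 10 (the range used in the rest of this post) could be written as:

```r
# The colon operator builds the integer sequence 1, 2, ..., 10
1:10
```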
If you wanted to know the sum of all of the numbers between 1 and 10 you might write some code like:
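A sketch of that calculation in base R, with no Tidyverse needed yet:

```r
# Add up the integers from 1 to 10
sum(1:10)
#> [1] 55
```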
The Tidyverse makes heavy use of the pipe operator (%>%), which takes
whatever is on the left-hand side of the pipe and makes it the first argument
of whatever function is on the right-hand side of the pipe. Therefore
1:10 %>% sum() is equivalent to sum(1:10):
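A short sketch of the piped version, assuming the Tidyverse is installed (loading it makes %>% available):

```r
library(tidyverse)

# Pipe the sequence into sum(); this is the same as calling sum(1:10)
1:10 %>% sum()
#> [1] 55
```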
The Tidyverse is largely concerned with manipulating data frames. One of my
favorite data frames that is built into R is called trees. The Tidyverse has
its own kind of data frame called a tibble. Let’s turn trees into a tibble
for our future convenience:
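One way to do the conversion is with as_tibble(). This is a sketch that reuses the name trees for the result, which is an assumption about how the original code was written:

```r
library(tidyverse)

# Convert the built-in trees data frame to a tibble.
# Wrapping the assignment in parentheses also prints the result.
(trees <- as_tibble(trees))
```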
I put the above code in parentheses so that you could see how the resulting
tibble prints nicely. Notice that the dimensions of this tibble (the number of
rows and columns) are printed at the top, where it says # A tibble: 31 x 3.
The Volume variable represents the volume of the tree in cubic feet. Let’s
create a new column that specifies whether it is a small tree using
mutate(). Let’s say that a small tree has a Volume less than 19.
In the code below you can specify that the name of the new column will be
Small_Tree:
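A minimal sketch of this step, assuming trees has been converted to a tibble as described earlier:

```r
library(tidyverse)

trees <- as_tibble(trees)

# Add a logical column: TRUE when Volume is below 19 cubic feet
trees %>%
  mutate(Small_Tree = Volume < 19)
```

The left side of the = inside mutate() is the new column's name, and the right side is the expression evaluated for each row.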
Imagine that you only want to look at trees that have a Height greater than 70.
You can use
filter() to eliminate rows from a tibble:
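A sketch of the filtering step, assuming the Small_Tree column is built with mutate() as a Volume below 19:

```r
library(tidyverse)

trees <- as_tibble(trees)

# Keep only the rows where Height is greater than 70
trees %>%
  mutate(Small_Tree = Volume < 19) %>%
  filter(Height > 70)
```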
Notice how the dimensions printed at the top of the tibble changed from
# A tibble: 31 x 4 to
# A tibble: 25 x 4 since we eliminated several rows.
Let’s say you are interested in how the Height of a tree is related to whether
or not it is a small tree. In this case you only want to look at the columns
Height and Small_Tree, and you can keep only these columns using select():
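A sketch of the column selection with select(), assuming the same mutate() and filter() steps described so far:

```r
library(tidyverse)

trees <- as_tibble(trees)

# Keep only the Height and Small_Tree columns
trees %>%
  mutate(Small_Tree = Volume < 19) %>%
  filter(Height > 70) %>%
  select(Height, Small_Tree)
```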
Now say you want to compare the Height of small trees and larger trees. You can
use the group_by() function to specify groups within a column. As you can
see, it does not do much on its own:
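A sketch of the grouping step, assuming the pipeline described so far. The rows come back unchanged; the only visible difference should be an extra line in the printed header naming the grouping column:

```r
library(tidyverse)

trees <- as_tibble(trees)

# Mark Small_Tree as the grouping column; the data itself is not altered
trees %>%
  mutate(Small_Tree = Volume < 19) %>%
  filter(Height > 70) %>%
  select(Height, Small_Tree) %>%
  group_by(Small_Tree)
```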
To actually make a grouped calculation we need to use
summarize(). In the
example below we are going to calculate the mean Height for each group, and
we will assign one value per group into a new column called Avg_Height. Finally,
always ungroup() once you are finished with a grouped calculation. You should
think of group_by() and ungroup() like open and closed parentheses. You
would not have one without the other.
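Putting the whole pipeline together might look like the sketch below. It assumes the thresholds stated earlier (Volume less than 19, Height greater than 70) and uses mean() for the average:

```r
library(tidyverse)

trees <- as_tibble(trees)

trees %>%
  mutate(Small_Tree = Volume < 19) %>%       # flag small trees
  filter(Height > 70) %>%                    # keep the taller trees
  select(Height, Small_Tree) %>%             # keep two columns
  group_by(Small_Tree) %>%                   # open the grouping
  summarize(Avg_Height = mean(Height)) %>%   # one row per group
  ungroup()                                  # close the grouping
```

The result has one row per value of Small_Tree, and the closing ungroup() ensures that any later steps operate on the whole tibble rather than group by group.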
This is just the snowflake at the tip of the iceberg of the Tidyverse. If you are interested in learning more I highly recommend the learning resources on the Tidyverse website.