Scatter Plots

R is known for its graphics, and producing plots quickly is a wonderful way to explore a dataset. Plots are generated by providing data to plotting functions. The first plotting function we’ll discuss is plot(). The plot() function can take many different arguments, but most of them have default values.

The simplest way to use plot() is to provide the function with one argument, a numeric vector:

plot(c(6, 10, 15, 21, 28))

How easy is that!? As you can see, the index of each element of the vector is plotted on the x-axis and the value of each element is plotted on the y-axis. If you provide two different vectors as arguments to plot(), the first vector will determine the x-coordinates and the second vector will determine the y-coordinates:

# Let's make a data frame with some fake average temperature data
avg_temp <- data.frame(
  baltimore = c(25.5, 24.5, 22.5, 25.9, 16.2, 31.9, 30.0, 24.6, 29.8, 23.7),
  chicago = c(22.3, 27.0, 22.5, 22.3, 30.1, 29.0, 26.2, 24.5, 23.4, 28.0),
  date = 14:23
)

plot(avg_temp$date, avg_temp$baltimore)
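To make the link between the one-vector and two-vector forms concrete, here is a minimal sketch. The variable names values and index are chosen here for illustration. Passing a single vector should place the points at the same coordinates as passing the positions 1, 2, 3, ... as the x values yourself; only the default axis labels differ.

```r
values <- c(6, 10, 15, 21, 28)

# seq_along() gives the positions 1, 2, ..., length(values)
index <- seq_along(values)

# These two calls place the points at the same coordinates
plot(values)
plot(index, values)
```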

Let’s make this plot look a little nicer. First we’ll add labels for the x and y axes by specifying the arguments xlab and ylab:

plot(avg_temp$date, avg_temp$baltimore, xlab = "Date", ylab = "Temperature")

We can add a title to the plot by using the main argument:

plot(avg_temp$date, avg_temp$baltimore, xlab = "Date", ylab = "Temperature",
     main = "Average Temperature in Baltimore Over 10 Days")

Line Plots

Since we’re trying to see how the average temperature changed over time, perhaps this would look better as a line graph. We can turn this into a line graph by specifying the type argument:

plot(avg_temp$date, avg_temp$baltimore, xlab = "Date", ylab = "Temperature",
     main = "Average Temperature in Baltimore Over 10 Days", type = "l")
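The type argument accepts other values besides "l". The default, "p", draws points only, and "b" draws both points and lines. As a small sketch, with the same invented Baltimore temperatures copied into standalone vectors so the snippet runs on its own:

```r
# Same invented Baltimore temperatures as in the data frame above
date <- 14:23
baltimore <- c(25.5, 24.5, 22.5, 25.9, 16.2, 31.9, 30.0, 24.6, 29.8, 23.7)

# type = "b" draws both the points and the line segments between them
plot(date, baltimore, xlab = "Date", ylab = "Temperature",
     main = "Average Temperature in Baltimore Over 10 Days", type = "b")
```

Showing the points can be helpful with only ten observations, since it makes clear where the actual measurements are.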

We made some fake data for temperatures in Chicago too, so let’s add those values to this plot. We can do this using the lines() function, which allows us to layer a line on top of a graph we’ve already created. Notice that the lines() function takes its x and y coordinates the same way the plot() function does:

plot(avg_temp$date, avg_temp$baltimore, xlab = "Date", ylab = "Temperature",
     main = "Average Temperature Over 10 Days", type = "l")
lines(avg_temp$date, avg_temp$chicago)

Uh-oh, there’s no way to differentiate the two lines! First let’s use the col argument to color the line for temperatures in Chicago:

plot(avg_temp$date, avg_temp$baltimore, xlab = "Date", ylab = "Temperature",
     main = "Average Temperature Over 10 Days", type = "l")
lines(avg_temp$date, avg_temp$chicago, col = "red")

That’s much better! Now let’s add a legend so we know which line corresponds to which city. We can layer on a legend using the legend() function. The legend() function can take many arguments, but most of them have defaults. Each argument below has a comment which explains its purpose.

plot(avg_temp$date, avg_temp$baltimore, xlab = "Date", ylab = "Temperature",
     main = "Average Temperature Over 10 Days", type = "l")
lines(avg_temp$date, avg_temp$chicago, col = "red")
legend("bottomleft",              # specifies the location of the legend
       c("Baltimore", "Chicago"), # adds strings to the legend
       lty = 1,                   # draws a line next to each string
       col = c("black", "red"),   # colors each line
       inset = c(.02, .02))       # moves the legend away from each axis a little
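Two refinements are worth sketching here; both are suggestions rather than part of the example above. First, plot() picks the y-axis range from the data it is given, so a line added later with lines() is clipped if it falls outside that range. The Chicago values happen to fit inside the Baltimore range, but computing ylim from both series removes the risk. Second, giving the second line a different line type with lty keeps the cities distinguishable when color is not available. The name y_range is made up for this sketch.

```r
date <- 14:23
baltimore <- c(25.5, 24.5, 22.5, 25.9, 16.2, 31.9, 30.0, 24.6, 29.8, 23.7)
chicago <- c(22.3, 27.0, 22.5, 22.3, 30.1, 29.0, 26.2, 24.5, 23.4, 28.0)

# Compute one y-axis range that covers both cities
y_range <- range(c(baltimore, chicago))

plot(date, baltimore, type = "l", ylim = y_range,
     xlab = "Date", ylab = "Temperature",
     main = "Average Temperature Over 10 Days")

# lty = 2 draws a dashed line, so the cities differ even without color
lines(date, chicago, col = "red", lty = 2)

# The legend repeats the same line types and colors, in the same order
legend("bottomleft", c("Baltimore", "Chicago"),
       lty = c(1, 2), col = c("black", "red"), inset = c(.02, .02))
```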

Histograms

Histograms are great for visualizing counts and examining how frequencies of events are distributed. We can use the hist() function to make histograms. The hist() function can take many parameters, but like plot(), we can get away with just providing a single vector as an argument:

hist(c(2, 2, 4, 4, 4, 9, 9, 9))
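If you want to see the numbers behind the bars, hist() also returns an object describing the histogram. The sketch below uses plot = FALSE to skip the drawing and inspects the bin edges and counts; the variable names x and h are chosen here for illustration.

```r
x <- c(2, 2, 4, 4, 4, 9, 9, 9)

# plot = FALSE skips the drawing and just returns the histogram object
h <- hist(x, plot = FALSE)

h$breaks  # the bin edges R chose
h$counts  # how many values fell into each bin
```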

It’s difficult to show the usefulness of hist() using fake data like this, so let’s pull in some of the Lego set data we worked with in lecture 2 so we can visualize how the number of minifigures per set is distributed:

legos <- read.csv("https://raw.githubusercontent.com/seankross/lego/master/data-tidy/legosets.csv", stringsAsFactors = FALSE)

hist(legos$Minifigures)

We can specify many of the same arguments when using hist() as we did when using plot(), like xlab for labelling the x-axis and main for adding a title.

hist(legos$Minifigures, xlab = "Number of Minifigs", 
     main = "Lego Minifigures Per Set")

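If we want a more granular look at the number of minifigures per set, we can increase the number of bars in the plot by adding the breaks argument. When breaks is a single number, R treats it as a suggestion and may choose a slightly different number of bars so that the bin edges fall on round values: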

hist(legos$Minifigures, xlab = "Number of Minifigs", 
     main = "Lego Minifigures Per Set", breaks = 30)
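When exact bins matter, breaks can also be given as a vector of bin edges, which hist() uses as-is. Here is a minimal sketch using a small made-up vector of counts (not the Lego data) with one bin per whole number; the names minifigs and edges are hypothetical.

```r
# Hypothetical minifigure counts, made up for illustration
minifigs <- c(0, 0, 1, 2, 2, 3, 4, 4, 5, 7, 8, 12)

# Edges at -0.5, 0.5, 1.5, ... give one bin per whole number
edges <- seq(-0.5, max(minifigs) + 0.5, by = 1)

hist(minifigs, breaks = edges, xlab = "Number of Minifigs",
     main = "One Bar Per Minifigure Count")
```

The edges must span the whole range of the data, otherwise hist() stops with an error.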

Boxplots

Another type of plot for examining distributions and identifying outliers is a boxplot. Let’s make a simple boxplot using the Baltimore temperature data from before:

boxplot(avg_temp$baltimore)

By providing two vectors, we can see two boxplots next to each other:

boxplot(avg_temp$baltimore, avg_temp$chicago)

As in previous examples, we can add names, axis labels, and a title:

boxplot(avg_temp$baltimore, avg_temp$chicago, 
        names = c("Baltimore", "Chicago"), 
        main = "Temperature in Two Cities Over 10 Days",
        ylab = "Temperature")
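boxplot() also accepts a data frame, in which case each column gets its own box and the column names are used as labels. As a sketch, with the two temperature columns copied into a standalone data frame called temps (a name chosen here for illustration):

```r
temps <- data.frame(
  baltimore = c(25.5, 24.5, 22.5, 25.9, 16.2, 31.9, 30.0, 24.6, 29.8, 23.7),
  chicago = c(22.3, 27.0, 22.5, 22.3, 30.1, 29.0, 26.2, 24.5, 23.4, 28.0)
)

# Each column becomes one box, labeled with its column name
boxplot(temps, main = "Temperature in Two Cities Over 10 Days",
        ylab = "Temperature")
```

This saves typing when there are many columns, at the cost of using the raw column names as labels unless names is supplied.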

Bar Plots

#
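barplot() draws one bar per element of a numeric vector, with each element's value as the height of its bar. As a minimal sketch, reusing the invented Baltimore temperatures (a choice made here for illustration, since this section does not name a dataset), the names.arg argument puts a label under each bar:

```r
# Invented Baltimore temperatures from the data frame above
date <- 14:23
baltimore <- c(25.5, 24.5, 22.5, 25.9, 16.2, 31.9, 30.0, 24.6, 29.8, 23.7)

# names.arg puts a label under each bar
barplot(baltimore, names.arg = date, xlab = "Date", ylab = "Temperature",
        main = "Average Temperature in Baltimore Over 10 Days")
```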

Multiple Plots

You can produce a figure containing multiple plots using the par() function. To make two plots side by side, use par(mfrow=c(1,2)):

par(mfrow=c(1,2))
plot(1:10)
hist((1:10)^2)

To put four plots in one figure, use par(mfrow=c(2,2)):

par(mfrow=c(2,2))
plot(1:10)
hist((1:10)^2)
boxplot(1:10)
barplot(1:10)

To return to making one plot at a time, use par(mfrow=c(1,1)):

par(mfrow=c(1,1))
plot(1:10)
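Resetting to c(1,1) assumes that was the layout before you started. A slightly more defensive pattern, sketched below, relies on the fact that par() returns the previous values of any settings it changes, so they can be saved and restored later. The name old_par is made up for this sketch.

```r
# par() returns the previous settings when it changes them
old_par <- par(mfrow = c(1, 2))

plot(1:10)
hist((1:10)^2)

# Restore whatever layout was in effect before
par(old_par)
plot(1:10)
```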

