Data visualisation with ggplot2

Data cleaning and preparing for plotting
Plotting with ggplot2
Modifying plots
Boxplot
Plotting time series data
Faceting

Authors: Mateusz Kuzak, Diana Marek, Hedi Peterson

Disclaimer

We will be using functions from the ggplot2 package. Base R has basic plotting capabilities, but ggplot2 adds more powerful ones.

Learning Objectives

Visualise some of the mammals data from the Figshare file surveys.csv

Understand how to plot these data using the R package ggplot2. For more details on using ggplot2, see the official documentation.

Building complex plots step by step with the ggplot2 package

Load required packages

# plotting package
library(ggplot2)

#> Loading required package: methods

# piping / chaining
library(magrittr)
# modern dataframe manipulations
library(dplyr)

#> 
#> Attaching package: 'dplyr'
#> 
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> 
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Load the data. The code below reads a local copy of the Figshare file, saved as data/surveys.csv.

surveys_raw <- read.csv("data/surveys.csv")

surveys.csv data contains some measurements of the animals caught in plots.

Data cleaning and preparing for plotting

Let’s look at the summary

summary(surveys_raw)

There are a few things we need to clean in the dataset.

Some records are missing species_id. Let’s remove those.

surveys <- surveys_raw %>%
           filter(species_id != "")

Many species have low counts. Let’s remove those with fewer than 10 records.

# count records per species
species_counts <- surveys %>%
                  group_by(species_id) %>%
                  tally()

# get names of those frequent species
frequent_species <- species_counts %>%
                    filter(n >= 10) %>%
                    select(species_id)

surveys <- surveys %>%
           filter(species_id %in% frequent_species$species_id)
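
The count-then-filter steps above can also be written as a single grouped filter. This is only a sketch on a made-up data frame: the `toy` data and the threshold of 2 are assumptions for illustration, and `n()` is dplyr's helper that returns the size of the current group.

```r
library(dplyr)

# toy stand-in for the surveys data (hypothetical values)
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "NL", "NL"),
  weight     = c(40, 42, 44, 8, 150, 160)
)

# keep only species with at least 2 records;
# n() is evaluated separately for each species_id group
toy_frequent <- toy %>%
  group_by(species_id) %>%
  filter(n() >= 2) %>%
  ungroup()

unique(toy_frequent$species_id)
```

With the real data, the same pattern would use `surveys` and `n() >= 10`.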

The summary showed NAs in weight and hindfoot_length. Let’s remove rows with missing weights.

surveys_weight_present <- surveys %>%
                      filter(!is.na(weight))

Challenge

Do the same to remove rows without hindfoot_length. Save the result in a new dataframe.

surveys_length_present <- surveys %>%
                      filter(!is.na(hindfoot_length))

How would you get a dataframe with no missing values in either weight or hindfoot_length?

surveys_complete <- surveys_weight_present %>%
                    filter(!is.na(hindfoot_length))

We can chain the filtering steps together using the pipe operator (%>%) introduced earlier.

surveys_complete <- surveys %>%
                    filter(!is.na(weight)) %>%
                    filter(!is.na(hindfoot_length))
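
As a possible shorthand for the chained filters above, both conditions can go into one `filter()` call, where comma-separated conditions are combined with AND. A small sketch on a made-up data frame (the `toy` values are hypothetical):

```r
library(dplyr)

# toy data with one missing weight and one missing hindfoot_length
toy <- data.frame(
  weight          = c(40, NA, 44, 8),
  hindfoot_length = c(35, 36, NA, 16)
)

# a row is kept only if both conditions are TRUE
toy_complete <- toy %>%
  filter(!is.na(weight), !is.na(hindfoot_length))

nrow(toy_complete)
```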

Make a simple scatter plot of hindfoot_length (in millimeters) as a function of weight (in grams), using base R plotting capabilities.

plot(x=surveys_complete$weight, y=surveys_complete$hindfoot_length)

Plotting with ggplot2

We will make the same plot using the ggplot2 package.

ggplot2 is a plotting package that makes it simple to create complex plots from data in a dataframe. Its default settings help create publication-quality plots with a minimal amount of tweaking.

With ggplot, graphics are built step by step by adding new elements.

To build a ggplot we need to:

bind plot to a specific data frame

ggplot(surveys_complete)

define aesthetics (aes), which map variables in the data to axes on the plot or to plotting size, shape, color, etc.,

ggplot(surveys_complete, aes(x = weight, y = hindfoot_length))

add geoms – graphical representation of the data in the plot (points, lines, bars). To add a geom to the plot use + operator:

ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point()
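
Because a plot is built by adding elements, the intermediate steps listed above can be stored in variables and reused. The sketch below uses a made-up data frame; the names `toy`, `base_plot` and `scatter` are hypothetical.

```r
library(ggplot2)

# toy stand-in for surveys_complete (hypothetical values)
toy <- data.frame(
  weight          = c(40, 42, 8, 150),
  hindfoot_length = c(35, 36, 16, 32)
)

# steps 1 and 2: bind the data and define the aesthetics; no geom yet
base_plot <- ggplot(toy, aes(x = weight, y = hindfoot_length))

# step 3: adding a geom returns a new plot object and leaves base_plot unchanged
scatter <- base_plot + geom_point()

# printing the object draws the plot
scatter
```

Keeping `base_plot` around makes it easy to try different geoms on the same data and mapping.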

Modifying plots

adding transparency (alpha)

ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha=0.1)

adding colors

ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha=0.1, color="blue")
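
In the examples above, alpha and color are fixed values set outside `aes()`, so every point looks the same. Color can also be mapped to a variable inside `aes()`, in which case ggplot2 picks one color per value and adds a legend. A minimal sketch, assuming a made-up data frame `toy`:

```r
library(ggplot2)

# toy stand-in for surveys_complete (hypothetical values)
toy <- data.frame(
  species_id      = c("DM", "DM", "PF", "PF", "NL", "NL"),
  weight          = c(40, 44, 7, 8, 150, 160),
  hindfoot_length = c(35, 36, 15, 16, 32, 33)
)

# color is mapped to species_id (inside aes), alpha stays a fixed setting
colored <- ggplot(toy, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = species_id), alpha = 0.5)

colored
```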

Boxplot

Visualising the distribution of weight within each species.

ggplot(surveys_weight_present, aes(factor(species_id), weight)) +
                   geom_boxplot()

By adding points to the boxplot, we can see individual measurements and how many of them there are.

ggplot(surveys_weight_present, aes(factor(species_id), weight)) +
                   geom_jitter(alpha=0.3, color="tomato") +
                   geom_boxplot(alpha=0)

Challenge

Create a boxplot for hindfoot_length.
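
One possible answer follows the weight boxplot and swaps in hindfoot_length. The sketch runs on a made-up data frame, `toy_length_present`, which stands in for `surveys_length_present`; the values are hypothetical.

```r
library(ggplot2)

# toy stand-in for surveys_length_present (hypothetical values)
toy_length_present <- data.frame(
  species_id      = rep(c("DM", "PF"), each = 5),
  hindfoot_length = c(34, 35, 36, 37, 35, 14, 15, 16, 15, 17)
)

# same structure as the weight boxplot, with hindfoot_length on the y axis
length_boxplot <- ggplot(toy_length_present,
                         aes(factor(species_id), hindfoot_length)) +
  geom_boxplot()

length_boxplot
```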

Plotting time series data

Let’s calculate the number of records per year for each species. To do that, we need to group the data first and count the records within each group.

yearly_counts <- surveys %>%
                 group_by(year, species_id) %>%
                 tally()
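
Grouping and tallying can also be done in one step with dplyr's `count()`. The sketch below compares both forms on a made-up data frame (`toy` and its values are hypothetical); the counts should agree.

```r
library(dplyr)

# toy stand-in for the surveys data (hypothetical values)
toy <- data.frame(
  year       = c(1990, 1990, 1990, 1991, 1991),
  species_id = c("DM", "DM", "PF", "DM", "PF")
)

# two-step form, as in the lesson
counts_tally <- toy %>%
  group_by(year, species_id) %>%
  tally() %>%
  ungroup()

# one-step form: count() groups by the given columns and counts rows
counts_count <- toy %>%
  count(year, species_id)

counts_count
```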

Time series data can be visualised as a line plot with years on the x axis and counts on the y axis.

ggplot(yearly_counts, aes(x=year, y=n)) +
                  geom_line()

Unfortunately this does not work, because the data for all species are plotted together as a single line. We need to tell ggplot to split the plotted data by species_id:

ggplot(yearly_counts, aes(x=year, y=n, group=species_id)) +
  geom_line()

We will be able to distinguish species in the plot if we add colors.

ggplot(yearly_counts, aes(x=year, y=n, group=species_id, color=species_id)) +
  geom_line()

Faceting

ggplot has a special technique called faceting that allows us to split one plot into multiple plots based on some factor. We will use it to plot one time series for each species separately.

ggplot(yearly_counts, aes(x=year, y=n, color=species_id)) +
  geom_line() + facet_wrap(~species_id)
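
By default all panels share the same axes, which can flatten the lines of rare species. As a sketch of one option, `facet_wrap()` accepts `scales = "free_y"` to give each panel its own y axis and `ncol` to set the number of panel columns. The data frame `toy_counts` and its numbers are made up for illustration.

```r
library(ggplot2)

# toy yearly counts (hypothetical numbers)
toy_counts <- data.frame(
  year       = rep(1990:1992, times = 2),
  species_id = rep(c("DM", "PF"), each = 3),
  n          = c(100, 120, 110, 5, 8, 6)
)

# one panel per species, stacked in a single column,
# each with its own y axis range
faceted <- ggplot(toy_counts, aes(x = year, y = n)) +
  geom_line() +
  facet_wrap(~ species_id, ncol = 1, scales = "free_y")

faceted
```

Free scales make each panel easier to read, but counts can no longer be compared across panels by eye.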

Now we would like to split the line in each plot by the sex of each individual measured. To do that, we need to count the records in the dataframe grouped by sex.

Challenges:

filter the dataframe so that we only keep records with sex “F” or “M”

sex_values <- c("F", "M")
surveys <- surveys %>%
           filter(sex %in% sex_values)

group by year, species_id, sex

yearly_sex_counts <- surveys %>%
                     group_by(year, species_id, sex) %>%
                     tally()

make the faceted plot, splitting further by sex (within a single plot)

ggplot(yearly_sex_counts, aes(x=year, y=n, color=species_id, group=sex)) +
  geom_line() + facet_wrap(~ species_id)

#> geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

We can improve the plot by coloring by sex instead of species (species are already in separate plots, so we don’t need to distinguish them further).

ggplot(yearly_sex_counts, aes(x=year, y=n, color=sex, group=sex)) +
  geom_line() + facet_wrap(~ species_id)

#> geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
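
The geom_path message above most likely means that some species and sex combinations have a count for only one year, so there is nothing to connect with a line. A sketch for finding such groups, using a made-up data frame (`toy_counts` and the column name `n_years` are hypothetical):

```r
library(dplyr)

# toy stand-in for yearly_sex_counts (hypothetical numbers)
toy_counts <- data.frame(
  year       = c(1990, 1991, 1992, 1990),
  species_id = c("DM", "DM", "DM", "PF"),
  sex        = c("F", "F", "F", "M"),
  n          = c(5, 7, 6, 1)
)

# count how many yearly rows each species/sex combination has,
# then keep the combinations with a single row
single_obs <- toy_counts %>%
  count(species_id, sex, name = "n_years") %>%
  filter(n_years == 1)

single_obs
```

Whether to drop such groups or leave them in is a judgment call that depends on the analysis.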
