Authors: Mateusz Kuzak, Diana Marek, Hedi Peterson
Here we will use functions from the ggplot2 package. Base R has basic plotting capabilities, but ggplot2 adds more powerful ones.
Learning Objectives
- Visualise some of the mammals data from Figshare surveys.csv
- Understand how to plot these data using the R ggplot2 package. For more details on using ggplot2 see the official documentation.
- Build complex plots step by step with the ggplot2 package
Load required packages
# plotting package
library(ggplot2)
#> Loading required package: methods
# piping / chaining
library(magrittr)
# modern dataframe manipulations
library(dplyr)
#>
#> Attaching package: 'dplyr'
#>
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#>
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Load the data from a local copy of the Figshare file.
surveys_raw <- read.csv("data/surveys.csv")
The surveys.csv data contains some measurements of the animals caught in plots.
Let’s look at the summary
summary(surveys_raw)
There are a few things we need to clean in the dataset.
Some records are missing species_id. Let’s remove those.
surveys <- surveys_raw %>%
filter(species_id != "")
Many species have low counts. Let’s remove the ones with fewer than 10 records.
# count records per species
species_counts <- surveys %>%
group_by(species_id) %>%
tally
# get names of those frequent species
frequent_species <- species_counts %>%
filter(n >= 10) %>%
select(species_id)
surveys <- surveys %>%
filter(species_id %in% frequent_species$species_id)
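The same count-then-filter idea can be tried on a small made-up data frame. This is only a sketch: the `toy` data, its values, and the threshold of 2 are assumptions chosen for illustration and are not taken from surveys.csv.

```r
library(dplyr)

# Hypothetical toy data: three species with 3, 2 and 1 records
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PF", "PF", "OT"),
  weight     = c(40, 42, 41, 8, 7, 24),
  stringsAsFactors = FALSE
)

# count records per species
toy_counts <- toy %>%
  group_by(species_id) %>%
  tally()

# keep species with at least 2 records (threshold chosen for the toy data)
toy_frequent <- toy_counts %>%
  filter(n >= 2)

# keep only the rows that belong to those frequent species
toy_filtered <- toy %>%
  filter(species_id %in% toy_frequent$species_id)

print(toy_filtered)
```

Working on a tiny data frame like this makes it easy to check by eye that the rare species is the only one removed.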
The summary showed that there are NAs in weight and hindfoot_length. Let’s remove rows with missing weights.
surveys_weight_present <- surveys %>%
filter(!is.na(weight))
Challenge
- Do the same to remove rows without hindfoot_length. Save the results in a new dataframe.
surveys_length_present <- surveys %>%
filter(!is.na(hindfoot_length))
surveys_complete <- surveys_weight_present %>%
filter(!is.na(hindfoot_length))
We can chain filtering together using the pipe operator (%>%) introduced earlier.
surveys_complete <- surveys %>%
filter(!is.na(weight)) %>%
filter(!is.na(hindfoot_length))
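As a small sketch of how chaining behaves, the example below compares two chained filter() calls with a single filter() call that receives both conditions. The `toy` data frame is made up for illustration. Passing several conditions to one filter() call is a dplyr feature that the lesson itself does not use.

```r
library(dplyr)

# Hypothetical toy data with missing values in both columns
toy <- data.frame(
  weight          = c(40, NA, 8, 24, NA),
  hindfoot_length = c(35, 16, NA, 21, NA)
)

# two chained filters, as in the lesson
chained <- toy %>%
  filter(!is.na(weight)) %>%
  filter(!is.na(hindfoot_length))

# one filter with both conditions; filter() combines them with a logical AND
combined <- toy %>%
  filter(!is.na(weight), !is.na(hindfoot_length))

# compare the two results
all.equal(chained, combined)
```

Either form keeps only the rows where both measurements are present, so the choice between them is mostly a matter of readability.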
Make a simple scatter plot of hindfoot_length (in millimeters) as a function of weight (in grams), using basic R plotting capabilities.
plot(x=surveys_complete$weight, y=surveys_complete$hindfoot_length)
We will make the same plot using the ggplot2 package.
ggplot2 is a plotting package that makes it simple to create complex plots from data in a dataframe. It uses default settings, which help create publication-quality plots with a minimal amount of settings and tweaking.
With ggplot, graphics are built step by step by adding new elements.
To build a ggplot we need to:
- bind the plot to a specific data frame:
ggplot(surveys_complete)
- define aesthetics (aes), which map variables in the data to axes on the plot or to plotting size, shape, color, etc.:
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length))
- add geoms, the graphical representation of the data in the plot (points, lines, bars). To add a geom to the plot use the + operator:
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point()
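The steps above can also be carried out one at a time by storing the plot in a variable and adding to it. The sketch below uses a made-up `toy` data frame in place of surveys_complete. The variable names `base_plot` and `scatter_plot` are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data standing in for surveys_complete
toy <- data.frame(
  weight          = c(40, 42, 8, 7, 24),
  hindfoot_length = c(35, 36, 16, 15, 21)
)

# bind the data and define the aesthetics; no geom yet, so nothing is drawn
base_plot <- ggplot(toy, aes(x = weight, y = hindfoot_length))

# add a geom with the + operator
scatter_plot <- base_plot + geom_point()

# printing the object draws the plot
print(scatter_plot)
```

Keeping the base plot in a variable makes it easy to try different geoms on the same data and aesthetics.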
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point(alpha=0.1)
ggplot(surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point(alpha=0.1, color="blue")
Visualising the distribution of weight within each species.
ggplot(surveys_weight_present, aes(factor(species_id), weight)) +
geom_boxplot()
By adding points to the boxplot, we can see individual measurements and how many measurements there are.
ggplot(surveys_weight_present, aes(factor(species_id), weight)) +
geom_jitter(alpha=0.3, color="tomato") +
geom_boxplot(alpha=0)
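Layers are drawn in the order in which they are added, so here the jittered points go down first and the boxplot, with a fully transparent fill, is drawn on top. A minimal sketch on made-up data (the `toy` data frame is an assumption for illustration):

```r
library(ggplot2)

# Hypothetical toy data: five weights for each of two species
toy <- data.frame(
  species_id = rep(c("DM", "PF"), each = 5),
  weight     = c(40, 42, 41, 44, 39, 8, 7, 9, 8, 6)
)

p <- ggplot(toy, aes(factor(species_id), weight)) +
  geom_jitter(alpha = 0.3, color = "tomato") +  # first layer: raw points
  geom_boxplot(alpha = 0)                       # second layer: boxes with transparent fill

print(p)

# layer_data() shows what each layer draws:
# one row per point for the jitter layer, one row per box for the boxplot layer
nrow(layer_data(p, 1))
nrow(layer_data(p, 2))
```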
Challenge
Create a boxplot for hindfoot_length.
Let’s calculate the number of records per year for each species. To do that we need to group the data first and count the records within each group.
yearly_counts <- surveys %>%
group_by(year, species_id) %>%
tally
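To see what the grouped count produces, here is a sketch on a made-up data frame (the `toy` data is an assumption for illustration). It also shows dplyr's count(), a shortcut that groups and tallies in one call. The lesson itself uses group_by() followed by tally.

```r
library(dplyr)

# Hypothetical toy data: six records over two years and two species
toy <- data.frame(
  year       = c(1977, 1977, 1977, 1978, 1978, 1978),
  species_id = c("DM", "DM", "PF", "DM", "PF", "PF"),
  stringsAsFactors = FALSE
)

# group first, then count the rows in each group
toy_counts <- toy %>%
  group_by(year, species_id) %>%
  tally()

# count() groups and tallies in a single step
toy_counts_short <- toy %>%
  count(year, species_id)

print(toy_counts)
```

In both cases the result has one row per year and species combination, with the number of records in a column called n.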
Time series data can be visualised as a line plot with years on the x axis and counts on the y axis.
ggplot(yearly_counts, aes(x=year, y=n)) +
geom_line()
Unfortunately this does not work, because we plot data for all the species together. We need to tell ggplot to split the graphed data by species_id.
ggplot(yearly_counts, aes(x=year, y=n, group=species_id)) +
geom_line()
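The effect of the group aesthetic can be inspected without looking at the picture. The sketch below uses made-up counts (the `toy_counts` data frame is an assumption for illustration) and ggplot2's layer_data() function to see how many groups each plot ends up with.

```r
library(ggplot2)

# Hypothetical toy counts for two species over three years
toy_counts <- data.frame(
  year       = c(1977, 1978, 1979, 1977, 1978, 1979),
  species_id = c("DM", "DM", "DM", "PF", "PF", "PF"),
  n          = c(10, 12, 15, 3, 5, 4),
  stringsAsFactors = FALSE
)

# without a group aesthetic all rows are joined into a single line
p_single <- ggplot(toy_counts, aes(x = year, y = n)) +
  geom_line()

# with group = species_id each species gets its own line
p_grouped <- ggplot(toy_counts, aes(x = year, y = n, group = species_id)) +
  geom_line()

# layer_data() returns the data computed for a layer,
# including the group each row was assigned to
length(unique(layer_data(p_single)$group))
length(unique(layer_data(p_grouped)$group))
```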
We will be able to distinguish species in the plot if we add colors.
ggplot(yearly_counts, aes(x=year, y=n, group=species_id, color=species_id)) +
geom_line()
ggplot has a special technique called faceting that allows us to split one plot into multiple plots based on some factor. We will use it to plot one time series for each species separately.
ggplot(yearly_counts, aes(x=year, y=n, color=species_id)) +
geom_line() + facet_wrap(~species_id)
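As a sketch of what faceting does to the data, the example below facets made-up counts by species and then checks how many panels were created. The `toy_counts` data frame is an assumption for illustration.

```r
library(ggplot2)

# Hypothetical toy counts for two species over three years
toy_counts <- data.frame(
  year       = c(1977, 1978, 1979, 1977, 1978, 1979),
  species_id = c("DM", "DM", "DM", "PF", "PF", "PF"),
  n          = c(10, 12, 15, 3, 5, 4),
  stringsAsFactors = FALSE
)

# one panel per value of species_id
p_facets <- ggplot(toy_counts, aes(x = year, y = n, color = species_id)) +
  geom_line() +
  facet_wrap(~species_id)

print(p_facets)

# each row of the layer data is assigned to a panel
length(unique(layer_data(p_facets)$PANEL))
```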
Now we would like to split the line in each plot by the sex of each individual measured. To do that we need to make counts in the dataframe grouped by sex.
Challenges:
- filter the dataframe so that we only keep records with sex “F” or “M”
sex_values <- c("F", "M")
surveys <- surveys %>%
filter(sex %in% sex_values)
- group by year, species_id, sex
yearly_sex_counts <- surveys %>%
group_by(year, species_id, sex) %>%
tally
- make the faceted plot, splitting further by sex (within a single plot)
ggplot(yearly_sex_counts, aes(x=year, y=n, color=species_id, group=sex)) +
geom_line() + facet_wrap(~ species_id)
#> geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
We can improve the plot by coloring by sex instead of species (species are already in separate plots, so we don’t need to distinguish them further).
ggplot(yearly_sex_counts, aes(x=year, y=n, color=sex, group=sex)) +
geom_line() + facet_wrap(~ species_id)
#> geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
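The message printed with the plot suggests that in some panels each line has only a single point. One possible way to look for such cases is to count how many years each species and sex combination has. The sketch below does this on made-up counts. The `toy_sex_counts` data frame and the names `years_per_group` and `single_year` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy counts: species "OT" was seen in one year only
toy_sex_counts <- data.frame(
  year       = c(1977, 1978, 1977, 1978, 1977, 1977),
  species_id = c("DM", "DM", "DM", "DM", "OT", "OT"),
  sex        = c("F", "F", "M", "M", "F", "M"),
  n          = c(5, 6, 4, 7, 1, 2),
  stringsAsFactors = FALSE
)

# number of yearly counts available for each species and sex
years_per_group <- toy_sex_counts %>%
  count(species_id, sex, name = "n_years")

# combinations with a single year have only one point, so no line segment
single_year <- years_per_group %>%
  filter(n_years == 1)

print(single_year)
```

The same check could be run on yearly_sex_counts to see which species have too few years of data for a line.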