Hands-On Session: Math Camp Recap
In this hands-on session, we walk through some fundamental data manipulation tasks in R using the starwars dataset from the dplyr package, covering essential functions for summarizing, filtering, arranging, and visualizing data. Throughout the session, we will practice different ways to compute summary statistics, filter data, and generate plots, applying both base R and tidyverse syntax. The goal is to consolidate what you learned during Math Camp while expanding on the key techniques for data analysis.
We start by loading the necessary libraries, specifically tidyverse, which offers a collection of R packages designed for data science. These packages allow us to manipulate and visualize data in a clear and concise manner.
library(tidyverse)
Next, let’s calculate the total height of all characters in the dataset. We use the base R function sum to compute the sum of the height variable. The na.rm = TRUE argument excludes missing values from the summation. Don’t worry about whether the output has any intrinsic meaning; we are doing this purely for instructional purposes.
total_height <- sum(starwars$height, na.rm = TRUE)
total_height
[1] 14143
After obtaining the total height, we move to calculating the average height using the base R function mean. Again, the na.rm = TRUE argument ensures that missing values do not affect the calculation.
avg_height <- mean(starwars$height, na.rm = TRUE)
avg_height
[1] 174.6049
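To see why na.rm = TRUE matters for both sum and mean, here is a minimal sketch on a made-up vector. The name heights_toy and its values are hypothetical and are not taken from starwars.

```r
# Toy vector with one missing value (hypothetical data, not from starwars)
heights_toy <- c(172, NA, 96)

# Without na.rm, a single missing value makes the whole result NA
sum(heights_toy)   # NA
mean(heights_toy)  # NA

# With na.rm = TRUE, the NA is dropped before computing
total_toy <- sum(heights_toy, na.rm = TRUE)   # 268
avg_toy   <- mean(heights_toy, na.rm = TRUE)  # 134 (268 / 2 observed values)
```

Note that the mean divides by the number of observed values, not by the length of the vector.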
Now, we use tidyverse syntax to achieve the same result, calling the summarize function within a pipeline to calculate the average height. This approach tends to stay readable as operations become more complex, because each step of the pipeline does one thing.
starwars |>
  summarize(
    avg_height = mean(height, na.rm = TRUE)
  )
# A tibble: 1 × 1
avg_height
<dbl>
1 175.
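One reason the pipeline form scales well is that summarize can compute several statistics in a single call. The sketch below illustrates this on a small made-up tibble; the object names toy and toy_summary and the values are assumptions for illustration only.

```r
library(dplyr)

# Toy data (hypothetical values) with one missing height
toy <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(150, 170, NA, 190)
)

# Several summary statistics in one summarize() call
toy_summary <- toy |>
  summarize(
    n_observed = sum(!is.na(height)),         # count of non-missing heights
    avg_height = mean(height, na.rm = TRUE),  # (150 + 170 + 190) / 3
    max_height = max(height, na.rm = TRUE)
  )

toy_summary
```

The result is a one-row tibble with one column per statistic.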
Next, let’s compute the Body Mass Index (BMI) for each character by applying a transformation to the data. We use the mutate function to create a new variable, bmi, which is derived from the mass and height variables. Then, we select the characters’ names and their respective BMI values for display.
starwars <- starwars |>
  mutate(bmi = mass / (height / 100)^2)

starwars |>
  select(name, bmi)
# A tibble: 87 × 2
name bmi
<chr> <dbl>
1 Luke Skywalker 26.0
2 C-3PO 26.9
3 R2-D2 34.7
4 Darth Vader 33.3
5 Leia Organa 21.8
6 Owen Lars 37.9
7 Beru Whitesun Lars 27.5
8 R5-D4 34.0
9 Biggs Darklighter 25.1
10 Obi-Wan Kenobi 23.2
# ℹ 77 more rows
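The division by 100 is there because height is recorded in centimeters while BMI is defined as kilograms over meters squared. As a small worked check, the sketch below recomputes the first row by hand, using the mass (77) and height (172) shown for Luke Skywalker elsewhere in this session. The intermediate column height_m and the object name luke are introduced here only for illustration.

```r
library(dplyr)

# One-row tibble with the values shown for Luke Skywalker in this session
luke <- tibble(name = "Luke Skywalker", mass = 77, height = 172)

luke <- luke |>
  mutate(
    height_m = height / 100,     # 172 cm -> 1.72 m
    bmi      = mass / height_m^2 # 77 / 1.72^2, roughly 26.0
  )

luke
```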
We can filter the dataset to identify tall characters whose height exceeds 200 cm. The filter function allows us to subset the data based on this condition.
tall_characters <- starwars |>
  filter(height > 200)
tall_characters
# A tibble: 10 × 15
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Darth V… 202 136 none white yellow 41.9 male mascu…
2 Chewbac… 228 112 brown unknown blue 200 male mascu…
3 Roos Ta… 224 82 none grey orange NA male mascu…
4 Rugor N… 206 NA none green orange NA male mascu…
5 Yarael … 264 NA none white yellow NA male mascu…
6 Lama Su 229 88 none grey black NA male mascu…
7 Taun We 213 NA none grey black NA fema… femin…
8 Grievous 216 159 none brown, wh… green, y… NA male mascu…
9 Tarfful 234 136 brown brown blue NA male mascu…
10 Tion Me… 206 80 none grey black NA male mascu…
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, bmi <dbl>
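Two details of filter are worth seeing on a small example: rows where the condition evaluates to NA are dropped, and several conditions separated by commas are combined with a logical AND. The sketch below uses a made-up tibble; all names and values in it are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values) with missing height and mass
toy <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(180, 210, NA, 230),
  mass   = c(80, NA, 70, 120)
)

# C is dropped: NA > 200 is NA, and filter() keeps only TRUE rows
tall <- toy |> filter(height > 200)              # B, D

# Comma-separated conditions must all be TRUE; B has a missing mass
tall_heavy <- toy |> filter(height > 200, mass > 100)  # D

# Keep rows with a missing height explicitly
tall_or_unknown <- toy |> filter(height > 200 | is.na(height))  # B, C, D
```

This explains why a character with a missing mass can still appear among the tall characters: only the height condition is tested there.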
Once we have the subset of tall characters, we sort them in descending order of height using the arrange function, and then display only their names and heights. The desc function ensures that the tallest characters appear at the top.
tall_characters |>
  arrange(desc(height)) |>
  select(name, height)
# A tibble: 10 × 2
name height
<chr> <int>
1 Yarael Poof 264
2 Tarfful 234
3 Lama Su 229
4 Chewbacca 228
5 Roos Tarpals 224
6 Grievous 216
7 Taun We 213
8 Rugor Nass 206
9 Tion Medon 206
10 Darth Vader 202
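Two characters in this output share the same height (206). If a predictable order among ties matters, arrange accepts additional sorting keys. The sketch below is one possible way to do this, using four of the name and height pairs shown above in a hypothetical tibble called tall_toy.

```r
library(dplyr)

# A few rows from the output above, deliberately out of order
tall_toy <- tibble(
  name   = c("Tion Medon", "Darth Vader", "Rugor Nass", "Yarael Poof"),
  height = c(206L, 202L, 206L, 264L)
)

# Sort by height (descending), then by name to break ties
sorted_toy <- tall_toy |>
  arrange(desc(height), name)

sorted_toy
```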
We then perform grouping and summarizing to calculate the average height for each species in the dataset. The group_by function groups the data by species, and summarise (an alternative spelling of summarize) computes the average height within each group.
avg_height <- starwars |>
  group_by(species) |>
  summarise(
    avg_height = mean(height, na.rm = TRUE)
  )
avg_height
# A tibble: 38 × 2
species avg_height
<chr> <dbl>
1 Aleena 79
2 Besalisk 198
3 Cerean 198
4 Chagrian 196
5 Clawdite 168
6 Droid 131.
7 Dug 112
8 Ewok 88
9 Geonosian 183
10 Gungan 209.
# ℹ 28 more rows
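A group average is easier to interpret when you also know how many rows it is based on. As a sketch, the same group_by and summarise pattern can report the group size with n() alongside the mean. The tibble below is made up; its object names and values are assumptions for illustration.

```r
library(dplyr)

# Toy data (hypothetical values), one row per character
toy <- tibble(
  species = c("Human", "Human", "Droid", "Droid", "Ewok"),
  height  = c(172, 150, 167, NA, 88)
)

# One row per species: group size and average height
by_species <- toy |>
  group_by(species) |>
  summarise(
    n          = n(),                        # rows in the group, NAs included
    avg_height = mean(height, na.rm = TRUE)  # mean over observed heights only
  )

by_species
```

Here the Droid average rests on a single observed height even though the group has two rows, which is the kind of detail n() alone does not reveal.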
In another example, we use group_by and mutate to create a new column, mean_grouped, which stores the average height of each character’s species directly within the dataset. Unlike summarise, this does not change the level of observation: the dataset still has one row per character.
starwars |>
  group_by(species) |>
  mutate(
    mean_grouped = mean(height, na.rm = TRUE)
  )
# A tibble: 87 × 16
# Groups: species [38]
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>, bmi <dbl>, mean_grouped <dbl>
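Keeping the group mean next to each row makes it possible to compare individuals with their group. The sketch below shows one such use on a made-up tibble, and also calls ungroup afterwards so that later steps are not silently computed by group. The names toy, toy_with_mean, and diff_from_mean are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values), one row per character
toy <- tibble(
  species = c("Human", "Human", "Droid", "Droid", "Ewok"),
  height  = c(172, 150, 167, NA, 88)
)

toy_with_mean <- toy |>
  group_by(species) |>
  mutate(
    mean_grouped   = mean(height, na.rm = TRUE),  # repeated within each species
    diff_from_mean = height - mean_grouped        # NA where height is missing
  ) |>
  ungroup()  # drop the grouping once it is no longer needed

toy_with_mean
```

The result still has five rows, one per character, whereas summarise on the same groups would return three.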
To find the tallest species, we can arrange the summarized dataset by average height in descending order, extract the top entry, and use the pull function to display only the species name.
avg_height |>
  arrange(desc(avg_height)) |>
  head(1) |>
  pull(species)
[1] "Quermian"
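An alternative worth knowing is dplyr's slice_max, which combines the sorting and the extraction of the top row. It is not the approach used above, and it behaves differently when there are ties: by default it returns every row that shares the maximum, whereas head(1) returns only the first. The sketch below uses a made-up summary table; the species labels and values are hypothetical.

```r
library(dplyr)

# Toy summary table (hypothetical values) with a tie for the maximum
species_toy <- tibble(
  species    = c("X", "Y", "Z"),
  avg_height = c(79, 210, 210)
)

# arrange + head(1): exactly one row, the first among the tied ones
top_one <- species_toy |>
  arrange(desc(avg_height)) |>
  head(1) |>
  pull(species)

# slice_max: all rows tied for the maximum (with_ties = TRUE by default)
top_all <- species_toy |>
  slice_max(avg_height, n = 1) |>
  pull(species)
```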
Finally, let’s use ggplot2 for visualization. First, we create a bar chart to display the tallest characters by height. The ggplot function initializes the plot, and geom_col adds the bars, with black borders and gray fill. We also use labs to label the axes and add a title to the plot, and theme_minimal to apply a clean, minimalistic theme.
tall_characters |>
  ggplot(
    aes(
      x = height,
      y = name
    )
  ) +
  geom_col(color = "black", fill = "grey") +
  labs(
    x = "Height",
    y = NULL,
    title = "Tallest characters in Starwars"
  ) +
  theme_minimal()
In another visualization, we reorder the character names by height using fct_reorder from the forcats package, ensuring that the tallest characters are displayed at the top of the plot. The structure is similar to the previous plot, but here we include reordering to enhance clarity.
tall_characters |>
  mutate(name = fct_reorder(name, height)) |>
  ggplot(
    aes(
      x = height,
      y = name
    )
  ) +
  geom_col(color = "black", fill = "grey") +
  labs(title = "Top 10 tallest characters in this dataset") +
  theme_minimal()
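To see what fct_reorder changes, it helps to look at the factor levels directly, since ggplot2 draws a discrete axis in level order (on a vertical axis, the first level sits at the bottom). The sketch below uses three of the name and height pairs from the tall characters shown earlier; the vector names name_toy and height_toy are hypothetical.

```r
library(forcats)

# Three characters and their heights, as shown earlier in this session
name_toy   <- c("Chewbacca", "Lama Su", "Darth Vader")
height_toy <- c(228, 229, 202)

# Default factor levels are alphabetical
default_levels <- levels(factor(name_toy))
# "Chewbacca" "Darth Vader" "Lama Su"

# fct_reorder() orders the levels by the second argument (ascending)
reordered_levels <- levels(fct_reorder(name_toy, height_toy))
# "Darth Vader" "Chewbacca" "Lama Su"
```

Because the last level is drawn at the top of the vertical axis, ascending order by height puts the tallest character at the top.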
All the best with your first Problem Set!