API209: Summer Math Camp
September 6, 2024
R installed?
Current version 4.4.1
RStudio installed?
I’m on RStudio 2024.04.2+764; this version comes with Quarto already installed.
Have these packages?
tidyverse. For the PSet, you may use the sf package for maps.
So our first chunk (or lines of code) should look like this:
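A minimal sketch of such a first chunk, assuming tidyverse is the only package needed for this session. The commented-out sf line is optional and only relevant if you draw maps for the PSet.

```r
# Load the tidyverse (dplyr, ggplot2, tidyr, ...) at the top of the document.
# Assumes it was installed beforehand with install.packages("tidyverse").
library(tidyverse)

# Optional, only if you plan to draw maps for the PSet:
# library(sf)
```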
We can call the starwars dataset by its name.
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
You already know some functions for checking the data.
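Which functions you have already seen will vary. The sketch below assumes a few common ones from dplyr and base R, applied to starwars.

```r
library(tidyverse)

glimpse(starwars)         # one line per column: name, type, first values
dim(starwars)             # number of rows and columns
names(starwars)           # column names
head(starwars, 5)         # first five rows
summary(starwars$height)  # min, quartiles, mean, max and NA count for one column
```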
✗ sum(var)
✓ sum(dataset$var, na.rm = TRUE)
✗ starwars |> sum(height)
✗ starwars |> sum(height, na.rm = TRUE)
✗ mean(var)
✓ mean(dataset$var, na.rm = TRUE)
✗ starwars |> mean(height)
✗ starwars |> mean(height, na.rm = TRUE)
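The piped ✗ versions do not work because the pipe hands the whole data frame to sum() or mean() as the first argument, while these functions expect a vector. If you still want to use the pipe, here is a sketch of two options, assuming dplyr's pull() and summarise(). The names height_summary, total_height and mean_height are made up for this example.

```r
library(tidyverse)

# Option 1: pull() extracts the column as a plain vector first.
starwars |>
  pull(height) |>
  mean(na.rm = TRUE)

# Option 2: summarise() evaluates column names inside the data frame
# and returns a one-row tibble.
height_summary <- starwars |>
  summarise(
    total_height = sum(height, na.rm = TRUE),
    mean_height  = mean(height, na.rm = TRUE)
  )
height_summary
```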
[1] 14143
The sum() function calculates the total height of all characters in the dataset. The na.rm = TRUE option ensures missing values are ignored.
Assigning (<-) the result to an object is optional here. In this specific exercise, we don’t really care about the result, since the total has no real meaning.
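The call behind this output is not shown here. A sketch consistent with the description, assuming the height column of starwars, could look as follows; total_height is a hypothetical object name.

```r
library(tidyverse)

# Without na.rm = TRUE the missing heights propagate and the result is NA.
sum(starwars$height)

# With na.rm = TRUE the missing values are dropped before summing.
sum(starwars$height, na.rm = TRUE)

# Optional: store the result in an object (hypothetical name) to reuse it later.
total_height <- sum(starwars$height, na.rm = TRUE)
total_height
```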
What do we use if we want to create new variables?
mutate. Let’s use mass and height from the dataset to create a bmi variable. You can google the formula if you don’t know how to calculate BMI.
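One possible solution, sketched under the assumption that BMI is weight in kilograms divided by squared height in metres, and that starwars records mass in kilograms and height in centimetres:

```r
library(tidyverse)

# BMI = mass (kg) / height (m)^2; height is in cm, so divide by 100 first.
starwars <- starwars |>
  mutate(bmi = mass / (height / 100)^2)

# Check the new column next to its inputs.
starwars |>
  select(name, height, mass, bmi)
```

Characters with a missing mass or height get NA for bmi, which is the expected behaviour of arithmetic on missing values.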
Which function do we use to subset our dataset (from the tidyverse package)?
filter. Keep the characters taller than 200 cm and store them in an object called tall_characters.
# Filter characters with height greater than 200
tall_characters <- starwars |>
  filter(height > 200)
tall_characters
# A tibble: 10 × 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 2 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 3 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
 4 Rugor N…    206    NA none       green      orange          NA   male  mascu…
 5 Yarael …    264    NA none       white      yellow          NA   male  mascu…
 6 Lama Su     229    88 none       grey       black           NA   male  mascu…
 7 Taun We     213    NA none       grey       black           NA   fema… femin…
 8 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 9 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
10 Tion Me…    206    80 none       grey       black           NA   male  mascu…
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, bmi <dbl>
The filter() function keeps only the rows (characters) whose height is greater than 200 cm.
Use the same object, i.e., tall_characters, to sort the characters. Number 1 should be the tallest character.
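One way to do this, sketched with dplyr's arrange() and desc(). The first step rebuilds tall_characters so the chunk runs on its own.

```r
library(tidyverse)

# Rebuild tall_characters so this chunk runs on its own.
tall_characters <- starwars |>
  filter(height > 200)

# arrange() sorts rows; desc() puts the largest height first.
tall_characters <- tall_characters |>
  arrange(desc(height))

tall_characters |>
  select(name, height)
```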
Extra optional question:
What’s the difference between select and filter?
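In short, select() picks columns and filter() picks rows. The sketch below illustrates this on starwars; the chosen columns, the 200 cm cutoff and the name tall_names are only illustrative.

```r
library(tidyverse)

# select() keeps columns (variables); the number of rows is unchanged.
starwars |>
  select(name, height, mass)

# filter() keeps rows (observations) that satisfy a condition;
# the columns are unchanged.
starwars |>
  filter(height > 200)

# They are often combined: first restrict the rows, then the columns.
tall_names <- starwars |>
  filter(height > 200) |>
  select(name, height)
tall_names
```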
Now, imagine we would like to know the average height by species in this universe. How do we do it?
# Group by species and summarize average height
avg_height <- starwars |>
  group_by(species) |>
  summarise(avg_height = mean(height, na.rm = TRUE))
avg_height
# A tibble: 38 × 2
   species   avg_height
   <chr>          <dbl>
 1 Aleena           79
 2 Besalisk        198
 3 Cerean          198
 4 Chagrian        196
 5 Clawdite        168
 6 Droid           131.
 7 Dug             112
 8 Ewok             88
 9 Geonosian       183
10 Gungan          209.
# ℹ 28 more rows
ggplot2
Let’s visualize the tallest characters. Use the object tall_characters to create a plot of the character name (categorical, y axis) vs. their height (x axis).
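A possible starting point, sketched with geom_col(). The choice of a bar chart, the axis labels, the title and the name tall_plot are assumptions for this example; other geoms such as geom_point() would also fit the task.

```r
library(tidyverse)

# Rebuild tall_characters so this chunk runs on its own.
tall_characters <- starwars |>
  filter(height > 200)

# Horizontal bars: height on the x axis, character name on the y axis.
tall_plot <- ggplot(tall_characters, aes(x = height, y = name)) +
  geom_col() +
  labs(x = "Height (cm)", y = "Character", title = "Tallest characters")

tall_plot
```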
You can use Google. Hint: factor()
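One plausible use of the factor() hint, assuming the goal is to order the bars by height rather than alphabetically. The name ordered_plot is hypothetical.

```r
library(tidyverse)

# Rebuild tall_characters, sorted from shortest to tallest.
tall_characters <- starwars |>
  filter(height > 200) |>
  arrange(height)

# A character column is plotted in alphabetical order. Turning name into a
# factor whose levels follow the sorted rows makes ggplot2 keep that order.
tall_characters <- tall_characters |>
  mutate(name = factor(name, levels = name))

ordered_plot <- ggplot(tall_characters, aes(x = height, y = name)) +
  geom_col() +
  labs(x = "Height (cm)", y = "Character")

ordered_plot
```

Because the first factor level is drawn at the bottom of the y axis, sorting in ascending order places the tallest character at the top of the plot.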
Render the document by pressing Ctrl + Shift + K on your keyboard (Cmd + Shift + K on Mac).
Example: