If you haven't installed R and RStudio, use the following Posit Cloud project
STOP: Super important warning!
If you didn't complete the summer assignments, you should definitely make time to complete the following primers. The original content comes from RStudio and was adapted by Prof. Andrew Heiss.
For the first part of this week's lesson, you need to work through a few of Posit's introductory primers. You'll do these in your browser, where you can type code and see results immediately.
You'll learn some of the basics of R, as well as some powerful methods for manipulating data with the {dplyr} package.
Complete these primers. It may seem like there are a lot, but they're short and go fairly quickly, especially as you get the hang of the syntax. Also, I have no way of seeing what you do or what you get wrong or right, and that's totally fine! If you get stuck or want to skip some (or if it gets too easy), feel free to move on!
The content from these primers comes from the (free and online!) book R for Data Science by Garrett Grolemund and Hadley Wickham. I highly recommend the book as a reference and for continuing to learn and use R in the future (like running regression models and other types of statistical analysis).
Introduction
In this lab session, you'll work with a dataset containing information about the Olympics. The dataset includes various variables, such as the edition of the games, country codes, sports, events, athletes, and the results. Your goal is to explore this data, identify patterns, handle missing values, and generate insights using R.
Throughout the lab, you will use the following R functions:
read_csv(): This function from the readr package reads a CSV file and creates a data frame.
head(): Displays the first few rows of a data frame to give you a quick look at the data.
summary(): Provides summary statistics for each variable in the dataset.
n_distinct(): Counts the number of unique values for a particular variable.
colSums(): Sums the values in each column; combined with is.na(), it counts the missing values per column.
is.na(): Checks for NA (missing) values in the dataset.
na.omit(): Removes rows with missing values from a data frame.
filter(): Extracts rows from a data frame that meet certain conditions.
group_by(): Groups the data by one or more variables, which is often used before summarizing data.
summarize(): Creates summary statistics for each group in the data.
arrange(): Orders the rows of a data frame based on the values of one or more columns.
pull(): Extracts a single column from a data frame as a vector.
slice(): Selects rows by position from a data frame.
distinct(): Extracts distinct (unique) rows from a data frame.
count(): Counts the number of occurrences of each unique value in a column.
These functions will enable you to load and explore the dataset, handle missing data, and perform various analyses to extract insights.
The dataset contains the following variables:
edition: The edition of the Olympic Games.
edition_id: A unique identifier for the edition.
country_noc: The National Olympic Committee (NOC) code representing the country.
sport: The sport in which the event took place.
event: The specific event within the sport.
result_id: A unique identifier for the result.
athlete: The name of the athlete.
athlete_id: A unique identifier for the athlete.
pos: The position or rank the athlete achieved in the event.
medal: The type of medal won (if any).
isTeamSport: Indicates whether the event is a team sport.
Loading Packages and Data
To start, you'll need to load the necessary packages and the dataset. This will allow you to perform the analyses required for the lab exercises.
# Load necessary packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset
olympics_data <- read_csv("https://raw.githubusercontent.com/josephwccheng/olympedia_web_scraping/main/data/Olympic_Athlete_Event_Results.csv")
Rows: 314907 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): edition, country_noc, sport, event, athlete, pos, medal
dbl (3): edition_id, result_id, athlete_id
lgl (1): isTeamSport
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
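The message above mentions `spec()` and `show_col_types = FALSE`. As a minimal sketch of what those do, the example below reads a tiny inline CSV instead of the real file, so it runs without a download. The three rows and their values are made up for illustration; only the column names are borrowed from the dataset.

```r
library(readr)

# Hypothetical three-row CSV, inlined with I() so no download is needed
toy_csv <- I("edition,edition_id,medal,isTeamSport
1908 Summer Olympics,5,Gold,FALSE
1908 Summer Olympics,5,,TRUE
1912 Summer Olympics,6,Silver,FALSE")

# show_col_types = FALSE suppresses the column specification message
toy <- read_csv(toy_csv, show_col_types = FALSE)

# spec() still reports the column types that readr guessed
spec(toy)

# The empty medal field in the second row is read as NA
toy$medal
```

The same `show_col_types = FALSE` argument can be added to the `read_csv()` call for the full dataset if the message gets in the way.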
Exercise 1: Data Exploration
Load the dataset into R.
Use the code provided above to load the dataset.
Get an overview of the dataset.
You should display the first few rows and summarize each variable to understand the dataset's structure. This step is crucial for familiarizing yourself with the data you'll be working with.
Determine the number of unique editions, sports, and events in the dataset.
Your goal here is to identify the diversity within the dataset. Use R functions to calculate how many unique Olympic editions, sports, and events are represented in the data.
Expected Outcome:
An understanding of the dataset's structure.
Insights into the diversity of the data, such as the number of unique Olympic editions and sports.
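If you are unsure how the exploration functions fit together, here is a minimal sketch on a made-up four-row tibble. The object `toy` and its values are hypothetical; you would apply the same calls to `olympics_data`.

```r
library(dplyr)

# Made-up rows that only mimic some of the lab's column names
toy <- tibble(
  edition = c("2000 Summer", "2000 Summer", "2004 Summer", "2004 Summer"),
  sport   = c("Rowing", "Judo", "Rowing", "Rowing"),
  event   = c("Single Sculls", "Lightweight", "Single Sculls", "Eights")
)

head(toy, 3)   # first three rows
summary(toy)   # per-column summary

# n_distinct() counts the unique values in one column
n_distinct(toy$edition)

# summarize() with across() applies it to several columns at once
unique_counts <- toy %>%
  summarize(across(c(edition, sport, event), n_distinct))
unique_counts
```

Using `across()` is one option among several; calling `n_distinct()` once per column gives the same numbers.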
Exercise 2: Handling Missing Values
Identify missing values.
Determine which variables have missing values and how many missing values are present in each. This will help you understand the completeness of the data.
Create a subset of the data where all missing values are removed.
You should generate a clean dataset without missing values. Consider how this might impact your analysis.
Discuss the impact of removing rows with missing values.
Reflect on how the removal of rows could influence the results and representativeness of the data.
Expected Outcome:
A list of variables with missing values.
A cleaned version of the dataset.
A thoughtful consideration of the implications of removing missing data.
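One possible way to count and remove missing values is sketched below on a made-up tibble. The object `toy` and its values are hypothetical and are not taken from the Olympic data.

```r
library(dplyr)

# Made-up results: athletes B and C have no medal, C also has no position
toy <- tibble(
  athlete = c("A", "B", "C", "D"),
  pos     = c("1", "4", NA, "2"),
  medal   = c("Gold", NA, NA, "Silver")
)

# is.na() returns a TRUE/FALSE matrix; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# na.omit() keeps only the rows with no NA in any column
toy_complete <- na.omit(toy)

# Compare row counts before and after
nrow(toy)
nrow(toy_complete)
```

In this toy example every athlete without a medal disappears, because `medal` is NA for them. It is worth checking whether something similar happens in the real dataset before you rely on the cleaned version.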
Exercise 3: Analyzing Medals Distribution
Calculate the total number of medals won by each country.
You'll need to group the data by country and count the total number of medals won.
Identify the country with the most gold medals.
Focus on identifying which country has excelled the most in terms of winning gold medals.
Expected Outcome:
A summary table showing the total medals won by each country.
Identification of the top-performing country in terms of gold medals.
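The group-then-count pattern can be sketched as follows. The country codes and medal values in `toy` are made up, and the sketch assumes that a missing `medal` means no medal was won.

```r
library(dplyr)

# Made-up rows: one row per result, NA means no medal (assumption)
toy <- tibble(
  country_noc = c("AAA", "AAA", "BBB", "BBB", "BBB", "CCC"),
  medal       = c("Gold", NA, "Gold", "Gold", "Bronze", "Silver")
)

# Total medals per country: drop non-medal rows, group, count, sort
medals_by_country <- toy %>%
  filter(!is.na(medal)) %>%
  group_by(country_noc) %>%
  summarize(total_medals = n()) %>%
  arrange(desc(total_medals))
medals_by_country

# Country with the most gold medals: keep gold rows, count, take the top row
top_gold <- toy %>%
  filter(medal == "Gold") %>%
  count(country_noc, sort = TRUE) %>%
  slice(1)
top_gold
```

`count()` is shorthand for `group_by()` followed by `summarize(n = n())`, so either form works. Note that `slice(1)` returns a single row even when several countries are tied.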
Exercise 4: Analyzing Performance by Athlete
Identify the athlete with the most medals overall.
Your task is to find the athlete who has won the most medals at the Olympics.
Determine the number of unique events the athlete has participated in.
Investigate the range of events this top athlete has competed in.
Expected Outcome:
The name of the athlete with the most medals.
The number of unique events this athlete has participated in, offering insight into their versatility.
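A minimal sketch of the two steps, using made-up athletes, is shown below. The names, IDs, and events in `toy` are hypothetical. Counting by `athlete_id` as well as `athlete` is a precaution in case two athletes share a name.

```r
library(dplyr)

# Made-up results for three hypothetical athletes
toy <- tibble(
  athlete_id = c(1, 1, 1, 2, 2, 3),
  athlete    = c("Ana", "Ana", "Ana", "Ben", "Ben", "Cy"),
  event      = c("100m", "200m", "100m", "Long Jump", "100m", "200m"),
  medal      = c("Gold", "Silver", NA, "Gold", NA, NA)
)

# Step 1: athlete with the most medal rows
top_athlete <- toy %>%
  filter(!is.na(medal)) %>%
  count(athlete_id, athlete, sort = TRUE) %>%
  slice(1)
top_athlete

# pull() extracts the ID as a plain value so it can be used in filter()
top_id <- pull(top_athlete, athlete_id)

# Step 2: number of distinct events that athlete entered, medal or not
events_entered <- toy %>%
  filter(athlete_id == top_id) %>%
  distinct(event) %>%
  nrow()
events_entered
```

Whether "events participated in" should include events without a medal is your call; the sketch counts all of them.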
Exercise 5: Team Sports vs. Individual Sports
Compare the number of medals won in team sports versus individual sports.
Analyze how successful athletes have been in team sports compared to individual sports.
Identify the most successful team sport.
Determine which team sport has yielded the most medals.
Expected Outcome:
A comparison of medals won in team versus individual sports.
Identification of the most successful team sport, providing insights into which team events dominate in terms of medals.
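The comparison can be sketched on made-up data as follows. The sports and values in `toy` are hypothetical, and the sketch again assumes that a missing `medal` means no medal.

```r
library(dplyr)

# Made-up rows: isTeamSport flags team events, NA medal means no medal
toy <- tibble(
  sport       = c("Hockey", "Hockey", "Hockey", "Judo", "Judo", "Polo", "Polo"),
  isTeamSport = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE),
  medal       = c("Gold", "Gold", "Silver", "Gold", NA, "Bronze", NA)
)

# Keep only the rows where a medal was won
medalists <- toy %>%
  filter(!is.na(medal))

# Medal rows in team versus individual sports
team_vs_individual <- medalists %>%
  count(isTeamSport, name = "medal_rows")
team_vs_individual

# Team sport with the most medal rows
top_team_sport <- medalists %>%
  filter(isTeamSport) %>%
  count(sport, sort = TRUE) %>%
  slice(1) %>%
  pull(sport)
top_team_sport
```

If each row in the real data is one athlete's result, a team medal appears once per team member, so counting rows is not the same as counting medals awarded per event. Keep that in mind when you interpret the comparison.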
Conclusion
This lab session is designed to help you work with real-world Olympic data, exploring various techniques to handle and analyze data in R. By the end of this session, you should be able to perform basic data exploration, handle missing data, and analyze key aspects of the dataset. Make sure to reflect on any challenges you encounter and bring them up in the next session for discussion.