Lab 1: Analyzing Olympic Data with R

Author

Rony Rodriguez-Ramirez

Published

16 August 2024

Housekeeping

If you haven’t installed R and RStudio, use the following Posit Cloud project

STOP: Super important warning!

If you didn’t complete the summer assignments, make time to complete the following primers. The original content comes from RStudio and was adapted by Prof. Andrew Heiss.

For the first part of this week’s lesson, you need to work through a few of Posit’s introductory primers. You’ll do these in your browser, where you can type code and see results immediately.

You’ll learn some of the basics of R, as well as some powerful methods for manipulating data with the {dplyr} package.

Complete these primers. It may seem like there are a lot, but they’re short and go fairly quickly, especially as you get the hang of the syntax. Also, I have no way of seeing what you do or what you get wrong or right, and that’s totally fine! If you get stuck or want to skip some (or if it gets too easy), feel free to move on!

- The Basics
  - Visualization basics
  - Programming basics
- Work with Data

The content from these primers comes from the (free and online!) book R for Data Science by Garrett Grolemund and Hadley Wickham. I highly recommend the book as a reference and for continuing to learn and use R in the future (like running regression models and other types of statistical analysis).

Introduction

In this lab session, you’ll work with a dataset containing information about the Olympics. The dataset includes various variables, such as the edition of the games, country codes, sports, events, athletes, and the results. Your goal is to explore this data, identify patterns, handle missing values, and generate insights using R.

Throughout the lab, you will use the following R functions:

read_csv(): This function from the readr package reads a CSV file and creates a data frame.
head(): Displays the first few rows of a data frame to give you a quick look at the data.
summary(): Provides summary statistics for each variable in the dataset.
n_distinct(): Counts the number of unique values for a particular variable.
colSums(): Sums the values in each column. Applied to the result of is.na(), it counts the missing values in each column.
is.na(): Checks for NA (missing) values in the dataset.
na.omit(): Removes rows with missing values from a data frame.
filter(): Extracts rows from a data frame that meet certain conditions.
group_by(): Groups the data by one or more variables, which is often used before summarizing data.
summarize(): Creates summary statistics for each group in the data.
arrange(): Orders the rows of a data frame based on the values of one or more columns.
pull(): Extracts a single column from a data frame as a vector.
slice(): Selects rows by position from a data frame.
distinct(): Extracts distinct (unique) rows from a data frame.
count(): Counts the number of occurrences of each unique value in a column.

These functions will enable you to load and explore the dataset, handle missing data, and perform various analyses to extract insights.
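As a quick illustration of how a few of these functions chain together, here is a minimal sketch on a small made-up tibble. The rows and the object names (`toy`, `medalists`, `sports`, `top_athlete`) are hypothetical and are not taken from the Olympic dataset.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the real dataset
toy <- tibble(
  athlete = c("A", "B", "C", "A", "D"),
  sport   = c("Swimming", "Swimming", "Judo", "Swimming", "Judo"),
  medal   = c("Gold", NA, "Silver", "Bronze", NA)
)

# filter(): keep only the rows where a medal was won
medalists <- toy %>% filter(!is.na(medal))

# distinct() + pull(): the unique sports as a character vector
sports <- toy %>% distinct(sport) %>% pull(sport)

# count() + arrange() + slice(): the athlete with the most medal rows
top_athlete <- medalists %>%
  count(athlete, name = "n_medals") %>%
  arrange(desc(n_medals)) %>%
  slice(1)

top_athlete
```

The same pattern (filter, then count, then sort, then take the first row) comes up repeatedly in the exercises.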

The dataset contains the following variables:

edition: The edition of the Olympic Games.
edition_id: A unique identifier for the edition.
country_noc: The National Olympic Committee (NOC) code representing the country.
sport: The sport in which the event took place.
event: The specific event within the sport.
result_id: A unique identifier for the result.
athlete: The name of the athlete.
athlete_id: A unique identifier for the athlete.
pos: The position or rank the athlete achieved in the event.
medal: The type of medal won (if any).
isTeamSport: Indicates whether the event is a team sport.

Loading Packages and Data

To start, you’ll need to load the necessary packages and the dataset. This will allow you to perform the analyses required for the lab exercises.

# Load necessary packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset
olympics_data <- read_csv("https://raw.githubusercontent.com/josephwccheng/olympedia_web_scraping/main/data/Olympic_Athlete_Event_Results.csv")

Rows: 314907 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): edition, country_noc, sport, event, athlete, pos, medal
dbl (3): edition_id, result_id, athlete_id
lgl (1): isTeamSport

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
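The message above suggests specifying the column types. The sketch below shows one way to do that, using a tiny inline CSV with made-up rows so that it runs without a download. The values (including the `edition_id` of 24) are invented for illustration, and only four of the eleven columns are shown.

```r
library(readr)

# A tiny inline CSV with made-up rows; I() tells read_csv() that the string
# is literal data rather than a file path
csv_text <- I("edition,edition_id,medal,isTeamSport\n1996 Summer Olympics,24,Gold,FALSE\n1996 Summer Olympics,24,,TRUE\n")

# Declaring col_types up front silences the column specification message
toy <- read_csv(
  csv_text,
  col_types = cols(
    edition     = col_character(),
    edition_id  = col_double(),
    medal       = col_character(),
    isTeamSport = col_logical()
  )
)

toy
```

Note that the empty `medal` field in the second row is read as `NA`, which is the default behavior of `read_csv()` for empty strings.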

Exercise 1: Data Exploration

1. Load the dataset into R.
   Use the code provided above to load the dataset.
2. Get an overview of the dataset.
   You should display the first few rows and summarize each variable to understand the dataset’s structure. This step is crucial for familiarizing yourself with the data you’ll be working with.
3. Determine the number of unique editions, sports, and events in the dataset.
   Your goal here is to identify the diversity within the dataset. Use R functions to calculate how many unique Olympic editions, sports, and events are represented in the data.

Expected Outcome:

An understanding of the dataset’s structure.
Insights into the diversity of the data, such as the number of unique Olympic editions and sports.
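One possible way to approach this exercise is sketched below on a made-up tibble. The rows and the names `toy` and `unique_counts` are hypothetical; with the real data you would apply the same calls to `olympics_data`.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  edition = c("2012 Summer Olympics", "2012 Summer Olympics",
              "2016 Summer Olympics", "2016 Summer Olympics"),
  sport   = c("Swimming", "Judo", "Swimming", "Swimming"),
  event   = c("100 metres Freestyle, Men", "Half-Lightweight, Women",
              "100 metres Freestyle, Men", "200 metres Butterfly, Women")
)

# First rows and a per-column summary
head(toy, 3)
summary(toy)

# n_distinct() counts the unique values of each variable
unique_counts <- toy %>%
  summarize(
    n_editions = n_distinct(edition),
    n_sports   = n_distinct(sport),
    n_events   = n_distinct(event)
  )

unique_counts
```

For character columns, `summary()` reports only the length and type, so `n_distinct()` is the more informative tool here.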

Exercise 2: Handling Missing Values

1. Identify missing values.
   Determine which variables have missing values and how many missing values are present in each. This will help you understand the completeness of the data.
2. Create a subset of the data where all missing values are removed.
   You should generate a clean dataset without missing values. Consider how this might impact your analysis.
3. Discuss the impact of removing rows with missing values.
   Reflect on how the removal of rows could influence the results and representativeness of the data.

Expected Outcome:

A list of variables with missing values.
A cleaned version of the dataset.
A thoughtful consideration of the implications of removing missing data.
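A minimal sketch of the two mechanical steps, on a made-up tibble, might look like this. The rows and the names `toy`, `missing_per_column`, and `toy_complete` are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  athlete = c("A", "B", "C", "D"),
  pos     = c("1", "5", NA, "2"),
  medal   = c("Gold", NA, NA, "Silver")
)

# is.na() returns a TRUE/FALSE matrix; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# na.omit() drops every row that has at least one NA in any column
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

In this toy example, only the medalists survive `na.omit()`. If `medal` in the real data is likewise `NA` for athletes who did not win one (an assumption worth checking), removing all incomplete rows would discard every non-medalist, which matters for the discussion step.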

Exercise 3: Analyzing Medals Distribution

1. Calculate the total number of medals won by each country.
   You’ll need to group the data by country and count the total number of medals won.
2. Identify the country with the most gold medals.
   Focus on identifying which country has excelled the most in terms of winning gold medals.

Expected Outcome:

A summary table showing the total medals won by each country.
Identification of the top-performing country in terms of gold medals.
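The grouping and counting could be sketched as follows on made-up data. It assumes that `medal` is `NA` when no medal was won and that gold medals are labeled `"Gold"`; both are assumptions to verify against the real data. The names `toy`, `medals_by_country`, and `top_gold` are hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  country_noc = c("USA", "USA", "FRA", "FRA", "FRA", "KEN"),
  medal       = c("Gold", "Silver", "Gold", "Gold", NA, "Bronze")
)

# Total medal rows per country, largest first
medals_by_country <- toy %>%
  filter(!is.na(medal)) %>%
  group_by(country_noc) %>%
  summarize(total_medals = n()) %>%
  arrange(desc(total_medals))

# Country with the most rows labeled "Gold"
top_gold <- toy %>%
  filter(medal == "Gold") %>%
  count(country_noc, name = "gold_medals") %>%
  arrange(desc(gold_medals)) %>%
  slice(1) %>%
  pull(country_noc)

medals_by_country
top_gold
```

`count(country_noc)` is shorthand for `group_by(country_noc)` followed by `summarize(n = n())`, so either form works for the first step.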

Exercise 4: Analyzing Performance by Athlete

1. Identify the athlete with the most medals overall.
   Your task is to find the athlete who has won the most medals in the Olympics.
2. Determine the number of unique events the athlete has participated in.
   Investigate the range of events this top athlete has competed in.

Expected Outcome:

The name of the athlete with the most medals.
The number of unique events this athlete has participated in, offering insight into their versatility.
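One way to structure this analysis is sketched below on made-up data. The rows and the names `toy`, `top_athlete`, and `n_events` are hypothetical, and the sketch again assumes `medal` is `NA` when no medal was won.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  athlete_id = c(1, 1, 1, 2, 2, 3),
  athlete    = c("A", "A", "A", "B", "B", "C"),
  event      = c("100m", "200m", "100m", "Judo", "Judo", "100m"),
  medal      = c("Gold", "Silver", "Gold", "Bronze", NA, NA)
)

# Athlete with the most medal rows; counting by athlete_id as well as name
# keeps two different athletes with the same name apart
top_athlete <- toy %>%
  filter(!is.na(medal)) %>%
  count(athlete_id, athlete, name = "n_medals") %>%
  arrange(desc(n_medals)) %>%
  slice(1)

# Number of distinct events that athlete appears in (with or without a medal)
n_events <- toy %>%
  filter(athlete_id == top_athlete$athlete_id) %>%
  summarize(n_events = n_distinct(event)) %>%
  pull(n_events)

top_athlete
n_events
```

Note that the second step goes back to the full data rather than the medal-only subset, since the exercise asks about participation, not only medal-winning events.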

Exercise 5: Team Sports vs. Individual Sports

1. Compare the number of medals won in team sports versus individual sports.
   Analyze how successful athletes have been in team sports compared to individual sports.
2. Identify the most successful team sport.
   Determine which team sport has yielded the most medals.

Expected Outcome:

A comparison of medals won in team versus individual sports.
Identification of the most successful team sport, providing insights into which team events dominate in terms of medals.
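The comparison could be sketched like this on made-up data. The rows and the names `toy`, `medal_rows`, `team_vs_individual`, and `top_team_sport` are hypothetical, and the sketch assumes `isTeamSport` is a logical column, as the column specification above suggests.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  sport       = c("Basketball", "Basketball", "Football",
                  "Swimming", "Swimming", "Judo"),
  medal       = c("Gold", "Gold", "Silver", "Gold", NA, "Bronze"),
  isTeamSport = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Keep only the rows where a medal was won
medal_rows <- toy %>% filter(!is.na(medal))

# Medal rows in team (TRUE) versus individual (FALSE) sports
team_vs_individual <- medal_rows %>%
  count(isTeamSport, name = "n_medals")

# Team sport with the most medal rows
top_team_sport <- medal_rows %>%
  filter(isTeamSport) %>%
  count(sport, name = "n_medals") %>%
  arrange(desc(n_medals)) %>%
  slice(1) %>%
  pull(sport)

team_vs_individual
top_team_sport
```

This counts medal rows, not medals awarded. If the real data has one row per team member (an assumption to check), a single team gold appears several times, so you may want to apply `distinct()` to the edition, event, country, and medal columns before counting.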

Conclusion

This lab session is designed to help you work with real-world Olympic data, exploring various techniques to handle and analyze data in R. By the end of this session, you should be able to perform basic data exploration, handle missing data, and analyze key aspects of the dataset. Make sure to reflect on any challenges you encounter and bring them up in the next session for discussion.