library(tidyverse)
Tutorial 6: Advanced Data Wrangling with Tidyverse
Advanced Data Wrangling in the Economics of Education
In this tutorial, we will explore advanced data wrangling techniques using the Tidyverse, with examples related to the economics of education. We will cover tasks such as merging datasets, reshaping data, handling missing values, and analyzing educational data. These techniques will help you efficiently prepare and analyze complex datasets in educational economics research. First, load the tidyverse
package.
6.1 Merging Datasets in Education Research
Educational datasets often come from multiple sources, such as student assessments, school funding records, and demographic surveys. Combining these datasets is a crucial step in analysis.
6.1.1 Inner Join: Combining Student Test Scores with School Funding
An inner join merges two datasets, keeping only rows with matching values in both. For instance, you may want to combine student test scores with school funding data.
# Example data frames
<- data.frame(
test_scores_df StudentID = c(1, 2, 3, 4, 5),
Math_Score = c(85, 90, 78, 88, 92),
Reading_Score = c(82, 88, 80, 85, 91)
)
<- data.frame(
school_funding_df SchoolID = c(1, 1, 2, 2, 3),
StudentID = c(1, 2, 3, 4, 5),
Funding_Per_Student = c(10000, 10000, 8000, 8000, 9000)
)
# Performing an inner join on StudentID
<- test_scores_df |> inner_join(school_funding_df, by = "StudentID")
combined_df combined_df
StudentID Math_Score Reading_Score SchoolID Funding_Per_Student
1 1 85 82 1 10000
2 2 90 88 1 10000
3 3 78 80 2 8000
4 4 88 85 2 8000
5 5 92 91 3 9000
Explanation: - inner_join()
merges test_scores_df
and school_funding_df
by matching the StudentID
column. Only students present in both datasets are included in the result.
6.1.2 Left Join: Linking Student Performance with Demographics
A left join includes all rows from the left dataset (e.g., student performance) and adds matching rows from the right dataset (e.g., demographics). Missing values are filled with NA
.
# Example data frames
<- data.frame(
demographics_df StudentID = c(1, 2, 3, 6),
Gender = c("F", "M", "F", "M"),
SES = c("High", "Low", "Medium", "Low")
)
# Performing a left join on StudentID
<- test_scores_df |> left_join(demographics_df, by = "StudentID")
combined_left_df combined_left_df
StudentID Math_Score Reading_Score Gender SES
1 1 85 82 F High
2 2 90 88 M Low
3 3 78 80 F Medium
4 4 88 85 <NA> <NA>
5 5 92 91 <NA> <NA>
Explanation: - left_join()
includes all students from test_scores_df
and adds corresponding demographic information where available. Students without demographic data have NA
in those columns.
6.1.3 Full Join: Comprehensive Analysis of Multiple Datasets
A full join merges all rows from both datasets, filling unmatched areas with NA
.
# Performing a full join on StudentID
<- test_scores_df |> full_join(demographics_df, by = "StudentID")
full_combined_df full_combined_df
StudentID Math_Score Reading_Score Gender SES
1 1 85 82 F High
2 2 90 88 M Low
3 3 78 80 F Medium
4 4 88 85 <NA> <NA>
5 5 92 91 <NA> <NA>
6 6 NA NA M Low
Explanation: - full_join()
combines all rows from test_scores_df
and demographics_df
, retaining all students and filling in missing values with NA
.
6.2 Reshaping Educational Data
Reshaping data is often necessary to perform certain types of analyses, such as tracking student performance across subjects or years.
6.2.1 pivot_longer(): Analyzing Subject Performance Over Time
pivot_longer()
is useful for converting wide data (where each subject is a column) into long data, which is more suitable for longitudinal analysis.
# Example wide data frame
<- data.frame(
subject_scores_df StudentID = c(1, 2, 3),
Math_2019 = c(85, 88, 90),
Math_2020 = c(87, 85, 92),
Reading_2019 = c(82, 80, 88),
Reading_2020 = c(85, 83, 90)
)
# Converting to long format
<- subject_scores_df |>
long_scores_df pivot_longer(cols = -StudentID, names_to = c("Subject", "Year"), names_sep = "_", values_to = "Score")
long_scores_df
# A tibble: 12 × 4
StudentID Subject Year Score
<dbl> <chr> <chr> <dbl>
1 1 Math 2019 85
2 1 Math 2020 87
3 1 Reading 2019 82
4 1 Reading 2020 85
5 2 Math 2019 88
6 2 Math 2020 85
7 2 Reading 2019 80
8 2 Reading 2020 83
9 3 Math 2019 90
10 3 Math 2020 92
11 3 Reading 2019 88
12 3 Reading 2020 90
Explanation: - pivot_longer()
reshapes the dataset so that each subject’s scores across different years are in their own rows. The names_to
and names_sep
arguments split the column names into Subject
and Year
.
6.2.2 pivot_wider(): Comparing Yearly Performance by Subject
pivot_wider()
converts long-format data back to wide format, which is useful for comparing scores across subjects or years side by side.
# Converting back to wide format
<- long_scores_df |>
wide_scores_df pivot_wider(names_from = c("Subject", "Year"), values_from = "Score")
wide_scores_df
# A tibble: 3 × 5
StudentID Math_2019 Math_2020 Reading_2019 Reading_2020
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 85 87 82 85
2 2 88 85 80 83
3 3 90 92 88 90
Explanation: - pivot_wider()
rearranges the data so that each subject and year combination becomes a separate column, allowing for easy comparison across different time points.
6.3 Handling Missing Data in Educational Datasets
Educational datasets often contain missing values due to various reasons, such as incomplete surveys or missing records. Proper handling of missing data is crucial for accurate analysis.
6.3.1 Identifying and Summarizing Missing Data
You can use is.na()
to identify missing values and summarise()
to count them.
# Identifying missing values in the combined dataset
<- combined_left_df |>
missing_summary summarise(across(everything(), ~sum(is.na(.))))
missing_summary
StudentID Math_Score Reading_Score Gender SES
1 0 0 0 2 2
Explanation: - This code counts the number of missing values (NA
) in each column of the combined_left_df
dataset, providing a summary of data completeness.
6.3.2 Replacing Missing Values: Imputation
Imputation involves replacing missing values with substituted values. A common approach in educational data is to use the mean or median.
# Replacing missing SES values with "Unknown"
<- combined_left_df |>
filled_df mutate(SES = replace_na(SES, "Unknown"))
filled_df
StudentID Math_Score Reading_Score Gender SES
1 1 85 82 F High
2 2 90 88 M Low
3 3 78 80 F Medium
4 4 88 85 <NA> Unknown
5 5 92 91 <NA> Unknown
Explanation: - replace_na()
replaces NA
values in the SES
column with "Unknown"
, which is useful when the socioeconomic status is missing.
6.3.3 Removing Rows with Missing Values
In some cases, it may be necessary to remove rows with missing data to ensure the integrity of the analysis.
# Removing rows with missing Math scores
<- combined_df |> drop_na(Math_Score)
cleaned_scores_df cleaned_scores_df
StudentID Math_Score Reading_Score SchoolID Funding_Per_Student
1 1 85 82 1 10000
2 2 90 88 1 10000
3 3 78 80 2 8000
4 4 88 85 2 8000
5 5 92 91 3 9000
Explanation: - drop_na()
removes rows where the Math_Score
is missing, ensuring that only complete cases are used in the analysis.
Exercises and Solutions
Exercise 1: Merging Student and Teacher Data
- Create two data frames,
student_scores_df
with columnsStudentID
,Math_Score
, andReading_Score
, andteacher_info_df
with columnsTeacherID
,StudentID
, andTeacherExperience
. Perform a left join to combine these datasets.
Solution:
# Creating the data frames
<- data.frame(
student_scores_df StudentID = c(1, 2, 3, 4),
Math_Score = c(85, 90, 78, 88),
Reading_Score = c(82, 88, 80, 85)
)
<- data.frame(
teacher_info_df TeacherID = c(101, 102, 101, 103),
StudentID = c(1, 2, 3, 4),
TeacherExperience = c(5, 10, 5, 7)
)
# Performing the left join
<- student_scores_df |> left_join(teacher_info_df, by = "StudentID")
combined_teacher_df combined_teacher_df
StudentID Math_Score Reading_Score TeacherID TeacherExperience
1 1 85 82 101 5
2 2 90 88 102 10
3 3 78 80 101 5
4 4 88 85 103 7
Exercise 2: Reshaping Data for Cohort Analysis
- Start with a wide data frame
cohort_scores_df
with columnsStudentID
,Cohort_2019_Math
,Cohort_2019_Reading
,Cohort_2020_Math
, andCohort_2020_Reading
. Convert this data frame to a long format, separating cohort year and subject.
Solution:
# Creating the wide data frame
<- data.frame(
cohort_scores_df StudentID = c(1, 2, 3),
Cohort_2019_Math = c(85, 88, 90),
Cohort_2019_Reading = c(82, 80, 88),
Cohort_2020_Math = c(87, 85, 92),
Cohort_2020_Reading = c(85, 83, 90)
)
# Converting to long format
<- cohort_scores_df |>
long_cohort_df pivot_longer(
cols = -StudentID,
names_to = c("Cohort_Year", "Subject"), names_sep = "_", values_to = "Score"
)
Warning: Expected 2 pieces. Additional pieces discarded in 4 rows [1, 2, 3, 4].
long_cohort_df
# A tibble: 12 × 4
StudentID Cohort_Year Subject Score
<dbl> <chr> <chr> <dbl>
1 1 Cohort 2019 85
2 1 Cohort 2019 82
3 1 Cohort 2020 87
4 1 Cohort 2020 85
5 2 Cohort 2019 88
6 2 Cohort 2019 80
7 2 Cohort 2020 85
8 2 Cohort 2020 83
9 3 Cohort 2019 90
10 3 Cohort 2019 88
11 3 Cohort 2020 92
12 3 Cohort 2020 90
Exercise 3: Handling Missing Data in School Surveys
- Create a data frame
school_survey_df
with columnsSchoolID
,StudentSatisfaction
, andTeacherTurnover
. Introduce missing values into theStudentSatisfaction
column. Replace these missing values with the median of the available satisfaction scores.
Solution:
# Creating the data frame with missing values
<- data.frame(
school_survey_df SchoolID = c(1, 2, 3, 4),
StudentSatisfaction = c(85, NA, 78, 90),
TeacherTurnover = c(10, 15, 8, 12)
)
# Replacing missing values with the median
<- school_survey_df |>
school_survey_df mutate(StudentSatisfaction = replace_na(StudentSatisfaction, median(StudentSatisfaction, na.rm = TRUE)))
school_survey_df
SchoolID StudentSatisfaction TeacherTurnover
1 1 85 10
2 2 85 15
3 3 78 8
4 4 90 12
Exercise 4: Combining Data with Different Keys
- Create two data frames,
school_expenditure_df
with columnsSchoolID
,Expenditure_Per_Student
, andTotal_Expenditure
, andstudent_scores_df
withSchoolID
,Math_Score
, andReading_Score
. Perform a full join onSchoolID
to combine these datasets, then filter to keep only schools with available expenditure data.
Solution:
# Creating the data frames
<- data.frame(
school_expenditure_df SchoolID = c(1, 2, 3),
Expenditure_Per_Student = c(10000, 8000, 9000),
Total_Expenditure = c(500000, 400000, 450000)
)
<- data.frame(
student_scores_df SchoolID = c(1, 2, 4),
Math_Score = c(85, 90, 78),
Reading_Score = c(82, 88, 80)
)
# Performing the full join
<- school_expenditure_df |> full_join(student_scores_df, by = "SchoolID")
combined_expenditure_df
# Filtering to keep only schools with expenditure data
<- combined_expenditure_df |> filter(!is.na(Expenditure_Per_Student))
filtered_expenditure_df filtered_expenditure_df
SchoolID Expenditure_Per_Student Total_Expenditure Math_Score Reading_Score
1 1 10000 500000 85 82
2 2 8000 400000 90 88
3 3 9000 450000 NA NA
Exercise 5: Reshaping Data for Trend Analysis
- Using a data frame
annual_scores_df
with columnsYear
,SchoolID
,Average_Math_Score
, andAverage_Reading_Score
, convert the data from wide to long format to analyze trends in average scores over time.
Solution:
# Creating the wide data frame
<- data.frame(
annual_scores_df Year = c(2019, 2020, 2021),
SchoolID = c(1, 1, 1),
Average_Math_Score = c(78, 80, 82),
Average_Reading_Score = c(75, 77, 79)
)
# Converting to long format
<- annual_scores_df |>
long_annual_scores_df pivot_longer(cols = starts_with("Average"), names_to = "Subject", names_prefix = "Average_", values_to = "Score")
long_annual_scores_df
# A tibble: 6 × 4
Year SchoolID Subject Score
<dbl> <dbl> <chr> <dbl>
1 2019 1 Math_Score 78
2 2019 1 Reading_Score 75
3 2020 1 Math_Score 80
4 2020 1 Reading_Score 77
5 2021 1 Math_Score 82
6 2021 1 Reading_Score 79