library(tidyverse)Tutorial 6: Advanced Data Wrangling with Tidyverse
Advanced Data Wrangling in the Economics of Education
In this tutorial, we will explore advanced data wrangling techniques using the Tidyverse, with examples related to the economics of education. We will cover tasks such as merging datasets, reshaping data, handling missing values, and analyzing educational data. These techniques will help you efficiently prepare and analyze complex datasets in educational economics research. First, load the tidyverse package.
6.1 Merging Datasets in Education Research
Educational datasets often come from multiple sources, such as student assessments, school funding records, and demographic surveys. Combining these datasets is a crucial step in analysis.
6.1.1 Inner Join: Combining Student Test Scores with School Funding
An inner join merges two datasets, keeping only rows with matching values in both. For instance, you may want to combine student test scores with school funding data.
# Example data frames
test_scores_df <- data.frame(
StudentID = c(1, 2, 3, 4, 5),
Math_Score = c(85, 90, 78, 88, 92),
Reading_Score = c(82, 88, 80, 85, 91)
)
school_funding_df <- data.frame(
SchoolID = c(1, 1, 2, 2, 3),
StudentID = c(1, 2, 3, 4, 5),
Funding_Per_Student = c(10000, 10000, 8000, 8000, 9000)
)
# Performing an inner join on StudentID
combined_df <- test_scores_df |> inner_join(school_funding_df, by = "StudentID")
combined_df StudentID Math_Score Reading_Score SchoolID Funding_Per_Student
1 1 85 82 1 10000
2 2 90 88 1 10000
3 3 78 80 2 8000
4 4 88 85 2 8000
5 5 92 91 3 9000
Explanation: - inner_join() merges test_scores_df and school_funding_df by matching the StudentID column. Only students present in both datasets are included in the result.
6.1.2 Left Join: Linking Student Performance with Demographics
A left join includes all rows from the left dataset (e.g., student performance) and adds matching rows from the right dataset (e.g., demographics). Missing values are filled with NA.
# Example data frames
demographics_df <- data.frame(
StudentID = c(1, 2, 3, 6),
Gender = c("F", "M", "F", "M"),
SES = c("High", "Low", "Medium", "Low")
)
# Performing a left join on StudentID
combined_left_df <- test_scores_df |> left_join(demographics_df, by = "StudentID")
combined_left_df StudentID Math_Score Reading_Score Gender SES
1 1 85 82 F High
2 2 90 88 M Low
3 3 78 80 F Medium
4 4 88 85 <NA> <NA>
5 5 92 91 <NA> <NA>
Explanation: - left_join() includes all students from test_scores_df and adds corresponding demographic information where available. Students without demographic data have NA in those columns.
6.1.3 Full Join: Comprehensive Analysis of Multiple Datasets
A full join merges all rows from both datasets, filling unmatched areas with NA.
# Performing a full join on StudentID
full_combined_df <- test_scores_df |> full_join(demographics_df, by = "StudentID")
full_combined_df StudentID Math_Score Reading_Score Gender SES
1 1 85 82 F High
2 2 90 88 M Low
3 3 78 80 F Medium
4 4 88 85 <NA> <NA>
5 5 92 91 <NA> <NA>
6 6 NA NA M Low
Explanation: - full_join() combines all rows from test_scores_df and demographics_df, retaining all students and filling in missing values with NA.
6.2 Reshaping Educational Data
Reshaping data is often necessary to perform certain types of analyses, such as tracking student performance across subjects or years.
6.2.1 pivot_longer(): Analyzing Subject Performance Over Time
pivot_longer() is useful for converting wide data (where each subject is a column) into long data, which is more suitable for longitudinal analysis.
# Example wide data frame
subject_scores_df <- data.frame(
StudentID = c(1, 2, 3),
Math_2019 = c(85, 88, 90),
Math_2020 = c(87, 85, 92),
Reading_2019 = c(82, 80, 88),
Reading_2020 = c(85, 83, 90)
)
# Converting to long format
long_scores_df <- subject_scores_df |>
pivot_longer(cols = -StudentID, names_to = c("Subject", "Year"), names_sep = "_", values_to = "Score")
long_scores_df# A tibble: 12 × 4
StudentID Subject Year Score
<dbl> <chr> <chr> <dbl>
1 1 Math 2019 85
2 1 Math 2020 87
3 1 Reading 2019 82
4 1 Reading 2020 85
5 2 Math 2019 88
6 2 Math 2020 85
7 2 Reading 2019 80
8 2 Reading 2020 83
9 3 Math 2019 90
10 3 Math 2020 92
11 3 Reading 2019 88
12 3 Reading 2020 90
Explanation: - pivot_longer() reshapes the dataset so that each subject’s scores across different years are in their own rows. The names_to and names_sep arguments split the column names into Subject and Year.
6.2.2 pivot_wider(): Comparing Yearly Performance by Subject
pivot_wider() converts long-format data back to wide format, which is useful for comparing scores across subjects or years side by side.
# Converting back to wide format
wide_scores_df <- long_scores_df |>
pivot_wider(names_from = c("Subject", "Year"), values_from = "Score")
wide_scores_df# A tibble: 3 × 5
StudentID Math_2019 Math_2020 Reading_2019 Reading_2020
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 85 87 82 85
2 2 88 85 80 83
3 3 90 92 88 90
Explanation: - pivot_wider() rearranges the data so that each subject and year combination becomes a separate column, allowing for easy comparison across different time points.
6.3 Handling Missing Data in Educational Datasets
Educational datasets often contain missing values due to various reasons, such as incomplete surveys or missing records. Proper handling of missing data is crucial for accurate analysis.
6.3.1 Identifying and Summarizing Missing Data
You can use is.na() to identify missing values and summarise() to count them.
# Identifying missing values in the combined dataset
missing_summary <- combined_left_df |>
summarise(across(everything(), ~sum(is.na(.))))
missing_summary StudentID Math_Score Reading_Score Gender SES
1 0 0 0 2 2
Explanation: - This code counts the number of missing values (NA) in each column of the combined_left_df dataset, providing a summary of data completeness.
6.3.2 Replacing Missing Values: Imputation
Imputation involves replacing missing values with substituted values. A common approach in educational data is to use the mean or median.
# Replacing missing SES values with "Unknown"
filled_df <- combined_left_df |>
mutate(SES = replace_na(SES, "Unknown"))
filled_df StudentID Math_Score Reading_Score Gender SES
1 1 85 82 F High
2 2 90 88 M Low
3 3 78 80 F Medium
4 4 88 85 <NA> Unknown
5 5 92 91 <NA> Unknown
Explanation: - replace_na() replaces NA values in the SES column with "Unknown", which is useful when the socioeconomic status is missing.
6.3.3 Removing Rows with Missing Values
In some cases, it may be necessary to remove rows with missing data to ensure the integrity of the analysis.
# Removing rows with missing Math scores
cleaned_scores_df <- combined_df |> drop_na(Math_Score)
cleaned_scores_df StudentID Math_Score Reading_Score SchoolID Funding_Per_Student
1 1 85 82 1 10000
2 2 90 88 1 10000
3 3 78 80 2 8000
4 4 88 85 2 8000
5 5 92 91 3 9000
Explanation: - drop_na() removes rows where the Math_Score is missing, ensuring that only complete cases are used in the analysis.
Exercises and Solutions
Exercise 1: Merging Student and Teacher Data
- Create two data frames,
student_scores_dfwith columnsStudentID,Math_Score, andReading_Score, andteacher_info_dfwith columnsTeacherID,StudentID, andTeacherExperience. Perform a left join to combine these datasets.
Solution:
# Creating the data frames
student_scores_df <- data.frame(
StudentID = c(1, 2, 3, 4),
Math_Score = c(85, 90, 78, 88),
Reading_Score = c(82, 88, 80, 85)
)
teacher_info_df <- data.frame(
TeacherID = c(101, 102, 101, 103),
StudentID = c(1, 2, 3, 4),
TeacherExperience = c(5, 10, 5, 7)
)
# Performing the left join
combined_teacher_df <- student_scores_df |> left_join(teacher_info_df, by = "StudentID")
combined_teacher_df StudentID Math_Score Reading_Score TeacherID TeacherExperience
1 1 85 82 101 5
2 2 90 88 102 10
3 3 78 80 101 5
4 4 88 85 103 7
Exercise 2: Reshaping Data for Cohort Analysis
- Start with a wide data frame
cohort_scores_dfwith columnsStudentID,Cohort_2019_Math,Cohort_2019_Reading,Cohort_2020_Math, andCohort_2020_Reading. Convert this data frame to a long format, separating cohort year and subject.
Solution:
# Creating the wide data frame
cohort_scores_df <- data.frame(
StudentID = c(1, 2, 3),
Cohort_2019_Math = c(85, 88, 90),
Cohort_2019_Reading = c(82, 80, 88),
Cohort_2020_Math = c(87, 85, 92),
Cohort_2020_Reading = c(85, 83, 90)
)
# Converting to long format
long_cohort_df <- cohort_scores_df |>
pivot_longer(
cols = -StudentID,
names_to = c("Cohort_Year", "Subject"), names_sep = "_", values_to = "Score"
)Warning: Expected 2 pieces. Additional pieces discarded in 4 rows [1, 2, 3, 4].
long_cohort_df# A tibble: 12 × 4
StudentID Cohort_Year Subject Score
<dbl> <chr> <chr> <dbl>
1 1 Cohort 2019 85
2 1 Cohort 2019 82
3 1 Cohort 2020 87
4 1 Cohort 2020 85
5 2 Cohort 2019 88
6 2 Cohort 2019 80
7 2 Cohort 2020 85
8 2 Cohort 2020 83
9 3 Cohort 2019 90
10 3 Cohort 2019 88
11 3 Cohort 2020 92
12 3 Cohort 2020 90
Exercise 3: Handling Missing Data in School Surveys
- Create a data frame
school_survey_dfwith columnsSchoolID,StudentSatisfaction, andTeacherTurnover. Introduce missing values into theStudentSatisfactioncolumn. Replace these missing values with the median of the available satisfaction scores.
Solution:
# Creating the data frame with missing values
school_survey_df <- data.frame(
SchoolID = c(1, 2, 3, 4),
StudentSatisfaction = c(85, NA, 78, 90),
TeacherTurnover = c(10, 15, 8, 12)
)
# Replacing missing values with the median
school_survey_df <- school_survey_df |>
mutate(StudentSatisfaction = replace_na(StudentSatisfaction, median(StudentSatisfaction, na.rm = TRUE)))
school_survey_df SchoolID StudentSatisfaction TeacherTurnover
1 1 85 10
2 2 85 15
3 3 78 8
4 4 90 12
Exercise 4: Combining Data with Different Keys
- Create two data frames,
school_expenditure_dfwith columnsSchoolID,Expenditure_Per_Student, andTotal_Expenditure, andstudent_scores_dfwithSchoolID,Math_Score, andReading_Score. Perform a full join onSchoolIDto combine these datasets, then filter to keep only schools with available expenditure data.
Solution:
# Creating the data frames
school_expenditure_df <- data.frame(
SchoolID = c(1, 2, 3),
Expenditure_Per_Student = c(10000, 8000, 9000),
Total_Expenditure = c(500000, 400000, 450000)
)
student_scores_df <- data.frame(
SchoolID = c(1, 2, 4),
Math_Score = c(85, 90, 78),
Reading_Score = c(82, 88, 80)
)
# Performing the full join
combined_expenditure_df <- school_expenditure_df |> full_join(student_scores_df, by = "SchoolID")
# Filtering to keep only schools with expenditure data
filtered_expenditure_df <- combined_expenditure_df |> filter(!is.na(Expenditure_Per_Student))
filtered_expenditure_df SchoolID Expenditure_Per_Student Total_Expenditure Math_Score Reading_Score
1 1 10000 500000 85 82
2 2 8000 400000 90 88
3 3 9000 450000 NA NA
Exercise 5: Reshaping Data for Trend Analysis
- Using a data frame
annual_scores_dfwith columnsYear,SchoolID,Average_Math_Score, andAverage_Reading_Score, convert the data from wide to long format to analyze trends in average scores over time.
Solution:
# Creating the wide data frame
annual_scores_df <- data.frame(
Year = c(2019, 2020, 2021),
SchoolID = c(1, 1, 1),
Average_Math_Score = c(78, 80, 82),
Average_Reading_Score = c(75, 77, 79)
)
# Converting to long format
long_annual_scores_df <- annual_scores_df |>
pivot_longer(cols = starts_with("Average"), names_to = "Subject", names_prefix = "Average_", values_to = "Score")
long_annual_scores_df# A tibble: 6 × 4
Year SchoolID Subject Score
<dbl> <dbl> <chr> <dbl>
1 2019 1 Math_Score 78
2 2019 1 Reading_Score 75
3 2020 1 Math_Score 80
4 2020 1 Reading_Score 77
5 2021 1 Math_Score 82
6 2021 1 Reading_Score 79