Lab 2: Tidy Data and Visualization in R โ€“ PISA

Author

Rony Rodriguez-Ramirez

Published

21 August 2024

Introduction

In this lab session, youโ€™ll work with a dataset containing information about PISA data. The data comes from the learningtower package. Throughout the lab, you will use the following R functions to achieve the final visualization: filter(), group_by(), summarise(), pivot_longer(), mutate(), ggplot(), aes(), geom_line(), geom_point(), facet_wrap(), scale_y_continuous(), labs(), and theme_minimal().

Description of the Dataset

The dataset contains the following variables:

  • year: Year of the PISA data. Factor.
  • country: Country 3-character code. Note that some regions/territories are coded as country for ease of input. Factor.
  • school_id: The school identification number, unique for each country and year combination. Factor.
  • student_id: The student identification number, unique for each school, country, and year combination. Factor.
  • mother_educ: Highest level of motherโ€™s education. Ranges from โ€œless than ISCED1โ€ to โ€œISCED 3Aโ€. Factor. Note that in 2000, all entries are missing.
  • father_educ: Highest level of fatherโ€™s education. Ranges from โ€œless than ISCED1โ€ to โ€œISCED 3Aโ€. Factor. Note that in 2000, all entries are missing.
  • gender: Gender of the student. Only โ€œmaleโ€ and โ€œfemaleโ€ are recorded. Factor. Note that we call this variable gender and not sex as this term was used in the OECD PISA database.
  • computer: Possession of a computer. Only โ€œyesโ€ and โ€œnoโ€ are recorded. Factor.
  • internet: Access to the internet. Only โ€œyesโ€ and โ€œnoโ€ are recorded. Factor.
  • math: Simulated score in mathematics. Numeric.
  • read: Simulated score in reading. Numeric.
  • science: Simulated score in science. Numeric.
  • stu_wgt: The final survey weight score for the student. Numeric.
  • desk: Possession of a desk to study at. Only โ€œyesโ€ and โ€œnoโ€ are recorded. Factor.
  • room: Possession of a room of your own. Only โ€œyesโ€ and โ€œnoโ€ are recorded. Factor.
  • dishwasher: Possession of a dishwasher. Only โ€œyesโ€ and โ€œnoโ€ are recorded. Factor. Note that in 2015 and 2018, all entries are missing.
  • television: Number of televisions. โ€œ0โ€, โ€œ1โ€, โ€œ2โ€ are coded for no, one, and two TVs in the house. โ€œ3+โ€ codes for three or more TVs. Factor. Note that in 2003, all entries are missing.
  • computer_n: Number of computers. โ€œ0โ€, โ€œ1โ€, โ€œ2โ€ are coded for no, one, and two computers in the house. โ€œ3+โ€ codes for three or more computers. Factor. Note that in 2003, all entries are missing.
  • car: Number of cars. โ€œ0โ€, โ€œ1โ€, โ€œ2โ€ are coded for no, one, and two cars in the house. โ€œ3+โ€ codes for three or more cars. Factor. Note that in 2003, all entries are missing.
  • book: Number of books. Factor. Note that encoding is different in the years 2000 and 2003 compared to all other years. Evaluate table(student$book, student$year) for a demo.
  • wealth: Family wealth. Numeric. Note that in 2003, all entries are missing.
  • escs: Index of economic, social, and cultural status. Numeric.

Exercises

Exercise 0: Load your package and data

Letโ€™s start by loading the tidyverse package and the data:

# Your code here

Exercise 1: Filtering the Data

Start by filtering the dataset to focus on three specific countries: Canada (CAN), the United States (USA), and Mexico (MEX). You can select any country you have. The data uses country 3 character codes. If you donโ€™t know your countryโ€™s iso code, you can find it here.

# Your code here

Exercise 2: Summarizing the Data

Now that you have filtered the dataset, letโ€™s summarize the PISA scores in mathematics, reading, and science for each country by year. Use the weighted.mean() function to calculate the weighted average.

Why Use Weighting?

In survey data like PISA, not all students represent the same number of individuals in the population. The stu_wgt variable represents the survey weight, which adjusts for the survey design, non-response, and post-stratification. Applying weights is crucial because it ensures that the results are representative of the broader population. Without weighting, the mean scores could be biased, particularly if certain groups of students (e.g., from specific regions or demographic backgrounds) are over- or under-represented in the sample.

# Your code here

Exercise 3: Reshaping the Data (Tidying)

How Should Our Data Look Like?

To create an effective visualization, our data needs to be in a tidy format. Specifically, each row should represent a single observation, and each variable should be in its own column. In this case, instead of having separate columns for each test score (math, read, science), we should have a single column for the test scores and another column that identifies the type of test (e.g., math, read, science). This structure allows us to easily plot the scores across different tests and countries.

Use pivot_longer()

To achieve this tidy format, we will use the pivot_longer() function to reshape the data. This function takes multiple columns and condenses them into key-value pairs, which results in a longer, more flexible dataset. Assign the new format to the object pisa_long

# Your code here

After running this code, your data will have four key columns: year, country, and test, with the scores in a single column named score. This format is ideal for visualization in ggplot2.

Question: How do you think the data should be formatted for effective visualization? Why is the long format preferred in this context?

Exercise 4: Data Transformation

Use the mutate() function to capitalize the test names. This will make the final plot more readable. Re-assign the updated test variable to the same object pisa_long. Hint Look at the str_to_title() function.

# Your code here

Exercise 5: Creating a Line Plot with Points

In this exercise, you will create a line plot of the PISA scores over time for each country. Line plots are useful for visualizing trends over time, while adding points helps to emphasize the data at specific time intervals. Create a ggplot object and call it pisa_plot.

Task: Use geom_line() to connect the data points and geom_point() to highlight each data point on the plot.

# Your code here

Question: What is the main issue with this plot?

Exercise 6: Adding Facets to the Plot

Faceting allows you to create multiple plots based on the levels of a factor variable. In this case, we will use facets to separate the plots by the type of test (math, read, science). This makes it easier to compare the trends in different test scores.

Task: Add facet_wrap(~ test) to the plot to create a separate panel for each test.

# Your code here

Exercise 7: Customizing Labels and Scales

Labels and scales are essential for making your plot informative and easy to read. In this exercise, you will add labels to the axes and title, and adjust the scale of the y-axis to ensure the plot is correctly displayed.

Task: Use the labs() function to add labels and titles, and scale_y_continuous() to set the limits of the y-axis (Hint: Look for the limits argument).

# Your code here

Extra exercises

Exercise 8: Customizing the Plot

Experiment with different themes and color palettes to make the plot more visually appealing.

Exercise 9: Adding Context to the Visualization

Add annotations or text to the plot to highlight significant events or changes in the data.

Conclusion

By completing these exercises, you have learned how to filter, summarize, and visualize data in R. You have also gained insights into how different formatting and visualization techniques can impact the interpretation of your data. As you continue to work with data, remember the importance of context and clear communication in your visualizations.