# Your code here
Lab 2: Tidy Data and Visualization in R โ PISA
Introduction
In this lab session, youโll work with a dataset containing information about PISA data. The data comes from the learningtower
package. Throughout the lab, you will use the following R functions to achieve the final visualization: filter()
, group_by()
, summarise()
, pivot_longer()
, mutate()
, ggplot()
, aes()
, geom_line()
, geom_point()
, facet_wrap()
, scale_y_continuous()
, labs()
, and theme_minimal()
.
Description of the Dataset
The dataset contains the following variables:
- year: Year of the PISA data. Factor.
- country: Country 3-character code. Note that some regions/territories are coded as country for ease of input. Factor.
- school_id: The school identification number, unique for each country and year combination. Factor.
- student_id: The student identification number, unique for each school, country, and year combination. Factor.
- mother_educ: Highest level of motherโs education. Ranges from โless than ISCED1โ to โISCED 3Aโ. Factor. Note that in 2000, all entries are missing.
- father_educ: Highest level of fatherโs education. Ranges from โless than ISCED1โ to โISCED 3Aโ. Factor. Note that in 2000, all entries are missing.
- gender: Gender of the student. Only โmaleโ and โfemaleโ are recorded. Factor. Note that we call this variable gender and not sex as this term was used in the OECD PISA database.
- computer: Possession of a computer. Only โyesโ and โnoโ are recorded. Factor.
- internet: Access to the internet. Only โyesโ and โnoโ are recorded. Factor.
- math: Simulated score in mathematics. Numeric.
- read: Simulated score in reading. Numeric.
- science: Simulated score in science. Numeric.
- stu_wgt: The final survey weight score for the student. Numeric.
- desk: Possession of a desk to study at. Only โyesโ and โnoโ are recorded. Factor.
- room: Possession of a room of your own. Only โyesโ and โnoโ are recorded. Factor.
- dishwasher: Possession of a dishwasher. Only โyesโ and โnoโ are recorded. Factor. Note that in 2015 and 2018, all entries are missing.
- television: Number of televisions. โ0โ, โ1โ, โ2โ are coded for no, one, and two TVs in the house. โ3+โ codes for three or more TVs. Factor. Note that in 2003, all entries are missing.
- computer_n: Number of computers. โ0โ, โ1โ, โ2โ are coded for no, one, and two computers in the house. โ3+โ codes for three or more computers. Factor. Note that in 2003, all entries are missing.
- car: Number of cars. โ0โ, โ1โ, โ2โ are coded for no, one, and two cars in the house. โ3+โ codes for three or more cars. Factor. Note that in 2003, all entries are missing.
- book: Number of books. Factor. Note that encoding is different in the years 2000 and 2003 compared to all other years. Evaluate
table(student$book, student$year)
for a demo. - wealth: Family wealth. Numeric. Note that in 2003, all entries are missing.
- escs: Index of economic, social, and cultural status. Numeric.
Exercises
Exercise 0: Load your package and data
Letโs start by loading the tidyverse package and the data:
Exercise 1: Filtering the Data
Start by filtering the dataset to focus on three specific countries: Canada (CAN), the United States (USA), and Mexico (MEX). You can select any country you have. The data uses country 3 character codes. If you donโt know your countryโs iso code, you can find it here.
# Your code here
Exercise 2: Summarizing the Data
Now that you have filtered the dataset, letโs summarize the PISA scores in mathematics, reading, and science for each country by year. Use the weighted.mean()
function to calculate the weighted average.
Why Use Weighting?
In survey data like PISA, not all students represent the same number of individuals in the population. The stu_wgt variable represents the survey weight, which adjusts for the survey design, non-response, and post-stratification. Applying weights is crucial because it ensures that the results are representative of the broader population. Without weighting, the mean scores could be biased, particularly if certain groups of students (e.g., from specific regions or demographic backgrounds) are over- or under-represented in the sample.
# Your code here
Exercise 3: Reshaping the Data (Tidying)
How Should Our Data Look Like?
To create an effective visualization, our data needs to be in a tidy format. Specifically, each row should represent a single observation, and each variable should be in its own column. In this case, instead of having separate columns for each test score (math, read, science), we should have a single column for the test scores and another column that identifies the type of test (e.g., math, read, science). This structure allows us to easily plot the scores across different tests and countries.
Use pivot_longer()
To achieve this tidy format, we will use the pivot_longer()
function to reshape the data. This function takes multiple columns and condenses them into key-value pairs, which results in a longer, more flexible dataset. Assign the new format to the object pisa_long
# Your code here
After running this code, your data will have four key columns: year, country, and test, with the scores in a single column named score. This format is ideal for visualization in ggplot2.
Question: How do you think the data should be formatted for effective visualization? Why is the long format preferred in this context?
Exercise 4: Data Transformation
Use the mutate()
function to capitalize the test names. This will make the final plot more readable. Re-assign the updated test
variable to the same object pisa_long
. Hint Look at the str_to_title()
function.
# Your code here
Exercise 5: Creating a Line Plot with Points
In this exercise, you will create a line plot of the PISA scores over time for each country. Line plots are useful for visualizing trends over time, while adding points helps to emphasize the data at specific time intervals. Create a ggplot object and call it pisa_plot
.
Task: Use geom_line()
to connect the data points and geom_point()
to highlight each data point on the plot.
# Your code here
Question: What is the main issue with this plot?
Exercise 6: Adding Facets to the Plot
Faceting allows you to create multiple plots based on the levels of a factor variable. In this case, we will use facets to separate the plots by the type of test (math, read, science). This makes it easier to compare the trends in different test scores.
Task: Add facet_wrap(~ test)
to the plot to create a separate panel for each test.
# Your code here
Exercise 7: Customizing Labels and Scales
Labels and scales are essential for making your plot informative and easy to read. In this exercise, you will add labels to the axes and title, and adjust the scale of the y-axis to ensure the plot is correctly displayed.
Task: Use the labs()
function to add labels and titles, and scale_y_continuous()
to set the limits of the y-axis (Hint: Look for the limits
argument).
# Your code here
Extra exercises
Exercise 8: Customizing the Plot
Experiment with different themes and color palettes to make the plot more visually appealing.
Exercise 9: Adding Context to the Visualization
Add annotations or text to the plot to highlight significant events or changes in the data.
Conclusion
By completing these exercises, you have learned how to filter, summarize, and visualize data in R. You have also gained insights into how different formatting and visualization techniques can impact the interpretation of your data. As you continue to work with data, remember the importance of context and clear communication in your visualizations.