Hands-on: Data Wrangling

Author

Rony Rodriguez-Ramirez

Published

15 August 2024

Introduction

Welcome to the interactive hands-on session on migration data. This session is designed to be an interactive part of our website, allowing you to engage directly with the R code and data manipulation techniques discussed. For your convenience, the same script used here is available in R format within our Posit Cloud project, which you can access here. You are also welcome to use RStudio or Positron on your local machine to follow along.

Migration Data

The data on immigrant and emigrant stocks used in this session is sourced from the United Nations Department of Economic and Social Affairs (UN DESA).

How does the UN define a migrant?

According to the United Nations Population Division, an international migrant is someone who has been living for one year or longer in a country other than the one in which they were born. This definition includes many foreign workers and international students, as well as refugees and, in some cases, their descendants (such as Palestinians born in refugee camps outside of the Palestinian territories). Estimates of unauthorized immigrants living in various countries are also included in these totals.

Tourists, foreign-aid workers, temporary workers employed abroad for less than a year, and overseas military personnel are typically not counted as migrants.

Loading and Cleaning the Data

Let’s begin by loading our data using the read.csv function:

Note for Web-R Application Users

If you’re working on your local computer or in Posit Cloud, you can use the read_csv() function instead. The read.csv() function is used here due to bugs related to the Web-R application.

We can use the glimpse() function to check our columns (variables). Now, you realize that probably this dataset contains really nasty names. This is what we will usually called as “raw” data.

We’ll use the clean_names() function from janitor to standardize column names by converting them to snake_case, which makes them easier to work with.

This web-r application

This web-r application already loaded the required package. See at the top of this page. Therefore, remember to load the required packages when you test yourself or use the posit cloud project.

This should be sufficient for the type of names we have.

Exercise 1

Part 1:. Let’s focus on a subset of countries in Central America. We want to analyze data from Nicaragua, El Salvador, Costa Rica, Panama, Guatemala, and Belize, and only consider the years from 1990 to 2020. Test below and select any countries you would like

Hint: You’ll want to change something in the code that creates migration_filtered.

Part 2: Summarizing the Data. Now that we have our filtered dataset, let’s summarize the data to calculate the mean number of emigrants by country.

What’s the issue with the above code?

HINT: Think about what happens if there are missing values (NA) in the ‘emigrants’ column.

The mean() function by default includes NA values, which will return NA as the result if any NA values are present. We need to handle missing values properly.

Try below and correct the code:

Part 3: Add the summarize function more stats: such as observations, min year, and max year.

Exercise: What could be wrong with this approach?

HINT: Are all the observations being used in the calculation? What about years with missing data?

Part 4: Identifying Missing Data

Let’s investigate how many missing values exist for each country in the ‘emigrants’ variable. This will help us understand how missing data might affect our summary statistics.

Here, we’re counting the number of missing values (n_missing) for each country, and calculating the proportion of missing data (missing_rate).

Part 5: Re-doing the Summary with Improved Understanding

Now, let’s improve our summary by accounting for missing values. We’ll calculate the mean only where data is available, and ensure that our count of observations reflects only those used in the mean calculation.

Part 6: Simplifying the Process

To avoid handling missing data in multiple steps, we can filter out the missing values before grouping and summarizing. This ensures our calculations are straightforward and accurate. Test it in the chunk below

Exercise 2

In this exercise, we will explore how to summarize multiple variables at once using the across() function. We will start with a simple example and gradually build up to more complex summaries, including calculating multiple statistics (mean, standard deviation, etc.). Finally, we’ll create a function to automate this process for any set of variables.

Part 1: Selecting Relevant Variables

Suppose we want to analyze multiple columns: emigrants, and international_migrants. Let’s start by selecting these variables.

Part 2: Summarizing Multiple Variables

We want to calculate the mean of both emigrants and international_migrants for each country. Let’s use the across() function to do this.

Warning

What issue might arise with the above code? HINT: Think about how missing values (NA) are handled in the mean() function.

Explanation: As in the previous exercise, the mean calculation will return NA if there are any missing values. We need to handle these missing values properly.

Part 3: Handling Missing Values

Let’s modify the code to remove the missing values before calculating the mean.

Part 4: Calculating Multiple Statistics

What if we want to calculate additional statistics, such as the standard deviation? We can use the list() function within across() to calculate both the mean and standard deviation.

Part 5 (HARD): Creating a Function

Let’s create a more advanced exercise. What if we want to automate this process so that we can apply it to any set of variables? We’ll create a function that takes two arguments: the dataset we want to analyze and the set of variables we want to summarize.

Hint: When creating the summarize_migration_data function, start by understanding that the purpose is to summarize several variables based on a grouping variable, such as a country. You’ll need to define parameters for the dataset (data), the grouping variable (group_var), and the variables you want to summarize (summary_vars).

To group the data by the specified variable, use group_by(). This should be done with across(all_of(group_var)) to ensure the grouping works dynamically with the variable(s) passed to the function.

Next, you’ll summarize the data using summarize() and across(). Within across(), apply a list of functions to calculate the number of observations (n), mean, standard deviation (sd), minimum (min), and maximum (max). It’s important to handle missing values using na.rm = TRUE for the mean, standard deviation, minimum, and maximum calculations to avoid issues with missing data.

After calculating the summaries, you’ll need to reshape the data for better readability. Use pivot_longer() to transform the summarized data into a long format, where each row represents a combination of a statistic and a variable. Then, use pivot_wider() to pivot the data back into a wide format, with each statistic as a column. This step helps in organizing the results in a structured manner.

Throughout this process, pay attention to how you name the output columns in the summarize() step. The .names argument in across() should be set up to clearly label each statistic with both the function and the variable names, ensuring clarity in the final output.

Finally, ensure that your function is flexible enough to be applied to different datasets and variables by testing it with various inputs. This will confirm that it performs as expected across different scenarios.

Let’s test the function: