library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr)
Rony Rodriguez-Ramirez
In this tutorial, we will delve deeper into data analysis using the Tidyverse, focusing on more advanced data manipulation techniques. We will explore grouping data, summarizing results, and transforming data into a tidy format. These techniques are crucial for conducting meaningful data analysis, particularly in the context of educational data. We will load the dplyr and tidyr packages from the tidyverse for this tutorial. Remember that if you haven’t installed the tidyverse, you can do so with install.packages("tidyverse").
One of the most powerful features of dplyr is the ability to group data by one or more variables and then summarize each group using a variety of summary statistics.
The group_by() function allows you to group data by one or more variables. This is often the first step in data analysis when you want to calculate summary statistics for different groups within your dataset.
# Example data frame
students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Major = c("Economics", "Economics", "History", "Biology", "History"),
  Score = c(85, 90, 78, 88, 92)
)

# Grouping by Major
grouped_df <- students_df |>
  group_by(Major)
grouped_df

# A tibble: 5 × 3
# Groups:   Major [3]
  Name    Major     Score
  <chr>   <chr>     <dbl>
1 Alice   Economics    85
2 Bob     Economics    90
3 Charlie History      78
4 David   Biology      88
5 Eva     History      92
Explanation:
- group_by() creates a grouped data frame where operations can be applied separately to each group.
- In this example, students are grouped by their Major.
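Grouping by more than one variable works the same way. The sketch below is only an illustration: the Year column and its values are hypothetical and not part of the tutorial's data. It uses the dplyr helpers group_vars(), n_groups(), and ungroup() to inspect and then remove the grouping.

```r
library(dplyr)

# Same students as in the example, plus a hypothetical Year column (an assumption)
students_year_df <- data.frame(
  Name  = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Major = c("Economics", "Economics", "History", "Biology", "History"),
  Year  = c(1, 2, 1, 1, 1),
  Score = c(85, 90, 78, 88, 92)
)

# Group by two variables: each distinct Major/Year pair forms one group
by_major_year <- students_year_df |>
  group_by(Major, Year)

group_vars(by_major_year)  # names of the grouping variables
n_groups(by_major_year)    # number of distinct Major/Year combinations

# ungroup() removes the grouping so later steps act on the whole table
ungrouped_df <- by_major_year |>
  ungroup()
```

Calling ungroup() once the grouped work is done is a common precaution, since a grouping that lingers can silently change the result of later steps.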
Once the data is grouped, you can use the summarise() function to calculate summary statistics, such as the mean, median, or count, for each group.
# Calculating the average score for each major
summary_df <- grouped_df |>
  summarise(AverageScore = mean(Score))
summary_df

# A tibble: 3 × 2
  Major     AverageScore
  <chr>            <dbl>
1 Biology           88
2 Economics         87.5
3 History           85
Explanation:
- summarise() creates a new data frame with summary statistics calculated for each group. Here, the mean score is calculated for each Major.
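Real educational data often contains missing values, and mean() returns NA for any group that has one. The following sketch uses a small hypothetical data frame (the NA score is an assumption made for illustration) to show how na.rm = TRUE changes the group summary.

```r
library(dplyr)

# Hypothetical data with one missing score (an assumption for illustration)
scores_na_df <- data.frame(
  Major = c("Economics", "Economics", "History", "History"),
  Score = c(85, NA, 78, 92)
)

na_summary_df <- scores_na_df |>
  group_by(Major) |>
  summarise(
    MeanWithNA = mean(Score),               # NA when any score in the group is missing
    MeanNoNA   = mean(Score, na.rm = TRUE)  # drops missing scores before averaging
  )
na_summary_df
```

Whether dropping missing scores is appropriate depends on why they are missing, so treat na.rm = TRUE as a choice to justify rather than a default.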
You can combine group_by() and summarise() to perform complex analyses on your data.
# Calculating both the average score and the number of students in each major
summary_df <- students_df |>
  group_by(Major) |>
  summarise(
    AverageScore = mean(Score),
    StudentCount = n()
  )
summary_df

# A tibble: 3 × 3
  Major     AverageScore StudentCount
  <chr>            <dbl>        <int>
1 Biology           88              1
2 Economics         87.5            2
3 History           85              2
Explanation:
- This example shows how to calculate multiple summary statistics at once. The n() function counts the number of observations in each group.
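When the group size is the only statistic needed, dplyr's count() is a shorthand for group_by() followed by summarise(n()). A minimal sketch on the same student data, where the output column name StudentCount is chosen here to match the example:

```r
library(dplyr)

students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Major = c("Economics", "Economics", "History", "Biology", "History"),
  Score = c(85, 90, 78, 88, 92)
)

# count() groups by Major and counts the rows in each group in one step
major_counts <- students_df |>
  count(Major, name = "StudentCount")
major_counts
```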
The mutate() function allows you to create new variables or modify existing ones, while transmute() does the same but only keeps the newly created variables.
# Adding a new column for standardized scores
students_df <- students_df |>
  mutate(
    StandardizedScore = (Score - mean(Score)) / sd(Score)
  )
students_df

     Name     Major Score StandardizedScore
1   Alice Economics    85        -0.2930973
2     Bob Economics    90         0.6228318
3 Charlie   History    78        -1.5753981
4   David   Biology    88         0.2564602
5     Eva   History    92         0.9892035
Explanation:
- mutate() creates a new column StandardizedScore, which standardizes the Score by subtracting the mean and dividing by the standard deviation.
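mutate() also respects grouping: on a grouped data frame, functions such as mean() are evaluated within each group rather than over the whole table. The sketch below illustrates this by comparing each score with the mean of its own major; the column names MajorMean and DiffFromMajorMean are illustrative choices, not part of the tutorial.

```r
library(dplyr)

students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Major = c("Economics", "Economics", "History", "Biology", "History"),
  Score = c(85, 90, 78, 88, 92)
)

centered_df <- students_df |>
  group_by(Major) |>
  mutate(
    MajorMean = mean(Score),              # mean computed separately for each Major
    DiffFromMajorMean = Score - MajorMean # distance of each student from that mean
  ) |>
  ungroup()                               # drop the grouping once it is no longer needed
centered_df
```

Unlike summarise(), a grouped mutate() keeps one row per student, so the group-level value is repeated on every row of the group.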
If you only want to keep the newly created variables, use transmute().
# Keeping only the standardized scores
standardized_df <- students_df |>
  transmute(
    Name, StandardizedScore = (Score - mean(Score)) / sd(Score)
  )
standardized_df

     Name StandardizedScore
1   Alice        -0.2930973
2     Bob         0.6228318
3 Charlie        -1.5753981
4   David         0.2564602
5     Eva         0.9892035
Explanation:
- transmute() creates the StandardizedScore column and drops all other columns except Name.
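Recent dplyr releases mark transmute() as superseded in favor of mutate() with the .keep argument. The sketch below shows what should be an equivalent call, assuming a dplyr version that supports .keep (1.0.0 or later); the object name standardized_keep_df is an illustrative choice.

```r
library(dplyr)

students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Major = c("Economics", "Economics", "History", "Biology", "History"),
  Score = c(85, 90, 78, 88, 92)
)

# .keep = "none" retains only the columns named inside mutate()
standardized_keep_df <- students_df |>
  mutate(
    Name,
    StandardizedScore = (Score - mean(Score)) / sd(Score),
    .keep = "none"
  )
standardized_keep_df
```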
Data often needs to be reshaped from wide to long format or vice versa. The pivot_longer() and pivot_wider() functions from the tidyr package are used for this purpose.
# Example wide data frame
wide_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Math_Score = c(85, 90, 78),
  History_Score = c(88, 80, 90)
)

# Converting to long format
long_df <- wide_df |>
  pivot_longer(
    cols = contains("Score"),
    names_to = "Subject",
    values_to = "Score"
  )
long_df

# A tibble: 6 × 3
  Name    Subject       Score
  <chr>   <chr>         <dbl>
1 Alice   Math_Score       85
2 Alice   History_Score    88
3 Bob     Math_Score       90
4 Bob     History_Score    80
5 Charlie Math_Score       78
6 Charlie History_Score    90
Explanation:
- pivot_longer() converts the data from wide format (separate columns for each subject) to long format (one column for subjects and one for scores).
- The cols argument specifies which columns to pivot, names_to specifies the name of the new variable that will contain the original column names, and values_to specifies the name of the variable that will contain the values.
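In the output above, the Subject values still carry the _Score suffix from the original column names. As a possible refinement, pivot_longer() accepts a names_pattern regular expression whose capture group becomes the value stored in names_to. The sketch below is one way to do this; the object name long_clean_df is an illustrative choice.

```r
library(tidyr)

wide_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Math_Score = c(85, 90, 78),
  History_Score = c(88, 80, 90)
)

# names_pattern keeps only the part of each column name before "_Score"
long_clean_df <- wide_df |>
  pivot_longer(
    cols = ends_with("_Score"),
    names_to = "Subject",
    names_pattern = "(.*)_Score",
    values_to = "Score"
  )
long_clean_df
```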
# Converting long format back to wide format
wide_again_df <- long_df |>
  pivot_wider(
    names_from = Subject,
    values_from = Score
  )
wide_again_df

# A tibble: 3 × 3
  Name    Math_Score History_Score
  <chr>        <dbl>         <dbl>
1 Alice           85            88
2 Bob             90            80
3 Charlie         78            90
Explanation:
- pivot_wider() is the opposite of pivot_longer(). It spreads key-value pairs across multiple columns, converting the data back to wide format.
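When a name/subject combination is absent from the long data, pivot_wider() fills the corresponding cell with NA. The sketch below uses a small hypothetical data frame (Bob has no History score, an assumption for illustration) and shows the values_fill argument as one way to supply a different placeholder.

```r
library(tidyr)

# Hypothetical long data in which Bob has no History score
incomplete_long_df <- data.frame(
  Name = c("Alice", "Alice", "Bob"),
  Subject = c("Math", "History", "Math"),
  Score = c(85, 88, 90)
)

# Default behaviour: the missing combination becomes NA
wide_default_df <- incomplete_long_df |>
  pivot_wider(names_from = Subject, values_from = Score)

# values_fill replaces those implicit gaps with a chosen value
wide_filled_df <- incomplete_long_df |>
  pivot_wider(names_from = Subject, values_from = Score, values_fill = 0)
```

For scores, leaving the gap as NA is usually the safer choice, because a filled-in 0 is indistinguishable from a real score of zero.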
1. Using the students_df data frame, group the data by Major.
2. Summarize the data for each Major.
Solution:
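One possible solution is sketched below. It assumes the summary wanted for each Major is the average Score and the number of students; the column names AverageScore and StudentCount are illustrative choices.

```r
library(dplyr)

students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Major = c("Economics", "Economics", "History", "Biology", "History"),
  Score = c(85, 90, 78, 88, 92)
)

# Step 1: Grouping by Major
# Step 2: Summarizing each Major (assumed: mean score and group size)
major_summary_df <- students_df |>
  group_by(Major) |>
  summarise(
    AverageScore = mean(Score),
    StudentCount = n()
  )
major_summary_df
```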
1. Add a new column to students_df called ScoreCategory that categorizes scores as “High” (>= 85) or “Low” (< 85).
2. Modify the ScoreCategory column to reflect “Very High” for scores >= 90.
Solution:
# Step 1: Adding ScoreCategory column
students_df <- students_df |>
  mutate(
    ScoreCategory = ifelse(Score >= 85, "High", "Low")
  )

# Step 2: Modifying ScoreCategory for Very High scores
students_df <- students_df |>
  mutate(
    ScoreCategory = ifelse(Score >= 90, "Very High", ScoreCategory)
  )
students_df

     Name     Major Score StandardizedScore ScoreCategory
1   Alice Economics    85        -0.2930973          High
2     Bob Economics    90         0.6228318     Very High
3 Charlie   History    78        -1.5753981           Low
4   David   Biology    88         0.2564602          High
5     Eva   History    92         0.9892035     Very High
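The two ifelse() steps can also be written as a single case_when() call, which checks its conditions in order and uses the first one that matches. This is an alternative sketch, not the tutorial's own solution; the object name categorized_df is an illustrative choice.

```r
library(dplyr)

students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Major = c("Economics", "Economics", "History", "Biology", "History"),
  Score = c(85, 90, 78, 88, 92)
)

# Conditions are evaluated top to bottom, so the strictest cutoff comes first
categorized_df <- students_df |>
  mutate(
    ScoreCategory = case_when(
      Score >= 90 ~ "Very High",
      Score >= 85 ~ "High",
      TRUE        ~ "Low"   # fallback for every remaining score
    )
  )
categorized_df
```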
1. Pivot the wide_df data frame from wide to long format using pivot_longer().
2. Rename the Subject column to Course in the long data frame.
Solution:
# Step 1: Pivoting from wide to long format
long_df <- wide_df |>
  pivot_longer(
    cols = contains("Score"),
    names_to = "Subject",
    values_to = "Score"
  )

# Step 2: Renaming Subject to Course
long_df <- long_df |>
  rename(Course = Subject)
long_df

# A tibble: 6 × 3
  Name    Course        Score
  <chr>   <chr>         <dbl>
1 Alice   Math_Score       85
2 Alice   History_Score    88
3 Bob     Math_Score       90
4 Bob     History_Score    80
5 Charlie Math_Score       78
6 Charlie History_Score    90
1. Using students_df, create a new column called AdjustedScore where each score is increased by 5%.
2. Group the data by Major and calculate the average AdjustedScore for each Major.
Solution:
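A possible solution is sketched below. It interprets "increased by 5%" as multiplying each Score by 1.05; the column name AverageAdjustedScore is an illustrative choice.

```r
library(dplyr)

students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Major = c("Economics", "Economics", "History", "Biology", "History"),
  Score = c(85, 90, 78, 88, 92)
)

# Step 1: Adding AdjustedScore (each score raised by 5%)
adjusted_df <- students_df |>
  mutate(AdjustedScore = Score * 1.05)

# Step 2: Averaging AdjustedScore within each Major
adjusted_summary_df <- adjusted_df |>
  group_by(Major) |>
  summarise(AverageAdjustedScore = mean(AdjustedScore))
adjusted_summary_df
```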
1. Using long_df, convert the data back to wide format with pivot_wider().
2. Use Name as rows and Course as columns, with scores as values.
Solution:
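A possible solution is sketched below. It rebuilds long_df as it stands after the previous exercise (with the Course column) so that the code runs on its own; the object name wide_course_df is an illustrative choice.

```r
library(tidyr)

# long_df as produced in the previous exercise (Subject already renamed to Course)
long_df <- data.frame(
  Name = c("Alice", "Alice", "Bob", "Bob", "Charlie", "Charlie"),
  Course = c("Math_Score", "History_Score", "Math_Score",
             "History_Score", "Math_Score", "History_Score"),
  Score = c(85, 88, 90, 80, 78, 90)
)

# Each Name becomes one row, each Course becomes one column of scores
wide_course_df <- long_df |>
  pivot_wider(
    names_from = Course,
    values_from = Score
  )
wide_course_df
```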