ββ Attaching core tidyverse packages ββββββββββββββββββββββββ tidyverse 2.0.0 ββ
β dplyr 1.1.4 β readr 2.1.5
β forcats 1.0.0 β stringr 1.5.1
β ggplot2 3.5.1 β tibble 3.2.1
β lubridate 1.9.3 β tidyr 1.3.1
β purrr 1.0.2
ββ Conflicts ββββββββββββββββββββββββββββββββββββββββββ tidyverse_conflicts() ββ
β dplyr::filter() masks stats::filter()
β dplyr::lag() masks stats::lag()
βΉ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Tutorial 3: Introduction to Tidyverse and the Base Pipe
Introduction to Tidyverse and the Base Pipe
In this tutorial, we will introduce the Tidyverse, a collection of R packages designed for data science. We will also explore the base pipe (|>), which allows for cleaner and more readable code by chaining operations. This tutorial will guide you through essential data manipulation functions using the dplyr package within the Tidyverse, with examples focused on education data.
3.1 Introduction to Tidyverse
The Tidyverse is a suite of R packages that work together to simplify data manipulation, exploration, and visualization. Key packages in the Tidyverse include dplyr, ggplot2, tidyr, readr, and more. In this tutorial, weβll focus on dplyr, which is used for data manipulation.
3.1.1 Installing and Loading Tidyverse
Before using Tidyverse functions, you need to install and load the package.
# Install Tidyverse (if not already installed)
install.packages("tidyverse")
# Load Tidyverse
library(tidyverse)Explanation: - The install.packages() function installs the Tidyverse package if it isnβt already installed on your system. - The library() function loads the Tidyverse, making its functions available for use.
3.1.2 Introduction to the Base Pipe (|>)
The base pipe operator |> was introduced in R 4.1.0. It allows for cleaner, more readable code by enabling a sequence of operations to be chained together.
# Example using base pipe to calculate the mean of a vector
scores <- c(85, 90, 78, 92, 88)
mean_score <- scores |> mean()
mean_score[1] 86.6
Explanation: - The |> operator takes the output of the left-hand expression and passes it as the first argument to the function on the right-hand side. - In this example, scores is passed to the mean() function to calculate the mean score.
3.2 dplyr Basics
dplyr is the main package within the Tidyverse for data manipulation. It provides a set of intuitive functions for working with data frames.
3.2.1 select(): Selecting Columns
The select() function allows you to choose specific columns from a data frame.
# Example data frame
students_df <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(20, 21, 19),
Score = c(85, 90, 78),
Major = c("Economics", "History", "Biology")
)
# Selecting Name and Score columns
selected_df <- students_df |> select(Name, Score)
selected_df Name Score
1 Alice 85
2 Bob 90
3 Charlie 78
Explanation: - select() is used to pick specific columns from the data frame. - The base pipe |> passes students_df to the select() function.
3.2.2 filter(): Filtering Rows
The filter() function allows you to filter rows based on specific conditions.
# Filtering students with a score greater than 80
filtered_df <- students_df |> filter(Score > 80)
filtered_df Name Age Score Major
1 Alice 20 85 Economics
2 Bob 21 90 History
Explanation: - filter() is used to select rows that meet a condition. Here, only students with a score greater than 80 are included.
3.2.3 arrange(): Arranging Rows
The arrange() function orders the rows of a data frame based on the values of specified columns.
# Arranging students by score in descending order
arranged_df <- students_df |> arrange(desc(Score))
arranged_df Name Age Score Major
1 Bob 21 90 History
2 Alice 20 85 Economics
3 Charlie 19 78 Biology
Explanation: - arrange() orders the rows based on the specified column. desc() is used to sort in descending order.
3.2.4 mutate(): Creating New Variables
The mutate() function creates new variables or modifies existing ones within a data frame.
# Adding a new column for grade based on score
students_df <- students_df |> mutate(Grade = ifelse(Score >= 85, "A", "B"))
students_df Name Age Score Major Grade
1 Alice 20 85 Economics A
2 Bob 21 90 History A
3 Charlie 19 78 Biology B
Explanation: - mutate() adds a new column Grade, where students with a score of 85 or higher receive an βAβ grade, and others receive a βBβ.
3.2.5 summarise() and group_by(): Summarizing Data
The summarise() function, used in conjunction with group_by(), allows you to compute summary statistics for groups of data.
# Grouping by Major and calculating the average score for each group
summary_df <- students_df |>
group_by(Major) |>
summarise(AverageScore = mean(Score))
summary_df# A tibble: 3 Γ 2
Major AverageScore
<chr> <dbl>
1 Biology 78
2 Economics 85
3 History 90
Explanation: - group_by() groups the data by a specific variable (Major in this case). - summarise() calculates the mean score for each group.
Exercises and Solutions
Exercise 1: Selecting and Filtering Data
- Using the
students_dfdata frame, select only theNameandMajorcolumns. - Filter the data to include only students who are majoring in βEconomicsβ.
Solution:
# Step 1: Selecting Name and Major columns
selected_df <- students_df |>
select(Name, Major)
# Step 2: Filtering students majoring in Economics
economics_students_df <- selected_df |>
filter(Major == "Economics")
economics_students_df Name Major
1 Alice Economics
Exercise 2: Arranging and Mutating Data
- Arrange the
students_dfdata frame byAgein ascending order. - Add a new column called
AgeGroupthat categorizes students as βYoungβ (Age <= 20) or βMatureβ (Age > 20).
Solution:
# Step 1: Arranging by Age
arranged_df <- students_df |>
arrange(Age)
# Step 2: Adding AgeGroup column
students_df <- students_df |>
mutate(AgeGroup = ifelse(Age <= 20, "Young", "Mature"))Exercise 3: Summarizing Data by Group
- Group the
students_dfdata frame byMajor. - Calculate the total number of students and the average score for each
Major.
Solution:
# Step 1: Grouping by Major
grouped_df <- students_df |>
group_by(Major)
# Step 2: Summarising total students and average score
summary_df <- grouped_df |>
summarise(
TotalStudents = n(),
AverageScore = mean(Score)
)Exercise 4: Combining dplyr Functions
- Using
students_df, filter for students with a score greater than 80, then select theirNameandScore. - Arrange the result by
Scorein descending order.
Solution:
# Combining filter, select, and arrange
result_df <- students_df |>
filter(Score > 80) |>
select(Name, Score) |>
arrange(desc(Score))Exercise 5: Applying Multiple Transformations
- Create a new data frame by selecting
Name,Score, andMajorfromstudents_df. - Filter out students with a score less than 80.
- Add a column
Passindicating whether the student passed (Score >= 85). - Arrange the result by
MajorandScore.
Solution:
# Applying multiple transformations
final_df <- students_df |>
select(Name, Score, Major) |>
filter(Score >= 80) |>
mutate(Pass = Score >= 85) |>
arrange(Major, desc(Score))