Tutorial 3: Introduction to Tidyverse and the Base Pipe
In this tutorial, we will introduce the Tidyverse, a collection of R packages designed for data science. We will also explore the base pipe (|>), which allows for cleaner and more readable code by chaining operations. This tutorial will guide you through essential data manipulation functions using the dplyr package within the Tidyverse, with examples focused on education data.
3.1 Introduction to Tidyverse
The Tidyverse is a suite of R packages that work together to simplify data manipulation, exploration, and visualization. Key packages in the Tidyverse include dplyr, ggplot2, tidyr, readr, and more. In this tutorial, we'll focus on dplyr, which is used for data manipulation.
3.1.1 Installing and Loading Tidyverse
Before using Tidyverse functions, you need to install and load the package.
# Install Tidyverse (if not already installed)
install.packages("tidyverse")
# Load Tidyverse
library(tidyverse)
Explanation:
- The install.packages() function downloads and installs the tidyverse package. You only need to run it once per R installation, not in every session.
- The library() function loads the Tidyverse, making its functions available for use in the current session.
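Because install.packages() reinstalls the package every time it runs, scripts often guard the call. The following is a minimal sketch of one common pattern, not part of the original tutorial; it assumes an internet connection and a configured CRAN mirror when the package is missing.

```r
# Install the tidyverse only if it is not already available
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load it for the current session
library(tidyverse)
```

requireNamespace() returns FALSE instead of raising an error when a package is missing, which is what makes it suitable for this kind of check.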
3.1.2 Introduction to the Base Pipe (|>)
The base pipe operator |> was introduced in R 4.1.0. It allows for cleaner, more readable code by enabling a sequence of operations to be chained together.
# Example using base pipe to calculate the mean of a vector
scores <- c(85, 90, 78, 92, 88)
mean_score <- scores |> mean()
mean_score
[1] 86.6
Explanation:
- The |> operator takes the value of the left-hand expression and passes it as the first argument to the function call on the right-hand side.
- In this example, scores is passed to the mean() function to calculate the mean score, so scores |> mean() is equivalent to mean(scores).
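To make the "first argument" rule concrete, here is a small sketch that compares a nested call with its piped equivalent. It uses only base R (version 4.1.0 or later); the variable names nested, piped, and trimmed are illustrative and do not appear elsewhere in the tutorial.

```r
scores <- c(85, 90, 78, 92, 88)

# Nested form: read from the inside out
nested <- round(mean(scores), 1)

# Piped form: read left to right; each result becomes the first
# argument of the next call
piped <- scores |> mean() |> round(1)

# Extra arguments stay inside the call; the piped value still goes first,
# so this is the same as mean(scores, trim = 0.2)
trimmed <- scores |> mean(trim = 0.2)

# Both forms produce the same value
identical(nested, piped)
```

The piped form becomes more useful as the number of steps grows, because the order of the code matches the order in which the operations happen.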
3.2 dplyr Basics
dplyr is the main package within the Tidyverse for data manipulation. It provides a set of intuitive functions for working with data frames.
3.2.1 select(): Selecting Columns
The select() function allows you to choose specific columns from a data frame.
# Example data frame
students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(20, 21, 19),
  Score = c(85, 90, 78),
  Major = c("Economics", "History", "Biology")
)
# Selecting Name and Score columns
selected_df <- students_df |> select(Name, Score)
selected_df
     Name Score
1   Alice    85
2     Bob    90
3 Charlie    78
Explanation:
- select() is used to pick specific columns from the data frame.
- The base pipe |> passes students_df to the select() function.
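Besides naming the columns to keep, select() also accepts a minus sign to drop columns and selection helpers such as starts_with(). The sketch below illustrates both; demo_df is a hypothetical copy of the example data, used here so that students_df is left untouched.

```r
library(dplyr)

# Hypothetical copy of the tutorial's example data
demo_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(20, 21, 19),
  Score = c(85, 90, 78),
  Major = c("Economics", "History", "Biology")
)

# Drop one column by prefixing it with a minus sign
without_age <- demo_df |> select(-Age)

# Keep only the columns whose names start with "S"
s_columns <- demo_df |> select(starts_with("S"))

names(without_age)
names(s_columns)
```

Dropping columns is often shorter than listing every column to keep when a data frame is wide.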
3.2.2 filter(): Filtering Rows
The filter() function allows you to filter rows based on specific conditions.
# Filtering students with a score greater than 80
filtered_df <- students_df |> filter(Score > 80)
filtered_df
   Name Age Score     Major
1 Alice  20    85 Economics
2   Bob  21    90   History
Explanation:
- filter() is used to select rows that meet a condition. Here, only students with a score greater than 80 are included.
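filter() can also combine several conditions. As a brief sketch: conditions separated by commas must all be true, and %in% tests membership in a set of values. demo_df is again a hypothetical copy of the example data.

```r
library(dplyr)

# Hypothetical copy of the tutorial's example data
demo_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(20, 21, 19),
  Score = c(85, 90, 78),
  Major = c("Economics", "History", "Biology")
)

# Comma-separated conditions are combined with AND
older_high <- demo_df |> filter(Score > 80, Age >= 21)

# %in% keeps rows whose Major matches any value in the vector
econ_or_bio <- demo_df |> filter(Major %in% c("Economics", "Biology"))

older_high$Name   # "Bob"
econ_or_bio$Name  # "Alice" "Charlie"
```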
3.2.3 arrange(): Arranging Rows
The arrange() function orders the rows of a data frame based on the values of specified columns.
# Arranging students by score in descending order
arranged_df <- students_df |> arrange(desc(Score))
arranged_df
     Name Age Score     Major
1     Bob  21    90   History
2   Alice  20    85 Economics
3 Charlie  19    78   Biology
Explanation:
- arrange() orders the rows based on the specified column. desc() is used to sort in descending order; without it, arrange() sorts in ascending order.
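When arrange() receives more than one column, later columns break ties in earlier ones. The sketch below shows this with a hypothetical five-row data frame (demo_df, including the invented students Dana and Eli), because the tutorial's three-row example has no repeated majors.

```r
library(dplyr)

# Hypothetical data with repeated majors so the second sort key matters
demo_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Dana", "Eli"),
  Score = c(85, 90, 78, 92, 88),
  Major = c("Economics", "History", "Biology", "Economics", "History")
)

# Sort by Major (A to Z), then by Score from highest to lowest within each major
sorted_df <- demo_df |> arrange(Major, desc(Score))

sorted_df$Name  # "Charlie" "Dana" "Alice" "Bob" "Eli"
```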
3.2.4 mutate(): Creating New Variables
The mutate() function creates new variables or modifies existing ones within a data frame.
# Adding a new column for grade based on score
students_df <- students_df |> mutate(Grade = ifelse(Score >= 85, "A", "B"))
students_df
     Name Age Score     Major Grade
1   Alice  20    85 Economics     A
2     Bob  21    90   History     A
3 Charlie  19    78   Biology     B
Explanation:
- mutate() adds a new column Grade, where students with a score of 85 or higher receive an "A" grade, and others receive a "B".
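ifelse() handles two outcomes; for three or more, dplyr's case_when() is a common alternative. The following sketch assumes hypothetical grade cut-offs (90 and 80) that are not part of the tutorial, and works on a hypothetical copy of the data named demo_df.

```r
library(dplyr)

# Hypothetical copy of the tutorial's example data
demo_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Score = c(85, 90, 78)
)

# Conditions are checked in order; the first one that is TRUE wins.
# The cut-offs below are illustrative assumptions.
graded_df <- demo_df |>
  mutate(
    Grade = case_when(
      Score >= 90 ~ "A",
      Score >= 80 ~ "B",
      TRUE ~ "C"  # fallback for all remaining rows
    )
  )

graded_df$Grade  # "B" "A" "C"
```

Because the conditions are evaluated top to bottom, the most restrictive threshold should come first.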
3.2.5 summarise() and group_by(): Summarizing Data
The summarise() function, used in conjunction with group_by(), allows you to compute summary statistics for groups of data.
# Grouping by Major and calculating the average score for each group
summary_df <- students_df |>
  group_by(Major) |>
  summarise(AverageScore = mean(Score))
summary_df
# A tibble: 3 × 2
  Major     AverageScore
  <chr>            <dbl>
1 Biology             78
2 Economics           85
3 History             90
Explanation:
- group_by() groups the data by a specific variable (Major in this case).
- summarise() calculates the mean score for each group. Because each major has only one student here, each group mean equals that student's score.
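Grouping is easier to see when a group contains more than one row. The sketch below uses a hypothetical five-row data frame (demo_df, with the invented students Dana and Eli) and computes a count and a mean per major in a single summarise() call.

```r
library(dplyr)

# Hypothetical data in which two majors have two students each
demo_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Dana", "Eli"),
  Score = c(85, 90, 78, 92, 88),
  Major = c("Economics", "History", "Biology", "Economics", "History")
)

major_summary <- demo_df |>
  group_by(Major) |>
  summarise(
    Students = n(),              # number of rows in each group
    AverageScore = mean(Score)   # mean of Score within each group
  )

major_summary
```

The result has one row per major, and AverageScore for Economics and History is now an average over two students rather than a single score.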
Exercises and Solutions
Exercise 1: Selecting and Filtering Data
- Using the students_df data frame, select only the Name and Major columns.
- Filter the data to include only students who are majoring in "Economics".
Solution:
# Step 1: Selecting Name and Major columns
selected_df <- students_df |>
  select(Name, Major)
# Step 2: Filtering students majoring in Economics
economics_students_df <- selected_df |>
  filter(Major == "Economics")
economics_students_df
   Name     Major
1 Alice Economics
Exercise 2: Arranging and Mutating Data
- Arrange the students_df data frame by Age in ascending order.
- Add a new column called AgeGroup that categorizes students as "Young" (Age <= 20) or "Mature" (Age > 20).
Solution:
# Step 1: Arranging by Age
arranged_df <- students_df |>
  arrange(Age)
# Step 2: Adding AgeGroup column
students_df <- students_df |>
  mutate(AgeGroup = ifelse(Age <= 20, "Young", "Mature"))
Exercise 3: Summarizing Data by Group
- Group the students_df data frame by Major.
- Calculate the total number of students and the average score for each Major.
Solution:
# Step 1: Grouping by Major
grouped_df <- students_df |>
  group_by(Major)
# Step 2: Summarising total students and average score
summary_df <- grouped_df |>
  summarise(
    TotalStudents = n(),
    AverageScore = mean(Score)
  )
Exercise 4: Combining dplyr Functions
- Using students_df, filter for students with a score greater than 80, then select their Name and Score.
- Arrange the result by Score in descending order.
Solution:
# Combining filter, select, and arrange
result_df <- students_df |>
  filter(Score > 80) |>
  select(Name, Score) |>
  arrange(desc(Score))
Exercise 5: Applying Multiple Transformations
- Create a new data frame by selecting Name, Score, and Major from students_df.
- Filter out students with a score less than 80.
- Add a column Pass indicating whether the student passed (Score >= 85).
- Arrange the result by Major and Score.
Solution:
# Applying multiple transformations
final_df <- students_df |>
  select(Name, Score, Major) |>
  filter(Score >= 80) |>
  mutate(Pass = Score >= 85) |>
  arrange(Major, desc(Score))