Guide to Tidyverse Functions
Introduction
The Tidyverse is a collection of R packages designed for data science, offering tools for data manipulation, exploration, and visualization. Its functions are well documented: typing `?function_name` or `help(function_name)` in the R console opens the help page for that function. For instance, `?mutate` brings up the documentation for the `mutate` function.
Tidyverse Functions
Function | Example Usage | Description |
---|---|---|
`filter` | `filter(df, condition)` | Keeps the rows of a data frame for which a condition is true. |
`select` | `select(df, column1, column2)` | Selects specific columns from a data frame. |
`mutate` | `mutate(df, new_column = expression)` | Adds new columns or overwrites existing ones with the result of an expression. |
`arrange` | `arrange(df, column)` | Reorders the rows of a data frame by one or more columns (ascending by default; wrap a column in `desc()` to reverse). |
`summarize` | `summarize(df, new_column = mean(column))` | Collapses each group (or the whole data frame, if ungrouped) to a single row of summary values such as a mean or sum. |
`group_by` | `group_by(df, column)` | Groups the data by one or more variables so that later verbs operate per group. |
`left_join` | `left_join(df1, df2, by = "column")` | Joins two data frames by a common column, keeping all rows from the first data frame. |
`right_join` | `right_join(df1, df2, by = "column")` | Joins two data frames by a common column, keeping all rows from the second data frame. |
`inner_join` | `inner_join(df1, df2, by = "column")` | Joins two data frames by a common column, keeping only rows with matches in both data frames. |
`full_join` | `full_join(df1, df2, by = "column")` | Joins two data frames by a common column, keeping all rows from both data frames. |
`anti_join` | `anti_join(df1, df2, by = "column")` | Returns rows from the first data frame that have no match in the second. No columns from the second data frame are added. |
`semi_join` | `semi_join(df1, df2, by = "column")` | Returns rows from the first data frame that have a match in the second. No columns from the second data frame are added. |
`bind_rows` | `bind_rows(df1, df2)` | Stacks data frames on top of each other, matching columns by name. |
`bind_cols` | `bind_cols(df1, df2)` | Places data frames side by side, matching rows by position. |
`pivot_longer` | `pivot_longer(df, cols, names_to, values_to)` | Converts data from wide to long format. |
`pivot_wider` | `pivot_wider(df, names_from, values_from)` | Converts data from long to wide format. |
`rename` | `rename(df, new_name = old_name)` | Renames columns in a data frame. |
`rename_with` | `rename_with(df, .fn, .cols = everything())` | Renames columns by applying a function to their names. |
`distinct` | `distinct(df, column)` | Returns the distinct combinations of one or more columns. Only those columns are kept unless `.keep_all = TRUE`. |
`slice` | `slice(df, 1:5)` | Selects rows by position. |
`slice_head` | `slice_head(df, n = 5)` | Selects the first n rows of a data frame. |
`slice_tail` | `slice_tail(df, n = 5)` | Selects the last n rows of a data frame. |
`slice_sample` | `slice_sample(df, n = 10)` | Randomly selects n rows from a data frame. |
`slice_min` | `slice_min(df, column, n = 1)` | Selects the n rows with the smallest values of a column (ties are kept by default). |
`slice_max` | `slice_max(df, column, n = 1)` | Selects the n rows with the largest values of a column (ties are kept by default). |
`pull` | `pull(df, column)` | Extracts a single column from a data frame as a vector. |
`count` | `count(df, column)` | Counts the number of rows for each unique value of a column. |
`n` | `summarize(df, count = n())` | Returns the number of rows in the current group. Takes no arguments and is used inside verbs such as `summarize` and `mutate`. |
`n_distinct` | `summarize(df, unique_count = n_distinct(column))` | Counts the number of distinct values in a column. |
`if_else` | `mutate(df, new_column = if_else(condition, true, false))` | Vectorized if-else: returns `true` where the condition holds and `false` elsewhere. Both branches must have compatible types. |
`case_when` | `mutate(df, new_column = case_when(condition1 ~ result1, condition2 ~ result2))` | Generalizes `if_else` to several conditions. Each element takes the result of the first condition it satisfies. |
`coalesce` | `mutate(df, new_column = coalesce(column1, column2))` | Returns, for each row, the first non-missing value across a set of columns. |
`between` | `filter(df, between(column, lower, upper))` | Tests whether a value lies within an inclusive range; shown here inside `filter` to keep the matching rows. |
`all_of` | `select(df, all_of(c("column1", "column2")))` | Selects columns by name from a character vector, raising an error if any name is missing. |
`any_of` | `select(df, any_of(c("column1", "column2")))` | Selects columns by name from a character vector, ignoring names that do not exist. |
`everything` | `select(df, everything())` | Selects all columns in a data frame. |
`relocate` | `relocate(df, column1, .before = column2)` | Changes the order of columns in a data frame. |
`separate` | `separate(df, column, into = c("col1", "col2"), sep = "-")` | Splits a single column into multiple columns based on a separator. Superseded in tidyr 1.3.0 by `separate_wider_delim` and related functions. |
`unite` | `unite(df, new_column, col1, col2, sep = "-")` | Combines multiple columns into a single column. |
`crossing` | `crossing(df1, df2)` | Creates a data frame with all combinations of rows from the inputs, de-duplicated and sorted. |
`expand_grid` | `expand_grid(col1 = 1:3, col2 = letters[1:3])` | Creates a data frame with all combinations of values in the input vectors. |
`nest` | `nest(df, data = c(column1, column2))` | Packs the selected columns into a list-column of data frames, with one row per combination of the remaining columns. |
`unnest` | `unnest(df, data)` | Expands a list-column back into regular rows and columns. |
`map` | `map(.x, .f)` | Applies a function to each element of a list or vector and returns a list. |
`map_df` | `map_df(.x, .f)` | Applies a function to each element of a list or vector and returns the results as a data frame. Superseded in purrr 1.0.0; `map()` followed by `list_rbind()` is the current equivalent. |
`reduce` | `reduce(.x, .f)` | Combines the elements of a list into a single value using a binary function (e.g., `+`, `union`). |
`accumulate` | `accumulate(.x, .f)` | Sequentially applies a binary function to the elements of a vector, returning all intermediate results. |
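As a rough illustration of how several of the functions above combine, the sketch below reshapes, summarizes, and joins two small tables. The `sales` and `managers` data frames and all of their values are invented for this example, and the native pipe `|>` assumes R 4.1 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data, invented for this illustration.
sales <- tibble(
  region = c("north", "north", "south", "south", "east"),
  q1     = c(10, 20, 5, 15, 30),
  q2     = c(12, 18, 7, NA, 25)
)
managers <- tibble(
  region  = c("north", "south", "west"),
  manager = c("Ana", "Ben", "Cy")
)

# Wide to long: one row per region/quarter observation.
long <- sales |>
  pivot_longer(cols = c(q1, q2), names_to = "quarter", values_to = "units")

# Core verbs: drop missing values, group, summarize, then sort.
totals <- long |>
  filter(!is.na(units)) |>
  group_by(region) |>
  summarize(total = sum(units), n_obs = n(), .groups = "drop") |>
  arrange(desc(total))
# totals$region is "north", "east", "south" with totals 60, 55, 27

# left_join keeps every row of `totals`; "east" has no manager, so it gets NA.
with_mgr <- left_join(totals, managers, by = "region")

# anti_join returns the managers whose region never appears in `totals`.
unused <- anti_join(managers, totals, by = "region")

# purrr: reduce folds a vector to one value; accumulate keeps every step.
grand   <- reduce(c(1, 2, 3, 4), `+`)      # 10
running <- accumulate(c(1, 2, 3, 4), `+`)  # 1 3 6 10
```

Each verb takes a data frame as its first argument and returns a new one, which is why the steps chain cleanly through the pipe.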
ggplot2 Components
`ggplot2` is a flexible package for creating data visualizations in R. It is part of the Tidyverse and is based on the grammar of graphics, which lets you build complex plots layer by layer. The core function in `ggplot2` is `ggplot()`, which initializes a plot object that can be built upon using components such as geoms, scales, aesthetics, and labels.
To learn more about each function, you can use `?function_name` or `help(function_name)` in the R console.
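The layer-by-layer idea can be sketched with the `mtcars` data set that ships with R. The choice of variables, the title, and the axis labels are only illustrative assumptions, not part of the original guide.

```r
library(ggplot2)

# mtcars is a built-in data set; wt is weight, mpg is fuel economy.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                    # data + aesthetic mappings
  geom_point(aes(color = factor(cyl))) +                       # layer 1: scatter points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # layer 2: linear fit
  labs(
    title = "Fuel economy vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  ) +
  theme_minimal()                                              # overall look
```

Each `+` adds one component to the plot object. Nothing is drawn until `p` is printed, for example by typing `p` at the console or calling `print(p)`.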
Component | Example Usage | Description |
---|---|---|
`ggplot` | `ggplot(data, aes(x = var1, y = var2))` | Initializes a ggplot object with data and aesthetic mappings. |
`aes` | `aes(x = var1, y = var2, color = var3)` | Specifies the aesthetic mappings from variables to visual properties (e.g., x and y position, color, size). |
`geom_point` | `geom_point()` | Adds a scatter plot layer to the plot. |
`geom_line` | `geom_line()` | Adds a line plot layer to the plot. |
`geom_bar` | `geom_bar(stat = "identity")` | Adds a bar plot layer to the plot. By default bar heights are row counts; `stat = "identity"` uses the y values in the data instead. |
`geom_histogram` | `geom_histogram(binwidth = 1)` | Adds a histogram layer to the plot, with a specified bin width. |
`geom_boxplot` | `geom_boxplot()` | Adds a boxplot layer to the plot. |
`geom_smooth` | `geom_smooth(method = "lm")` | Adds a smoothed line (e.g., regression line) to the plot. The `method = "lm"` argument fits a linear model. |
`geom_col` | `geom_col()` | Adds a bar plot layer whose bar heights come from the y values in the data; equivalent to `geom_bar(stat = "identity")`. |
`geom_density` | `geom_density()` | Adds a density plot layer to the plot, showing the distribution of a continuous variable. |
`geom_jitter` | `geom_jitter(width = 0.1, height = 0.1)` | Adds a scatter plot layer with small random noise added to each point, which reduces overplotting. |
`scale_x_continuous` | `scale_x_continuous(limits = c(0, 100))` | Adjusts the x-axis scale for continuous data. Values outside the specified limits are dropped before plotting. |
`scale_y_continuous` | `scale_y_continuous(limits = c(0, 100))` | Adjusts the y-axis scale for continuous data. Values outside the specified limits are dropped before plotting. |
`scale_fill_manual` | `scale_fill_manual(values = c("red", "blue"))` | Manually sets the fill colors for a plot, typically used with discrete variables. |
`scale_color_manual` | `scale_color_manual(values = c("red", "blue"))` | Manually sets the colors for lines, points, etc., typically used with discrete variables. |
`scale_x_date` | `scale_x_date(date_labels = "%b %Y")` | Adjusts the x-axis scale for date data, with custom date labels. |
`scale_x_log10` | `scale_x_log10()` | Sets the x-axis to a base-10 logarithmic scale. |
`labs` | `labs(title = "Title", x = "X-axis label", y = "Y-axis label")` | Adds or customizes titles and axis labels for the plot. |
`theme` | `theme(axis.text.x = element_text(angle = 45))` | Customizes the non-data appearance of the plot, including text, axis labels, legends, and more. |
`theme_minimal` | `theme_minimal()` | Applies a minimal theme with no background annotations. |
`theme_classic` | `theme_classic()` | Applies a classic theme with axis lines and no gridlines, similar to base R plots. |
`facet_wrap` | `facet_wrap(~ variable)` | Splits the plot into panels, one per value of a variable, wrapped into rows and columns. |
`facet_grid` | `facet_grid(rows ~ cols)` | Creates a grid of panels, with rows and columns determined by two different variables. |
`coord_flip` | `coord_flip()` | Flips the x and y axes, useful for horizontal bar plots. |
`coord_cartesian` | `coord_cartesian(xlim = c(0, 100), ylim = c(0, 50))` | Zooms in on a specific area of the plot by setting limits on the x and y axes, without dropping any data. |
`coord_polar` | `coord_polar(theta = "x")` | Converts a plot into polar coordinates, useful for creating pie charts or circular bar plots. |
`annotate` | `annotate("text", x = 5, y = 10, label = "Note")` | Adds annotations such as text, shapes, or arrows to the plot. |
`ggsave` | `ggsave("plot.png", plot = last_plot())` | Saves a plot to a file (the last plot displayed, by default), with options for file type, size, and resolution. |
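A few of the scale, facet, coordinate, and saving components above might be combined as in the following sketch. It again uses the built-in `mtcars` data; the color choices, legend labels, and output file name are assumptions made for illustration. The file is written to a temporary directory as a PDF, since `ggsave` picks the output format from the file extension.

```r
library(ggplot2)

# In mtcars, am is the transmission type: 0 = automatic, 1 = manual.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  scale_color_manual(
    values = c("0" = "red", "1" = "blue"),   # one color per level of am
    labels = c("automatic", "manual")
  ) +
  facet_wrap(~ cyl) +                        # one panel per cylinder count
  coord_cartesian(ylim = c(10, 35)) +        # zoom the y-axis without dropping data
  labs(color = "Transmission") +
  theme_classic()

# Hypothetical output path; the .pdf extension selects the PDF device.
out_file <- file.path(tempdir(), "mpg_by_weight.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Passing `plot = p` explicitly avoids relying on `last_plot()`, which only tracks plots that have already been displayed.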
Conclusion
The Tidyverse provides a comprehensive set of functions for data manipulation and visualization, making it easier to work with data in R. This guide covers many of the core functions you’ll use in your data analysis tasks. Always refer to the documentation to explore additional arguments and use cases for each function.