Guide to Tidyverse Functions

Author

Rony Rodriguez-Ramirez

Introduction

The Tidyverse is a collection of R packages designed for data science, offering powerful tools for data manipulation, exploration, and visualization. The functions within the Tidyverse are well-documented, and you can easily access detailed information by using ?function_name in the R console, or help(function_name). For instance, typing ?mutate will bring up the documentation for the mutate function.

Tidyverse Functions

Function Example Usage Description
filter filter(df, condition) Subsets rows in a data frame based on a condition.
select select(df, column1, column2) Selects specific columns from a data frame.
mutate mutate(df, new_column = expression) Adds new variables or transforms existing ones based on an expression.
arrange arrange(df, column) Reorders rows of a data frame according to one or more columns.
summarize summarize(df, new_column = summary_func) Reduces multiple values down to a single summary (e.g., mean, sum) for each group.
group_by group_by(df, column) Groups the data by one or more variables to perform summary functions or operations by group.
left_join left_join(df1, df2, by = "column") Joins two data frames by a common column, keeping all rows from the first data frame.
right_join right_join(df1, df2, by = "column") Joins two data frames by a common column, keeping all rows from the second data frame.
inner_join inner_join(df1, df2, by = "column") Joins two data frames by a common column, keeping only rows with matches in both data frames.
full_join full_join(df1, df2, by = "column") Joins two data frames by a common column, keeping all rows from both data frames.
anti_join anti_join(df1, df2, by = "column") Returns rows from the first data frame that do not have matches in the second data frame.
semi_join semi_join(df1, df2, by = "column") Returns rows from the first data frame that have matches in the second data frame.
bind_rows bind_rows(df1, df2) Combines multiple data frames by rows.
bind_cols bind_cols(df1, df2) Combines multiple data frames by columns.
pivot_longer pivot_longer(df, cols, names_to, values_to) Converts data from wide to long format.
pivot_wider pivot_wider(df, names_from, values_from) Converts data from long to wide format.
rename rename(df, new_name = old_name) Renames columns in a data frame.
rename_with rename_with(df, .fn, .cols = everything()) Renames columns using a function.
distinct distinct(df, column) Returns distinct rows of a data frame based on one or more columns.
slice slice(df, n) Selects rows by position.
slice_head slice_head(df, n = 5) Selects the first n rows of a data frame.
slice_tail slice_tail(df, n = 5) Selects the last n rows of a data frame.
slice_sample slice_sample(df, n = 10) Randomly selects n rows from a data frame.
slice_min slice_min(df, column) Selects the rows with the minimum value of a column.
slice_max slice_max(df, column) Selects the rows with the maximum value of a column.
pull pull(df, column) Extracts a single column from a data frame as a vector.
count count(df, column) Counts the number of occurrences for each unique value of a column.
n summarize(df, count = n()) Returns the number of rows in each group.
n_distinct summarize(df, unique_count = n_distinct(column)) Counts the number of distinct values in a column.
if_else mutate(df, new_column = if_else(condition, true, false)) Vectorized conditional operation.
case_when mutate(df, new_column = case_when(condition1 ~ result1, condition2 ~ result2)) Multiple conditional operations, similar to if_else, but with more flexibility.
coalesce mutate(df, new_column = coalesce(column1, column2)) Returns the first non-missing value in a set of columns.
between filter(df, between(column, lower, upper)) Filters rows where a column value falls between two boundaries.
all_of select(df, all_of(c("column1", "column2"))) Selects columns by name from a vector of names.
any_of select(df, any_of(c("column1", "column2"))) Selects columns by name from a vector of names, ignoring missing columns.
everything select(df, everything()) Selects all columns in a data frame.
relocate relocate(df, column1, .before = column2) Changes the order of columns in a data frame.
separate separate(df, column, into = c("col1", "col2"), sep = "-") Splits a single column into multiple columns based on a separator.
unite unite(df, new_column, col1, col2, sep = "-") Combines multiple columns into a single column.
crossing crossing(df1, df2) Creates a data frame with all combinations of rows from the input data frames.
expand_grid expand_grid(col1 = 1:3, col2 = letters[1:3]) Creates a data frame with all combinations of values in the input vectors.
nest nest(df, data = c(column1, column2)) Converts a data frame into a nested data frame, where columns are combined into a list-column.
unnest unnest(df, data) Expands a list-column back into regular columns.
map map(vector, function) Applies a function to each element of a list or vector.
map_df map_df(vector, function) Applies a function to each element of a list or vector and returns the results as a data frame.
reduce reduce(list, function) Combines elements of a list using a binary function (e.g., sum, union).
accumulate accumulate(vector, function) Sequentially applies a binary function to the elements of a vector, returning all intermediate results.

ggplot2 Components

ggplot2 is a powerful and flexible package for creating data visualizations in R. It is part of the Tidyverse and is based on the grammar of graphics, which allows you to build complex plots layer by layer. The core function in ggplot2 is ggplot(), which initializes a plot object that can be built upon using various components like geoms, scales, aesthetics, and labels.

To learn more about each function, you can use ?function_name or help(function_name) in the R console.

Component Example Usage Description
ggplot ggplot(data, aes(x = var1, y = var2)) Initializes a ggplot object with data and mappings for aesthetics.
aes aes(x = var1, y = var2, color = var3) Specifies the aesthetic mappings (e.g., x and y axes, colors, sizes).
geom_point geom_point() Adds a scatter plot layer to the plot.
geom_line geom_line() Adds a line plot layer to the plot.
geom_bar geom_bar(stat = "identity") Adds a bar plot layer to the plot. The stat = "identity" argument tells ggplot to use actual values.
geom_histogram geom_histogram(binwidth = 1) Adds a histogram layer to the plot, with a specified bin width.
geom_boxplot geom_boxplot() Adds a boxplot layer to the plot.
geom_smooth geom_smooth(method = "lm") Adds a smoothed line (e.g., regression line) to the plot. The method = "lm" argument fits a linear model.
geom_col geom_col() Adds a bar plot layer to the plot, similar to geom_bar() but without a default stat.
geom_density geom_density() Adds a density plot layer to the plot, showing the distribution of a continuous variable.
geom_jitter geom_jitter(width = 0.1, height = 0.1) Adds a jittered scatter plot layer, useful for avoiding overplotting by adding small random noise.
scale_x_continuous scale_x_continuous(limits = c(0, 100)) Adjusts the x-axis scale for continuous data, with specified limits.
scale_y_continuous scale_y_continuous(limits = c(0, 100)) Adjusts the y-axis scale for continuous data, with specified limits.
scale_fill_manual scale_fill_manual(values = c("red", "blue")) Manually sets the fill colors for a plot, typically used with discrete variables.
scale_color_manual scale_color_manual(values = c("red", "blue")) Manually sets the colors for lines, points, etc., typically used with discrete variables.
scale_x_date scale_x_date(date_labels = "%b %Y") Adjusts the x-axis scale for date data, with custom date labels.
scale_x_log10 scale_x_log10() Sets the x-axis scale to a logarithmic scale (base 10).
labs labs(title = "Title", x = "X-axis label", y = "Y-axis label") Adds or customizes titles and axis labels for the plot.
theme theme(axis.text.x = element_text(angle = 45)) Customizes the appearance of the plot, including text, axis labels, legends, and more.
theme_minimal theme_minimal() Applies a minimalistic theme to the plot, with fewer gridlines and a clean look.
theme_classic theme_classic() Applies a classic theme with a simple, clean appearance, similar to base R plots.
facet_wrap facet_wrap(~ variable) Creates a grid of plots, with each plot representing a subset of the data based on a variable.
facet_grid facet_grid(rows ~ cols) Creates a grid of plots, with rows and columns determined by two different variables.
coord_flip coord_flip() Flips the x and y axes, useful for horizontal bar plots.
coord_cartesian coord_cartesian(xlim = c(0, 100), ylim = c(0, 50)) Zooms in on a specific area of the plot by setting limits on the x and y axes.
coord_polar coord_polar(theta = "x") Converts a plot into polar coordinates, useful for creating pie charts or circular bar plots.
annotate annotate("text", x = 5, y = 10, label = "Note") Adds annotations such as text, shapes, or arrows to the plot.
ggsave ggsave("plot.png", plot = last_plot()) Saves the last plot created to a file, with options for file type, size, and resolution.

Conclusion

The Tidyverse provides a comprehensive set of functions for data manipulation, making it easier to work with data in R. This guide covers many of the core functions you’ll use in your data analysis tasks. Always refer to the documentation to explore additional arguments and use cases for each function.