Week 02
Tidy Data and Visualization

API209: Summer Math Camp

Rony Rodrigo Maximiliano Rodriguez-Ramirez

Harvard University

August 20, 2024

Recap and Tidy Data

Wrangling your data {Recap}

  • You are highly encouraged to read through Hadley Wickham’s chapter. It’s clear and concise.

  • Also check out this great “cheatsheet” here.

  • The dplyr package is organized around a set of verbs, i.e., actions to be taken.

  • All verbs work as follows:

\[\text{verb}(\underbrace{\text{data.frame}}_{\text{1st argument}}, \underbrace{\text{what to do}}_\text{2nd argument})\]

  • Alternatively you can (should) use the pipe operator %>%:

\[\underbrace{\text{data.frame}}_{\text{1st argument}} \underbrace{\text{ %>% }}_{\text{"pipe" operator}} \text{verb}(\underbrace{\text{what to do}}_\text{2nd argument})\]
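The two call styles above can be sketched as follows. This is a minimal illustration, assuming the dplyr package and R's built-in mtcars data; the choice of `filter()` and of the `cyl == 4` condition is only an example.

```r
library(dplyr)

# Style 1: the data frame is the first argument of the verb
four_cyl_nested <- filter(mtcars, cyl == 4)

# Style 2: the pipe passes the data frame in as the first argument
four_cyl_piped <- mtcars %>% filter(cyl == 4)

# Both calls select the same rows
identical(four_cyl_nested, four_cyl_piped)
```

The piped form reads left to right, which helps once several verbs are chained.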

Tidy data

  • In most cases, your datasets won’t be tidy.

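Tidy data: A dataset is said to be tidy if it satisfies the following conditions:

  • Each variable is in its own column.
  • Each observation is in its own row.
  • Each value is in its own cell.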

Untidy data is pretty common

However, storing data in wide form is easier to display in a printed table.

Tidy data is data in long format.
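A minimal sketch of moving from wide to long form, assuming the tidyr package. The table below is made up for illustration (hypothetical countries "A" and "B" with invented values); it is not data from the course.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B"),
  `2002` = c(10, 20),
  `2007` = c(11, 22),
  check.names = FALSE  # keep the year column names as they are
)

# Long (tidy) form: one row per country-year observation
long <- pivot_longer(
  wide,
  cols = c(`2002`, `2007`),
  names_to = "year",
  values_to = "value"
)

long
```

The wide table is easier to print; the long table is what verbs and ggplot expect.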

Beautiful visualizations

What makes a great visualization?

Truthful

Functional

Beautiful

Insightful

Enlightening

Alberto Cairo, The Truthful Art

How do we express visuals in words?

  • Data to be visualized

  • Geometric objects that appear on the plot

  • Aesthetic mappings from data to visual component

  • Statistics transform data on the way to visualization

  • Coordinates organize location of geometric objects

  • Scales define the range of values for aesthetics

  • Facets group into subplots

What makes a great visualization?

Good aesthetics

No substantive issues

No perceptual issues

Honesty + good judgment

Kieran Healy, Data Visualization: A Practical Introduction

You see bad plots everywhere: What’s wrong?

Is this right?

Entering ggplot

ggplot

For this session, you’ll use the ggplot2 package from the tidyverse meta-package.

  • So, you can just load the tidyverse package when using ggplot.

  1. Consistency with the Grammar of Graphics
    • The Grammar of Graphics book is the foundation of several data visualization tools: ggplot2, Polaris (Tableau), Vega-Lite
  2. Flexibility
  3. Layering and theme customization
  4. Community

It is a powerful, easy-to-use tool (once you understand its logic) that produces complex and multifaceted plots.

ggplot2: basic structure (template)

The basic ggplot structure is:

ggplot(data = DATA) +
  GEOM_FUNCTION(mapping = aes(AESTHETIC MAPPINGS))

Mapping data to aesthetics

Think about colors, sizes, x and y references

We are going to learn how we connect our data to the components of a ggplot.

I usually code like this:

DATA |> 
  ggplot(aes(AESTHETIC MAPPINGS)) +
  GEOM_FUNCTION()
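Both templates can be filled in the same way. A minimal sketch, assuming the ggplot2 package and R's built-in mtcars data (the variables wt and mpg are chosen only for illustration):

```r
library(ggplot2)

# Template style: data and mapping are supplied explicitly
p_template <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))

# Piped style: data flows in, and the mapping is set once in ggplot()
p_piped <- mtcars |>
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

# Printing either object draws the same scatter plot
p_piped
```

Setting the mapping in ggplot() makes it the default for every layer added afterwards, which saves repetition when a plot has several geoms.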

Mapping

Mappings do not directly specify the particular colors, shapes, or line styles that will appear on the plot.

Rather, they establish which variables in the data will be represented by which visible elements on the plot.
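The difference between mapping a variable and setting a fixed value can be sketched like this. The example assumes ggplot2 and the built-in mtcars data, chosen only for illustration.

```r
library(ggplot2)

# Mapping: color is tied to a variable inside aes();
# ggplot2 chooses the actual colors through a scale
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Setting: color is a fixed value outside aes();
# every point is drawn in blue and no legend is created
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")
```

To change which colors a mapped variable receives, adjust the scale rather than the mapping.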

ggplot2: full structure

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>,
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE_FUNCTION> +
  <THEME_FUNCTION>

  1. Data: The data that you want to visualize
  2. Layers: geom_ and stat_ → The geometric shapes and statistical summaries representing the data
  3. Aesthetics: aes() → Aesthetic mappings of the geometric and statistical objects
  4. Scales: scale_ → Maps between the data and the aesthetic dimensions
  5. Coordinate system: coord_ → Maps data into the plane of the data rectangle
  6. Facets: facet_ → The arrangement of the data into a grid of plots
  7. Visual themes: theme() and theme_ → The overall visual defaults of a plot
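One way to fill in every slot of the full structure is sketched below. It assumes ggplot2 and the built-in mtcars data; the particular scale, coordinate limits, and theme are arbitrary choices made for illustration.

```r
library(ggplot2)

p_full <- ggplot(data = mtcars) +                 # 1. data
  geom_point(                                     # 2. layer
    mapping = aes(x = wt, y = mpg, color = hp),   # 3. aesthetics
    stat = "identity",                            #    plot values as they are
    position = "identity"                         #    no position adjustment
  ) +
  scale_color_viridis_c() +                       # 4. scale
  coord_cartesian(xlim = c(1, 6)) +               # 5. coordinate system
  facet_wrap(~ cyl) +                             # 6. facets
  theme_minimal()                                 # 7. visual theme

p_full
```

In practice most of these slots have sensible defaults, so a typical plot spells out only the ones it needs to change.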

ggplot2: decomposition

There are multiple ways to structure plots with ggplot.

For this presentation, I will stick to the decomposition by Thomas Lin Pedersen, one of the most prominent developers of the ggplot2 and gganimate packages.

These components can be seen as layers, which is why we use the + sign in our ggplot syntax.

Exploratory Analysis

The most common geoms are:

  • geom_bar(), geom_col(): bar charts.
  • geom_boxplot(): box and whiskers plots.
  • geom_density(): density estimates.
  • geom_jitter(): jittered points.
  • geom_line(): line plots.
  • geom_point(): scatter plots.

If you want to know more about layers, you can refer to this.
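A quick sketch of how some of these geoms are swapped on the same data, assuming ggplot2 and the built-in mtcars data (used here only for illustration):

```r
library(ggplot2)

# Shared starting point: fuel economy by number of cylinders
base <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))

p_box <- base + geom_boxplot()            # summary of each group's distribution
p_jit <- base + geom_jitter(width = 0.2)  # raw points, nudged sideways to reduce overlap

# Geoms that need only one variable
p_dens <- ggplot(mtcars, aes(x = mpg)) + geom_density()      # smoothed distribution
p_bar  <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()  # counts rows per group
```

geom_bar() counts rows for you, while geom_col() expects a y value that is already computed.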

Step by step from Garrick Aden-Buie’s gentle guide

Using the gapminder package, let’s start with lifeExp vs gdpPercap

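library(tidyverse)  # loads ggplot2, dplyr (for glimpse), and friends
library(gapminder)  # provides the gapminder data frame

glimpse(gapminder)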
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

ggplot(gapminder) 

The Canvas

ggplot(gapminder) +
  aes(
    x = gdpPercap
  )

The Canvas

ggplot(gapminder) +
  aes(
    x = gdpPercap,
    y = lifeExp
  )

Let’s add points…

ggplot(gapminder) +
  aes(
    x = gdpPercap,
    y = lifeExp
  ) + 
  geom_point() 

How can I tell countries apart? GDP is squished together on the left.

ggplot(gapminder) +
  aes(
    x = gdpPercap,
    y = lifeExp
  ) + 
  geom_point() +
  scale_x_log10()

Still lots of overlap in the countries…

ggplot(gapminder) +
  aes(
    x = gdpPercap,
    y = lifeExp,
    color = continent
  ) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent) +
  guides(color = "none")

No need for color legend thanks to facet titles.

Lots of overplotting due to point size.

ggplot(gapminder) +
  aes(
    x = gdpPercap,
    y = lifeExp,
    color = continent
  ) +
  geom_point(size = 0.25) + 
  scale_x_log10() +
  facet_wrap(~ continent) +
  guides(color = "none")

Is there a trend?

ggplot(gapminder) +
  aes(
    x = gdpPercap,
    y = lifeExp,
    color = continent
  ) +
  geom_line() +
  geom_point(size = 0.25) +
  scale_x_log10() +
  facet_wrap(~ continent) +
  guides(color = "none")

Okay, that line just connected all of the points sequentially…

ggplot(gapminder) +
  aes(
    x = gdpPercap,
    y = lifeExp,
    color = continent
  ) +
  geom_line(
    aes(group = country)
  ) +
  geom_point(size = 0.25) +
  scale_x_log10() +
  facet_wrap(~ continent) +
  guides(color = "none")

Oh no! Too confusing!

ggplot(gapminder) +
  aes(
    x = year,
    y = lifeExp,
    color = continent
  ) +
  geom_line(
    aes(group = country)
  ) +
  geom_point(size = 0.25) +
  scale_y_log10() +
  facet_wrap(~ continent) +
  guides(color = "none")

Let’s put year on the x-axis instead of GDP!

Our goal