Graphite: a guided tour

Graphite is a library for data visualization. It is designed to create common types of plots with as little effort as possible, and to let you change how a plot presents the data without changing the structure of the data itself. The interface is loosely modeled on the tidyverse’s ggplot2, so if you’re familiar with it, you should feel at home.

This tutorial is modeled on Kieran Healy’s book Data Visualization: A Practical Introduction, which uses ggplot2 and R rather than Graphite and Racket.

    1 Deciding what library to use

    2 Key forms

    3 Gapminder

    4 Bar charts

    5 Faceting

1 Deciding what library to use

Graphite is implemented on top of plot. It is a complement to plot rather than a replacement for it, and Graphite’s functionality is not a superset of plot’s.

If your data visualization satisfies some of the following criteria, Graphite would be a good fit:
  • Your data consists of discrete points. Graphite requires all data to be read into a data-frame before creating any visualization. This means that if your plot consists of continuous data (e.g. a function), Graphite is unlikely to fit your needs.

  • Your plot is intended for use as an image. Graphite exports all plots as a pict. For various reasons, it does not support the interactivity that plot snips provide. Graphite’s generated plots work great in Scribble, or embedded into any other document by saving the file using save-pict.

  • Your intended plot is 2D. Graphite does not support 3D plots, and it probably never will.

  • Your data is tidy, and consists of the data you actually want to show. Graphite assumes that your data is in "tidy" form, with each variable being a column and each observation being a row. Data wrangling is outside of the scope of Graphite’s functionality.

  • You don’t know what data visualization method you want to use. Graphite’s main goal is to prevent significant structural changes when, for example, switching from a scatter plot to a histogram.

If your data is untidy, it will require further processing before being read into a data-frame. Anecdotally, this is not particularly easy in Racket. All data in this tutorial is already tidy.
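To make "tidy" concrete, here is a minimal sketch in plain Racket that reshapes a "wide" table (one column per year) into the one-row-per-observation form Graphite expects. The country names, years, and values are made up for illustration, and wide->long is a hypothetical helper, not part of Graphite or data-frame.

```racket
#lang racket/base

;; Hypothetical "wide" data: one row per country, one column per year.
(define years '(2002 2007))
(define wide-rows
  '(("Freedonia" 70.1 71.3)
    ("Sylvania"  65.0 66.2)))

;; Tidy ("long") form: one row per country-year observation.
;; Each output row is (country year value).
(define (wide->long rows years)
  (for*/list ([row (in-list rows)]
              [(year value) (in-parallel (in-list years)
                                         (in-list (cdr row)))])
    (list (car row) year value)))

(define long-rows (wide->long wide-rows years))
```

Each element of long-rows is then a single observation, which is the shape a CSV should have before it is read into a data-frame.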

If your data is primarily continuous, needs to be interactive, or needs to be 3D, plot is likely to be a better fit.
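For comparison, here is a small sketch of the kind of plot that is better left to plot itself: a continuous function, which has no rows to put into a data-frame. It uses plot-pict from plot/no-gui so that the result is a pict, as with Graphite; the choice of sin and of the labels is arbitrary.

```racket
#lang racket
(require plot/no-gui)

;; A continuous function is drawn directly with plot's `function`
;; renderer; no data-frame is involved.
(define sine-plot
  (plot-pict (function sin (- pi) pi #:label "sin(x)")
             #:title "A continuous function"
             #:x-label "x"
             #:y-label "y"))
```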

2 Key forms

A few of Graphite’s forms are needed for practically every plot.

First off, in order to plot data, we need to read data in. To do this, Graphite uses the data-frame library. Usually, to read in data from a CSV, you use df-read/csv. Various other formats are supported by data-frame, which you can find by consulting its documentation.
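As a quick sketch of what reading data looks like, the following reads a tiny CSV from an in-memory string instead of a file, so that it runs without any data on disk. The CSV contents are made up for illustration. It assumes that df-read/csv accepts an input port as well as a path.

```racket
#lang racket/base
(require data-frame)

;; Hypothetical tidy data: one row per observation,
;; one column per variable.
(define csv-text
  (string-append "country,year,lifeExp\n"
                 "Freedonia,2002,70.1\n"
                 "Freedonia,2007,71.3\n"))

;; Assumption: df-read/csv accepts an input port in place of a path.
(define df (df-read/csv (open-input-string csv-text)))

(df-series-names df)   ; the column names
(df-row-count df)      ; the number of observations
```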

The main form for Graphite-generated plots is the function graph. It takes two mandatory keyword arguments, #:data and #:mapping: respectively, the data-frame? to read your data from and the aes? that supplies the global aesthetics (more on that later). It also takes any number of renderers, such as points and lines, as positional arguments. A typical call to graph looks like this:
(graph #:data some-data
       #:mapping (aes #:some-key some-value)
       #:title "a title"
       (renderer-1)
       (renderer-2))

An aesthetic mapping, created with the aes form, is a set of key-value pairs, with keys specified by keywords. Each key names an "aesthetic", and each value names the variable in the data to map to that aesthetic. If a value is not present and not mandatory, it is read as #f. Aesthetics can be positional (the x-axis with #:x, and the y-axis with #:y), colorings (such as #:discrete-color to points), or another required variable (such as #:perc-error to error-bars). Most importantly, an aesthetic mapping only ever maps an aesthetic to a variable: if you want to set an aesthetic to a constant, you likely want a regular keyword argument instead.

graph takes a global aesthetic mapping, and each renderer can take its own local mapping as well, both via the #:mapping keyword. A renderer uses the aesthetics from both the global mapping passed to graph and the local mapping passed to it; where the two disagree, the local mapping wins. So, for example, in this code:
(graph #:data some-data
       #:mapping (aes #:foo "bar" #:baz "quux")
       (renderer-1 #:mapping (aes #:baz "waldo" #:fruit "apple")))
the aesthetic mapping that is actually used in renderer-1 would be (aes #:foo "bar" #:baz "waldo" #:fruit "apple"), inheriting #:foo from the global mapping and overriding #:baz.
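The inheritance rule can be sketched in plain Racket by modeling a mapping as an immutable hash from aesthetic keyword to variable name. This is only an illustration of the rule: merge-mappings is a hypothetical helper, and nothing here reflects how Graphite actually represents an aes.

```racket
#lang racket/base

;; Illustration only: a mapping is modeled as an immutable hash.
;; Start from the global mapping; every key in the local mapping is
;; added, replacing any global entry with the same key.
(define (merge-mappings global local)
  (for/fold ([merged global])
            ([(key value) (in-hash local)])
    (hash-set merged key value)))

(define global-mapping (hash '#:foo "bar" '#:baz "quux"))
(define local-mapping  (hash '#:baz "waldo" '#:fruit "apple"))

;; #:foo is inherited, #:baz is overridden, #:fruit is added.
(define effective-mapping
  (merge-mappings global-mapping local-mapping))
```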

3 Gapminder

All data visualization starts with data to visualize, and we begin with excerpts of data from Gapminder: more specifically, we begin with a CSV dump of the data in the Gapminder library for R. This data is already tidy and in the format we want, so we merely read it in as a CSV using df-read/csv from the data-frame library:

> (define gapminder (df-read/csv "data/all_gapminder.csv"))

We can take a quick look at our data using the df-describe function, to determine what data we actually have:
> (df-describe gapminder)

data-frame: 7 columns, 1704 rows
properties:
series:
              NAs           min           max          mean        stddev
  #f            0        +inf.0        -inf.0        +nan.0        +nan.0
  continent     0        +inf.0        -inf.0        +nan.0        +nan.0
  country       0        +inf.0        -inf.0        +nan.0        +nan.0
  gdpPercap     0        241.17     113523.13       7215.33       9854.56
  lifeExp       0          23.6          82.6         59.47         12.91
  pop           0         60011    1318683096   29601212.32  106126742.55
  year          0          1952          2007        1979.5         17.26

So, we know that we have GDP per capita and life expectancy in our dataset. Let’s say we wanted to make a scatterplot of these two variables against each other. Using Graphite, that would look like this:
> (graph #:data gapminder
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         (points))

image

Let’s break down this code. The main form is graph, which takes a number of keyword arguments. The #:data keyword argument specifies the data-frame that we want to plot.

The #:mapping keyword argument specifies our aes (standing for aesthetics), which dictates how we actually want the data to be shown on the plot. In this case, our mapping states that we want to map the x-axis to the variable gdpPercap, and the y-axis to the variable lifeExp.

Finally, the rest of our arguments dictate our renderers. In this case, the points renderer states that we want each data point to be drawn as a single point.

This plot is fine, but it’s rather unenlightening: we have a lot of blank space towards the bottom. This can be remedied by adding a logarithmic transform on the x-axis. We could specify this manually, but Graphite already has a log transform predefined:
> (graph #:data gapminder
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points))

image

The #:x-transform keyword argument takes a transform?, which combines a plot transform and ticks. In this case, we use logarithmic-transform, which Graphite predefines.

This plot is starting to look nicer, but it still doesn’t tell us much. We don’t know anything about each country or how they’re stratified, we can’t tell how many countries are present at any given point, we can’t make out a meaningful relationship aside from "probably linear-ish", and we haven’t labeled our axes. We can start by adding labels, and by setting the alpha value of the renderer so that regions where many points overlap show up darker:
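To see what "a plot transform and ticks" refers to, here is a sketch using plain plot rather than Graphite. In plot, the axis transform and the tick layout are two separate parameters, and a logarithmic axis needs both set. The function being plotted and its bounds are arbitrary choices for illustration.

```racket
#lang racket/base
(require plot/no-gui)

;; In plain plot, the transform (how positions are stretched along
;; the axis) and the ticks (where labels are placed) are set
;; separately; a logarithmic x-axis needs both.
(define log-axis-plot
  (parameterize ([plot-x-transform log-transform]
                 [plot-x-ticks (log-ticks)])
    (plot-pict (function (λ (x) (log x)) 1 100000)
               #:x-label "x (log scale)"
               #:y-label "log x")))
```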
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4))

image

All we’ve done here is add a title and axis labels via the corresponding keyword arguments, and pass the #:alpha keyword to the points renderer.

We can then start thinking about relationships between all the data-points. Let’s say we wanted to add a linear fit to our plot. Then, we can use the fit renderer:
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4)
         (fit #:width 3))

image
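To make "linear fit" concrete, here is a sketch of ordinary least squares for a straight line in plain Racket, run on made-up points. It is only meant to show what a degree-1 fit computes; linear-fit is a hypothetical helper, and this is not a description of how Graphite’s fit renderer is implemented.

```racket
#lang racket/base

;; Ordinary least squares for y = slope * x + intercept.
;; xs and ys are equal-length lists of real numbers.
(define (linear-fit xs ys)
  (define n (length xs))
  (define mean-x (/ (apply + xs) n))
  (define mean-y (/ (apply + ys) n))
  ;; Sum of products of deviations, and sum of squared x deviations
  (define sxy (for/sum ([x (in-list xs)] [y (in-list ys)])
                (* (- x mean-x) (- y mean-y))))
  (define sxx (for/sum ([x (in-list xs)])
                (* (- x mean-x) (- x mean-x))))
  (define slope (/ sxy sxx))
  (values slope (- mean-y (* slope mean-x))))

;; Made-up points that lie exactly on y = 2x + 1
(define-values (slope intercept)
  (linear-fit '(1 2 3) '(3 5 7)))
```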

fit defaults to a linear fit (of degree 1), but you can instead do a fit using a higher-degree polynomial with the optional #:degree argument:
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4)
         (fit #:degree 3 #:width 3))

image

A higher-degree fit like this one is ill-advised for the relationship we see here, however.

Finally, let’s try to tease out different relationships for each continent. Each renderer also takes its own mapping, which can be used to map some aesthetic to a variable. We can stratify the points alone by passing the aesthetic #:discrete-color to points, which lets us pick a categorical variable to color by, in this case "continent":
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4 #:mapping (aes #:discrete-color "continent"))
         (fit #:width 3))

image

Now we’re seeing some notable differences from where we’ve started! We made a scatter plot, transformed its axes, labeled it, and added aesthetics to make it more readable.

4 Bar charts

For this section, we’ll be using a CSV dump of the 2016 GSS (General Social Survey) from its respective R library, a dataset that sociologists continually manage to squeeze more and more insights out of. More importantly, the Gapminder dataset from the previous section has a lot of continuous variables (such as GDP per capita and life expectancy, which we worked with), but only a couple of categorical ones (country and continent). The GSS has a wide variety of categorical variables to work with, making it ideal for making bar charts and histograms.

Similarly to last time, we load it up and take a gander:
> (define gss (df-read/csv "data/gss_sm.csv"))
> (df-describe gss)

data-frame: 33 columns, 2867 rows
properties:
series:
                NAs           min           max          mean        stddev
  #f              0        +inf.0        -inf.0        +nan.0        +nan.0
  age            10            18            89         49.16         17.69
  agegrp          0        +inf.0        -inf.0        +nan.0        +nan.0
  ageq            0        +inf.0        -inf.0        +nan.0        +nan.0
  ballot          0             1             3          2.02          0.81
  bigregion       0        +inf.0        -inf.0        +nan.0        +nan.0
  childs          8             0             8          1.85          1.67
  degree          5        +inf.0        -inf.0        +nan.0        +nan.0
  grass          57        +inf.0        -inf.0        +nan.0        +nan.0
  happy           1        +inf.0        -inf.0        +nan.0        +nan.0
  id              0             1          2867          1434        827.63
  income16      261        +inf.0        -inf.0        +nan.0        +nan.0
  income_rc       0        +inf.0        -inf.0        +nan.0        +nan.0
  kids            0        +inf.0        -inf.0        +nan.0        +nan.0
  madeg         115        +inf.0        -inf.0        +nan.0        +nan.0
  marital         0        +inf.0        -inf.0        +nan.0        +nan.0
  obama           0             0             1          0.63          0.48
  padeg         613        +inf.0        -inf.0        +nan.0        +nan.0
  partners      656        +inf.0        -inf.0        +nan.0        +nan.0
  partners_rc     4        +inf.0        -inf.0        +nan.0        +nan.0
  partyid        16        +inf.0        -inf.0        +nan.0        +nan.0
  polviews       48        +inf.0        -inf.0        +nan.0        +nan.0
  pres12        377             1             5          1.42          0.62
  race            0        +inf.0        -inf.0        +nan.0        +nan.0
  region          0        +inf.0        -inf.0        +nan.0        +nan.0
  relig          10        +inf.0        -inf.0        +nan.0        +nan.0
  religion        0        +inf.0        -inf.0        +nan.0        +nan.0
  sex             0        +inf.0        -inf.0        +nan.0        +nan.0
  siblings        0        +inf.0        -inf.0        +nan.0        +nan.0
  sibs            3             0            43          3.72          3.21
  wtssall         0          0.39          4.31             1           0.5
  year            0          2016          2016          2016             0
  zodiac         15        +inf.0        -inf.0        +nan.0        +nan.0

Clearly, we have a lot of data to work with here, but a lot of it is categorical, so df-describe’s summary statistics aren’t particularly useful.

We start off by taking a look at the variable religion, which contains a condensed version of the religions in the GSS. (The variable relig is more descriptive, but has too many categories for simple examples.) We use the bar renderer, with no arguments, to take a look at the count:
> (graph #:data gss
         #:title "Religious preferences, GSS 2016"
         #:mapping (aes #:x "religion")
         (bar))

image

Let’s say that we wanted to, instead, look at the proportion of each religion among the whole, rather than its individual count. We can specify this with the #:mode argument of bar, which can be either 'count or 'prop, with 'count being the default behavior we saw before:
> (graph #:data gss
         #:title "Religious preferences, GSS 2016"
         #:mapping (aes #:x "religion")
         (bar #:mode 'prop))

image
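The difference between the two modes can be sketched in plain Racket: a count is how often each category occurs, and a proportion is that count divided by the total number of observations. The answers below are made up, and category-counts and category-proportions are hypothetical helpers, not Graphite functions.

```racket
#lang racket/base

;; Count how often each category occurs in a list of observations.
(define (category-counts observations)
  (for/fold ([counts (hash)])
            ([obs (in-list observations)])
    (hash-update counts obs add1 0)))

;; Divide each count by the total number of observations.
(define (category-proportions observations)
  (define total (length observations))
  (for/hash ([(category n) (in-hash (category-counts observations))])
    (values category (/ n total))))

;; Made-up survey answers
(define answers '("Protestant" "Catholic" "Protestant" "None"))

(category-counts answers)
(category-proportions answers)
```

The proportions always sum to 1, which is why the y-axis in 'prop mode runs from 0 to 1.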

With the y-axis representing proportions from 0 to 1, we now have a good idea of what’s going on here. Similarly to the last example with Gapminder, let’s say that we wanted to split on region, cross-classifying between the categorical variables religion and bigregion (Northeast/Midwest/South/West, in the US). To accomplish this, we can put bigregion on the x-axis, and then "dodge" on the variable religion, effectively giving each region its own small bar chart. To do this, we use the aesthetic #:group, and adjust the plot size:
> (graph #:data gss
         #:title "Religious preferences among regions, GSS 2016"
         #:mapping (aes #:x "bigregion" #:group "religion")
         #:width 600 #:height 400
         (bar #:mode 'prop))

image

But this is pretty difficult to read as well! There are a lot of bars in each section, and you’re forced to consult the legend to find out what each bar color means. To mitigate this, we can introduce another concept...

5 Faceting

In both the Gapminder and GSS examples, we ended with a whole bunch of information crammed into a single visualization. Let’s say we instead wanted to make things clearer. In this case, we can add a facet to our visualization, which creates multiple plots in different panels.

Faceting is a global aesthetic (used by graph, and not individual renderers), dictated by the keyword #:facet. To use it, we specify it to be a variable, and graph will split the plot up for us. Take the previous GSS example: let’s say we wanted to instead facet on bigregion, and then have each subplot represent religious preferences in that region. In that case:
> (graph #:data gss
         #:mapping (aes #:x "religion" #:facet "bigregion")
         #:width 700 #:height 700
         #:x-conv (λ (x) (substring x 0 2))
         (bar #:mode 'prop))

image

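Now we’ve managed to split up our visualization into separate charts for each region. The #:x-conv function is applied to each x value before plotting; here it truncates each religion name to its first two characters, which keeps the axis labels short in the smaller panels.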
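Conceptually, faceting partitions the rows by the value of one variable and then draws the same plot once per partition. A minimal sketch of the partitioning step in plain Racket follows, using made-up rows. facet-rows is a hypothetical helper, and this is not a description of Graphite’s internals.

```racket
#lang racket/base

;; Made-up rows of the form (region religion)
(define rows
  '(("West" "None")
    ("South" "Protestant")
    ("West" "Catholic")
    ("South" "Protestant")))

;; Partition rows into panels, keyed by (key row).
;; Rows keep their original order within each panel.
(define (facet-rows rows key)
  (for/fold ([panels (hash)])
            ([row (in-list rows)])
    (hash-update panels
                 (key row)
                 (λ (seen) (append seen (list row)))
                 '())))

;; One panel per distinct region
(define panels (facet-rows rows car))
```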

We can do this in a similar fashion for the Gapminder plot from before, instead of using #:discrete-color:
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp" #:facet "continent")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4)
         (fit #:width 3))

image