Graphite is a library for data visualization, designed to create common types of plots with as little effort as possible, and allowing for easy manipulation of the plot around the data without having to change the structure of the data itself. The interface is loosely modeled around the tidyverse’s ggplot2, so if you’re familiar with it, you should feel at home.
This tutorial is modeled around Kieran Healy’s work Data Visualization: A practical introduction, which uses ggplot2 and R rather than Graphite and Racket.
Graphite is implemented on top of plot, and in no way serves as a replacement for it; rather, it is a complement. As a consequence, Graphite’s functionality is by no means a superset of plot’s.
Your data is comprised of discrete points. Graphite requires all data to be read into a data-frame before creating any visualization. This means that if your plot consists of continuous data (e.g. a function), Graphite is unlikely to fit your needs.
Your plot is intended for use as an image. Graphite exports all plots as a pict. For various reasons, it does not support the interactivity that plot snips provide. Graphite’s generated plots work great in Scribble, or embedded into any other document by saving the file using save-pict.
Your intended plot is 2D. Graphite does not support 3D plots, and it probably never will.
Your data is tidy, and consists of the data you actually want to show. Graphite assumes that your data is in "tidy" form, with each variable being a column and each observation being a row. Data wrangling is outside of the scope of Graphite’s functionality.
You don’t know what data visualization method you want to use. Graphite’s main goal is to prevent significant structural changes when, for example, switching from a scatter plot to a histogram.
If your data is untidy, it will require further processing before being read into a data-frame. Anecdotally, this is not particularly easy in Racket. All data in this tutorial is already tidy.
If your data is primarily continuous, needs to be interactive, or needs to be 3D, plot is likely to be a better fit.
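For the image-export case described above, the following is a minimal sketch of writing a plot to disk. The data-frame is made up for illustration, the output file name is hypothetical, and the two-argument call (save-pict pict path), with the image format taken from the file extension, is an assumption about save-pict’s signature; check the Graphite reference before relying on it.

```racket
#lang racket
(require data-frame graphite)

;; Made-up data, built in memory so the example does not depend on a file.
(define df (make-data-frame))
(df-add-series! df (make-series "x" #:data (vector 1 2 3)))
(df-add-series! df (make-series "y" #:data (vector 2 4 8)))

;; graph returns a pict rather than an interactive snip.
(define plot
  (graph #:data df
         #:mapping (aes #:x "x" #:y "y")
         (points)))

;; Assumed signature: (save-pict pict path). The file name is hypothetical.
(define out
  (build-path (find-system-path 'temp-dir) "graphite-example.png"))
(save-pict plot out)
```

Because the result is an ordinary pict, it can also be passed to anything else that accepts picts, which is what makes embedding in Scribble straightforward.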
Graphite has a number of important forms that are effectively required for anything to come out of it.
First off, in order to plot data, we need to read data in. To do this, Graphite uses the data-frame library. Usually, to read in data from a CSV, you use df-read/csv. Various other formats are supported by data-frame, which you can find by consulting its documentation.
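As a small sketch of this step, the example below reads a CSV from an in-memory string rather than a file, so that it is self-contained. The column names and rows are invented for illustration, and it assumes that df-read/csv accepts an input port as well as a file name.

```racket
#lang racket
(require data-frame)

;; A tiny, made-up CSV held in memory. In practice you would usually
;; pass a file name to df-read/csv instead.
(define csv-text
  (string-append "place,year,score\n"
                 "A,2000,70.5\n"
                 "B,2000,81.25"))

;; call-with-input-string hands df-read/csv an input port over the text.
(define df (call-with-input-string csv-text df-read/csv))

(df-series-names df)    ; names of the columns that were read
(df-row-count df)       ; number of observations (rows)
(df-select df "score")  ; the values of one column, as a vector
```

Each CSV column becomes a series in the data-frame, and those series names are what aesthetic mappings refer to later on.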
An aesthetic mapping, created with the aes form, is a set of key-value pairs, with keys specified by keywords. Each key represents an "aesthetic" to map a variable to, and each value names the variable in the data that is mapped to it. If a value is not present and not mandatory, it is read as #f. Keys can correspond to positional aesthetics (the x-axis with #:x, and the y-axis with #:y), colorings (such as #:discrete-color for points), or another required variable (such as #:perc-error for error-bars). Most importantly, an aesthetic mapping only ever maps an aesthetic to a variable: if you want to set an aesthetic to a constant, you likely want a regular keyword argument instead.
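To make the distinction between a mapping and a constant concrete, here is a hedged sketch. The data-frame and its column names are invented for illustration, and df-add-series! assumes a recent version of the data-frame library. The color of each point is mapped to a variable through aes, while the transparency is a constant passed as an ordinary keyword argument.

```racket
#lang racket
(require data-frame graphite)

;; Hypothetical three-row data-frame, built in memory for illustration.
(define df (make-data-frame))
(df-add-series! df (make-series "gdp" #:data (vector 1000.0 5000.0 20000.0)))
(df-add-series! df (make-series "life" #:data (vector 50.0 65.0 78.0)))
(df-add-series! df (make-series "region" #:data (vector "A" "B" "A")))

(define plot
  (graph #:data df
         ;; Mapped: the x and y positions come from variables in the data.
         #:mapping (aes #:x "gdp" #:y "life")
         (points
          ;; Constant: every point gets the same transparency.
          #:alpha 0.4
          ;; Mapped: the color varies with the "region" variable.
          #:mapping (aes #:discrete-color "region"))))

plot
```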
All data visualization starts with data to visualize, and we begin with excerpts of data from Gapminder: more specifically, we begin with a CSV dump of the data in the Gapminder library for R. This data is already tidy and in the format we want, so we merely read it in as a CSV using df-read/csv from the data-frame library:
> (define gapminder (df-read/csv "data/all_gapminder.csv"))
> (df-describe gapminder)
data-frame: 7 columns, 1704 rows
           NAs     min         max         mean        stddev
#f           0  +inf.0      -inf.0       +nan.0        +nan.0
continent    0  +inf.0      -inf.0       +nan.0        +nan.0
country      0  +inf.0      -inf.0       +nan.0        +nan.0
gdpPercap    0  241.17   113523.13      7215.33       9854.56
lifeExp      0    23.6        82.6        59.47         12.91
pop          0   60011  1318683096  29601212.32  106126742.55
year         0    1952        2007       1979.5         17.26
Let’s break down this code. The main form is graph, which takes a number of keyword arguments. The #:data keyword argument specifies the data-frame that we want to plot.
The #:mapping keyword argument specifies our aes (standing for aesthetics), which dictates how we actually want the data to be shown on the plot. In this case, our mapping states that we want to map the x-axis to the variable gdpPercap, and the y-axis to the variable lifeExp.
Finally, the rest of our arguments dictate our renderers. In this case, the points renderer states that we want each data point to be drawn as a single point.
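The three pieces described above (the data, the mapping, and a renderer) can be sketched together as follows. To stay self-contained, this uses a small stand-in data-frame with the same column names as the Gapminder data; the values in it are made up for illustration and are not taken from the real dataset.

```racket
#lang racket
(require data-frame graphite)

;; Stand-in for the gapminder data-frame. Only the column names match
;; the real data; the values are invented.
(define stand-in (make-data-frame))
(df-add-series! stand-in
                (make-series "gdpPercap" #:data (vector 800.0 4000.0 30000.0)))
(df-add-series! stand-in
                (make-series "lifeExp" #:data (vector 45.0 62.0 79.0)))

(define plot
  (graph #:data stand-in                                ; the data-frame to plot
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")  ; variables for each axis
         (points)))                                     ; draw each row as a point

plot
```

Swapping stand-in for a data-frame read with df-read/csv should leave the rest of the call unchanged, since the mapping refers to variables only by name.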
> (graph #:data gapminder
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points))
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4))
All we’ve done here is add a title and axis labels via their eponymous keyword arguments, and add the #:alpha keyword to the points renderer.
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4)
         (fit #:width 3))
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4)
         (fit #:degree 3 #:width 3))
> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4 #:mapping (aes #:discrete-color "continent"))
         (fit #:width 3))
Now we’re seeing some notable differences from where we started! We made a scatter plot, transformed its axes, labeled it, and added aesthetics to make it more readable.
For this section, we’ll be using a CSV dump of the 2016 GSS (General Social Survey) from its respective R library, a dataset that sociologists continually manage to squeeze more and more insights out of. More importantly, the Gapminder dataset from the previous section has a lot of continuous variables (such as GDP per capita and life expectancy, which we worked with), but few categorical ones (only continent and country). The GSS has a wide variety of categorical variables to work with, making it ideal for making bar charts and histograms.
> (define gss (df-read/csv "data/gss_sm.csv"))
> (df-describe gss)
data-frame: 33 columns, 2867 rows
             NAs     min     max    mean  stddev
#f             0  +inf.0  -inf.0  +nan.0  +nan.0
age           10      18      89   49.16   17.69
agegrp         0  +inf.0  -inf.0  +nan.0  +nan.0
ageq           0  +inf.0  -inf.0  +nan.0  +nan.0
ballot         0       1       3    2.02    0.81
bigregion      0  +inf.0  -inf.0  +nan.0  +nan.0
childs         8       0       8    1.85    1.67
degree         5  +inf.0  -inf.0  +nan.0  +nan.0
grass         57  +inf.0  -inf.0  +nan.0  +nan.0
happy          1  +inf.0  -inf.0  +nan.0  +nan.0
id             0       1    2867    1434  827.63
income16     261  +inf.0  -inf.0  +nan.0  +nan.0
income_rc      0  +inf.0  -inf.0  +nan.0  +nan.0
kids           0  +inf.0  -inf.0  +nan.0  +nan.0
madeg        115  +inf.0  -inf.0  +nan.0  +nan.0
marital        0  +inf.0  -inf.0  +nan.0  +nan.0
obama          0       0       1    0.63    0.48
padeg        613  +inf.0  -inf.0  +nan.0  +nan.0
partners     656  +inf.0  -inf.0  +nan.0  +nan.0
partners_rc    4  +inf.0  -inf.0  +nan.0  +nan.0
partyid       16  +inf.0  -inf.0  +nan.0  +nan.0
polviews      48  +inf.0  -inf.0  +nan.0  +nan.0
pres12       377       1       5    1.42    0.62
race           0  +inf.0  -inf.0  +nan.0  +nan.0
region         0  +inf.0  -inf.0  +nan.0  +nan.0
relig         10  +inf.0  -inf.0  +nan.0  +nan.0
religion       0  +inf.0  -inf.0  +nan.0  +nan.0
sex            0  +inf.0  -inf.0  +nan.0  +nan.0
siblings       0  +inf.0  -inf.0  +nan.0  +nan.0
sibs           3       0      43    3.72    3.21
wtssall        0    0.39    4.31       1     0.5
year           0    2016    2016    2016       0
zodiac        15  +inf.0  -inf.0  +nan.0  +nan.0
Clearly, we have a lot of data to work with here, but a lot of it is categorical, so df-describe’s summary statistics aren’t particularly useful.
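Categorical data like this is usually easier to inspect as a bar chart. The following is only a sketch of one way to do that: it assumes Graphite’s bar renderer and a #:group aesthetic for splitting each bar by a second categorical variable, and it uses a tiny made-up data-frame that borrows the column names bigregion and religion from the GSS. The values are invented and are not real survey responses.

```racket
#lang racket
(require data-frame graphite)

;; Made-up stand-in for the GSS: two categorical columns, six rows.
(define stand-in (make-data-frame))
(df-add-series! stand-in
                (make-series "bigregion"
                             #:data (vector "Northeast" "Northeast" "Northeast"
                                            "South" "South" "South")))
(df-add-series! stand-in
                (make-series "religion"
                             #:data (vector "Protestant" "Catholic" "Protestant"
                                            "Protestant" "Catholic" "Catholic")))

;; One section per region on the x-axis, with one bar per religion
;; inside each section (assumed behavior of the #:group aesthetic).
(define plot
  (graph #:data stand-in
         #:mapping (aes #:x "bigregion" #:group "religion")
         (bar)))

plot
```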
But this is pretty difficult to read as well! There are a lot of bars in each section, and you’re forced to consult the legend for the bar colors for each one. To mitigate this, we can introduce another concept...
In both the Gapminder and GSS examples, we ended with a whole bunch of information crammed into a single visualization. Let’s say we instead wanted to make things more clear. In this case, we can add a facet to our visualization, which creates multiple plots in different panels.
Now we’ve managed to split up our visualization into separate charts for each region.
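A faceted plot of this kind can be sketched as follows. This assumes that faceting is requested through a #:facet aesthetic inside aes, which is how I understand Graphite’s interface; treat the keyword as an assumption to verify against the reference. The data-frame is again a made-up stand-in that only borrows the GSS column names.

```racket
#lang racket
(require data-frame graphite)

;; Made-up stand-in for the GSS: every region contains every religion.
(define stand-in (make-data-frame))
(df-add-series! stand-in
                (make-series "bigregion"
                             #:data (vector "Northeast" "Northeast" "Northeast"
                                            "South" "South" "South")))
(df-add-series! stand-in
                (make-series "religion"
                             #:data (vector "Protestant" "Catholic" "Protestant"
                                            "Protestant" "Catholic" "Catholic")))

;; One panel per region (assumed #:facet aesthetic), each panel showing
;; a bar per religion.
(define plot
  (graph #:data stand-in
         #:mapping (aes #:x "religion" #:facet "bigregion")
         (bar)))

plot
```

Note that only the mapping changes between a grouped chart and a faceted one; the data and the renderer stay the same, which is the kind of low-friction restructuring Graphite aims for.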