data-frame
1 Rationale
2 Creating data frames
data-frame?
series?
make-data-frame
make-series
df-add-series
df-del-series
df-add-derived
df-add-lazy
df-set-sorted
df-set-contract
3 Loading data into data-frames and saving it out again
df-read/  sql
df-read/  csv
df-write/  csv
df-read/  gpx
df-write/  gpx
4 Inspecting and extracting data
df-describe
df-series-names
df-property-names
df-contains?
df-contains/  any?
df-put-property
df-get-property
df-del-property
df-row-count
df-select
df-select*
in-data-frame
in-data-frame/  list
df-index-of
df-index-of*
df-ref
df-ref*
df-set!
df-lookup
df-lookup*
df-lookup/  interpolated
df-map
df-for-each
df-fold
df-count-na
df-is-na?
df-has-na?
df-has-non-na?
5 Statistics
df-set-default-weight-series
df-get-default-weight-series
df-statistics
df-quantile
6 Histograms and histogram plots
df-histogram
histogram-renderer
combine-histograms
histogram-renderer/  dual
histogram-renderer/  factors
7.0

data-frame

Alex Harsanyi

 (require data-frame) package: data-frame

A data frame is a data structure used to hold data in tables with rows and columns. It is meant for conveninent access and manipulation of relatively large data sets, however these data sets must fit into the process memory.

1 Rationale

Consider an example: during a sport activity, a sport watch will record data at periodic intervals, usually once every second. The data recorded might be time stamp, latitude, longitude, heart rate, distance, speed, cadence, power, etc. A high end sports watch will record up to 40 such measurements every second.

A simple approach for representing this data is to define a "DataPoint" structure, containing members for each possible values and represent the entire activity as a vector of data points. This approach has several problems:

A data frame object addresses the problems above by storing measurements for the same parameter together. Essentially, all heart rate measurements are stored together in a vector, all cadence measurements are stored together in a different vectors. All these vectors have the same number of elements (the number of data points) and the same position in each such vector represents data at a certain point in time. This data organization has some advantages:

2 Creating data frames

The functions below allow constructing new data frames. They are mainly intended for writing functions that load data into data frames from different sources.

procedure

(data-frame? df)  boolean?

  df : any/c
Return #t if df is a data frame

procedure

(series? series)  boolean?

  series : any/c
Return #t if series is a data series

procedure

(make-data-frame)  data-frame?

Return a new empty data frame

procedure

(make-series name    
  #:data data    
  #:cmpfn cmpfn    
  #:na na    
  #:contract contractfn)  series?
  name : string?
  data : vector?
  cmpfn : (or/c #f (-> any/c any/c boolean?))
  na : any/c
  contractfn : (-> any/c boolean?)
Create a new data series named name with contents from data.

cmpfn specifies an ordering function to use. If present, values can be looked up in this series using df-index-of and df-lookup. The data must be ordered according to this function

na specifies the "not available" value for this series, by default it is #f

contractfn is a contract function. If present, all values in the data series, except NA values must satisfy this contract.

procedure

(df-add-series df series)  any/c

  df : data-frame?
  series : series?
Add a new series to the data frame. If the data frame is empty, the series can have any number of elements, otherwise it must have the same number of elements as the other series in the data frame. See also df-row-count, make-series

procedure

(df-del-series df name)  any/c

  df : data-frame?
  name : string?
Remove the series named name from the data frame df

procedure

(df-add-derived df name base-series value-fn)  any/c

  df : data-frame?
  name : string?
  base-series : (listof string?)
  value-fn : mapfn/c
Add a new series named name to the data frame df with values that are computed from existing series. The data for the series is created using df-map by applying value-fn on base-series and the resulting data is added to the data frame. See df-map for notes on the value-fn.

If a series named name already exists in the data frame, it will be replaced.

procedure

(df-add-lazy df name base-series value-fn)  any/c

  df : data-frame?
  name : string?
  base-series : (listof string?)
  value-fn : mapfn/c
Add a new series to the data frame, but delay creating it until it is referenced. This function allows adding many series to a data frame, with the expectation that the cost to create those series is paid when (and if) they are used. See df-add-derived for the parameter names.

procedure

(df-set-sorted df name cmpfn)  any/c

  df : data-frame?
  name : string?
  cmpfn : 
(or/c
#f (-> any/c any/c boolean?))
Mark the series name inside the data frame df as sorted according to cmpfn. This does not actually sort the data series, it just tells the data frame that the series can be used for index lookup by df-index-of and df-lookup. An error is raised if the series is not actually sorted or if it contains NA values.

procedure

(df-set-contract df name contractfn)  any/c

  df : data-frame?
  name : string?
  contractfn : (or/c #f (-> any/c boolean?))
Set the contract for values in the data frame df series name to contractfn. An exception is thrown if not all values in the series match contractfn or are NA. The contractfn need not return #t for the NA value.

3 Loading data into data-frames and saving it out again

The functions construct data frames by loading data from different sources.

procedure

(df-read/sql db query param ...)  data-frame?

  db : connection?
  query : (or/c string? virtual-statement?)
  param : any/c
Create a data frame from the result of running query on the database db with the supplied list of parameters. Each column from the result set will become a series in the data frame, sql-null values will be converted to #f.

procedure

(df-read/csv input    
  #:headers? headers?    
  #:na na)  data-frame?
  input : (or/c path-string? input-port?)
  headers? : boolean?
  na : string?
Read CSV data in a data frame from the input which is either a port or a string, in which case it is assumed to be a file name. If headers? is true, the first row in input becomes the names of the columns, otherwise, the columns will be named "col1", "col2", etc. The first row defines the number of columns: if subsequent rows have fewer cells, they are padded with #f, if it has more, they are silently truncated.

na represents the value in the CSV file that represents the "not available" value in the data frame. Strings equal? to this value will be replaced by #f.

procedure

(df-write/csv df    
  output    
  #:start start    
  #:stop stop    
  series ...)  any/c
  df : data-frame?
  output : (or/c path-string? output-port?)
  start : exact-nonnegative-integer?
  stop : exact-nonnegative-integer?
  series : string?
Write the data frame df to output which is either an output port or a string, in which case it is assumed to be a file name. The series to be written out can be specified as the series list. If no series are specified, all series in the data frame are written out as columns in an unspecified order.

start and stop denote the beginning and end rows to be written out, by default all rows are written out.

procedure

(df-read/gpx input)  data-frame?

  input : (or/c path-string? input-port?)
Construct a data frame from the GPX document specified in input, which is either an input port or a string, in which case it denotes an input file. The data frame will have "timestamp", "lat", "lon", "alt", "dst" and "grade" series, the last two series computed and not read from the GPX file.

The data frame will also have the following properties:

Only the first track segment in the GPX file will be read.

procedure

(df-write/gpx df output #:name name)  any/c

  df : data-frame?
  output : (or/c path-string? output-port?)
  name : (or/c #f string?)
Export the GPS track from the data frame df to output, which is either an output port or a string, in which case it denotes a file name. The data frame is expected to contain the "timestamp", "lat", "lon" series, and optionally "alt" or "calt" (corrected altitude) series.

The entire GPS track is exported as a single track segment.

The laps property, if present, is assumed to contain a list of timestamps and the positions corresponding to these timestamps are exported as way points.

The name of the segment can be specified as the name parameter. If this is #f, the 'name property in the data frame is consulted, if that one is missing a default track name is used.

4 Inspecting and extracting data

procedure

(df-describe df)  any/c

  df : data-frame?
Print to the current-output-port a nice description of df, This is useful in interactive mode.

procedure

(df-series-names df)  (listof string?)

  df : data-frame?
Return the series names in the data frame df, as a list of strings. The names are returned in an unspecified order.

procedure

(df-property-names df)  (listof symbol?)

  df : data-frame?
Return the property names in the data frame df, as a list of symbols. The names are returned in an unspecified order.

procedure

(df-contains? df series ...)  boolean?

  df : (data-frame?)
  series : string?
Return #t if the data frame df contains all the series specified as arguments.

procedure

(df-contains/any? df series ...)  boolean?

  df : (data-frame?)
  series : string?
Return #t if the data frame df contains at least one of the series specified as arguments.

procedure

(df-put-property df key value)  any/c

  df : data-frame?
  key : symbol?
  value : any/c
Set the property key to value inside the data frame df. If there is already a value for the property key, it is replaced.

procedure

(df-get-property df key [default])  any/c

  df : data-frame?
  key : symbol?
  default : (-> any/c) = (lambda () #f)
Return the value for the property key in the data frame df. If there is no value for key, the default function is called to return a value (the default just returns #f)

procedure

(df-del-property df key)  any/c

  df : data-frame?
  key : symbol?
Delete the value for the property key from the data frame df. Does nothing if there is no value for the property key.

procedure

(df-row-count df)  exact-nonnegative-integer?

  df : data-frame?
Return the number of rows in the data frame df. All series inside the data frame have the same number of rows.

procedure

(df-select df    
  series    
  [#:filter filter    
  #:start start    
  #:stop stop])  vector?
  df : data-frame?
  series : string?
  filter : (or/c #f (-> any/c any/c)) = #f
  start : index/c = 0
  stop : index/c = (df-row-count df)
Return a vector with the values in the series series from the data frame df.

start and stop indicate the first and one-before-last row to be selected. filter, when present, will filter values selected: only values for which the function returns #t will be added to the resulting vector.

If there is no filter specified, the resulting vector will have (- stop start) elements. If there is a filter, the number of elements depends on how many are filtered out by this function.

procedure

(df-select* df    
  #:filter filter    
  [#:start start    
  #:stop stop]    
  series ...)  vector?
  df : data-frame?
  filter : (or/c #f (-> any/c any/c))
  start : index/c = 0
  stop : index/c = (df-row-count df)
  series : string?
Return a vector containing elements from the series of the data frame df. Each element in the result is a vector containing values from series at the corresponding row.

start and stop indicate the first and one-before-last row to be selected. filter, when present, will filter values selected: only values for which the function returns #t will be added to the resulting vector.

If there is no filter specified, the resulting vector will have (- stop start) elements. If there is a filter, the number of elements depends on how many are filtered out by this function.

procedure

(in-data-frame df    
  [#:start start    
  #:stop stop]    
  series ...)  sequence?
  df : data-frame?
  start : index/c = 0
  stop : index/c = (df-row-count df)
  series : string?
Return a sequence that produces values from a list of series between start and stop rows. The sequence produces values, each one corresponding to one of the series.

This is intended to be used in for and related constructs to iterate over elements in the data frame:

(for (([lat lon] (in-data-frame df "lat" "lon")))
  (printf "lat = ~a, lon = ~a~%" lat lon))

procedure

(in-data-frame/list df    
  [#:start start    
  #:stop stop]    
  series ...)  sequence?
  df : data-frame?
  start : index/c = 0
  stop : index/c = (df-row-count df)
  series : string?
Like in-data-frame, except the sequence produce a single value, which is a list with all the values from the selected series:

(for ((coord (in-data-frame/list df "lat" "lon")))
   (match-define (list lat lon) coord)
   (printf "lat = ~a, lon = ~a~%" lat lon))

procedure

(df-index-of df series value)  index/c

  df : data-frame?
  series : string?
  value : any/c

procedure

(df-index-of* df series value ...)  (listof index/c)

  df : data-frame?
  series : string?
  value : any/c
Find the position of a value or list of values in a series of the data frame df. Returns either a single value or a list of values.

The series must be sorted, see df-set-sorted, otherwise the calls will raise an error.

The value need not be present in the series, in that case, the returned index is the position of the first element which comes after the value, according to the sort function. This is the position where value could be inserted and still keep the series sorted. A value of 0 is returned if value is less or equal than the first value of the series and a value of (df-row-count df) is returned if the value is greater than all the values in series.

procedure

(df-ref df index series)  any/c

  df : data-frame?
  index : index/c
  series : string?

procedure

(df-ref* df index series ...)  vector?

  df : data-frame?
  index : index/c
  series : string?
Return the value at index for series in the data frame df. The second form allows referencing values from multiple series, and a vector containing the values is returned in this case.

procedure

(df-set! df index value series)  any/c

  df : data-frame?
  index : index/c
  value : any/c
  series : string?
Update the value at index in the series to value. The new value must keep the series sorted (if the series is sorted) and match the series contract, if any.

procedure

(df-lookup df base-series series value)  any/c

  df : data-frame?
  base-series : string?
  series : (or/c string? (listof string?))
  value : any/c

procedure

(df-lookup* df base-series series value ...)  list?

  df : data-frame?
  base-series : string?
  series : (or/c string? (listof string?))
  value : any/c
Lookup the index for value in base-series and return the corresponding value in series. if series is a single string, a single value is returned, if it is a list of names, a list of values is returned.

df-lookup* allows looking up multiple values and will return a list of the corresponding values.

These functions combine df-index-of and df-ref into a single function.

procedure

(df-lookup/interpolated df    
  base-series    
  series    
  value    
  #:interpolate interpolate    
  [lambda])  any/c
  df : data-frame?
  base-series : string?
  series : (or/c string? (listof string?))
  value : any/c
  interpolate : (-> real? any/c any/c any/c)
  lambda : (t v1 v2) = (+ (* t v1) (* (- 1 t) v2))
Perform an interpolated lookup: same as df-lookup, but if value is not found exactly in base-series, it’s relative position is determined and it is used to interpolate values from the corresponding series.

An interpolation function can be specified, if the default one is not sufficient. This function is called once for each value resulting series (i.e. it interpolates values one by one).

procedure

(df-map df    
  series    
  fn    
  [#:start start    
  #:stop stop])  vector?
  df : data-frame?
  series : (or/c string? (listof string?))
  fn : mapfn/c
  start : index/c = 0
  stop : index/c = (df-row-count df)
Map the function fn over a list of series between start and stop rows. Returns a vector with the values that fn returns.

fn is a function of ether one or two arguments. If fn is a function with one argument, it is called with the values from all series as a single vector. If fn is a function of two arguments, it is called with the current and previous set of values, as vectors (this allows calculating "delta" values). I.e. fn is invoked as (fn prev current). If fn accepts two arguments, it will be invoked as (fn #f current) for the first element of the iteration.

procedure

(df-for-each df    
  series    
  fn    
  [#:start start    
  #:stop stop])  void
  df : data-frame?
  series : (or/c string? (listof string?))
  fn : mapfn/c
  start : index/c = 0
  stop : index/c = (df-row-count df)
Same as df-map, but all values returned by fn are discarded and the function returns nothing.

procedure

(df-fold df    
  series    
  init-value    
  fn    
  [#:start start    
  #:stop stop])  any/c
  df : data-frame?
  series : (or/c string? (listof string?))
  init-value : any/c
  fn : foldfn/c
  start : index/c = 0
  stop : index/c = (df-row-count df)
Fold a function fn over a list of series between start and stop rows. init-val is the initial value for the fold operation. The last value returned by fn is returned by this function.

fn is a function of ether two or three arguments. If fn is a function with two arguments, it is called with the fold value plus the values from all series is passed in as a single vector. If fn is a function of three arguments, it is called with the fold value plus the current and previous set of values, as vectors (this allows calculating "delta" values). I.e. fn is invoked as (fn val prev current). If fn accepts two arguments, it will be invoked as (fn init-val #f current) for the first element of the iteration.

procedure

(df-count-na df series)  exact-nonnegative-integer?

  df : data-frame?
  series : string?
Return the number of “NA” values in the series.

procedure

(df-is-na? df series value)  boolean?

  df : data-frame?
  series : string?
  value : any/c
Return #t if value is equal? to the “NA” value in the series. Each series in a data frame can have a different “Not available” value, but this value usually defaults to #f

procedure

(df-has-na? df series)  boolean?

  df : data-frame?
  series : string?
Return #t if series has any “NA” values.

procedure

(df-has-non-na? df series)  boolean?

  df : data-frame?
  series : string?
Return #t if series has any values outside the “NA” values.

(df-shallow-copy (-> data-frame? data-frame?)) (valid-only (-> any/c boolean?)))

5 Statistics

The following functions allow calculating statistics on data frame series. They build on top of the math/statistics module.

procedure

(df-set-default-weight-series df series)  any/c

  df : data-frame?
  series : (or/c #f string?)

procedure

(df-get-default-weight-series df)  (or/c #f string?)

  df : data-frame?
Set or return the default weight series for statistics operations. This series will be used as a weight series if none is specified for df-statistics or df-quantile. Set it to #f for no weight series to be used for statistics.

A weight series needs to be used when samples in the data frame don’t have equal weight. For example, if a parameter (e.g. heart rate) is recorded at variable intervals, simply averaging the values will not produce an accurate average, if a timer series is also present, it can be used as a weight series to produce a better average.

procedure

(df-statistics df 
  series 
  [#:weight-series weight-series 
  #:start start 
  #:stop stop]) 
  (or/c #f statistics?)
  df : data-frame?
  series : string?
  weight-series : string? = (df-get-default-weight-series df)
  start : exact-nonnegative-integer? = 0
  stop : exact-nonnegative-integer? = (df-row-count df)
Compute statistics for series in the data frame df. This calls update-statistics for the values in the series. The statistics computation will use weighting if a weight series is defined for the data frame, see df-set-default-weight-series.

procedure

(df-quantile df 
  series 
  #:weight-series string? 
  [#:less-than less-than] 
  qvalue ...) 
  (or/c #f (listof real?))
  df : data-frame?
  series : string?
  string? : (df-get-default-weight-series df)
  less-than : (-> any/c any/c boolean?) = <
  qvalue : (between/c 0 1)
Return the quantiles for the series in the data frame df. A list of quantiles is returned as specified by qvalue, or if no quantiles are specified, the list (0 0.25 0.5 1) is used. #:weight-series has the usual meaning, less-than is the ordering function passed to the quantile function.

6 Histograms and histogram plots

procedure

(df-histogram df 
  series 
  [#:weight-series weight-series 
  #:bucket-width bucket-width 
  #:trim-outliers trim-outliers 
  #:include-zeroes? include-zeroes? 
  #:as-percentage? as-percentage?]) 
  (or/c #f histogram/c)
  df : data-frame?
  series : string?
  weight-series : (or/c #f string?)
   = (df-get-default-weight-series df)
  bucket-width : real? = 1
  trim-outliers : (or/c #f (between/c 0 1)) = #f
  include-zeroes? : boolean? = #t
  as-percentage? : boolean? = #f
Create a histogram for series from the data frame df. The returned is a vector of values, each value is a vector of two values, the sample and the rank of that sample.

weight-series specifies the series to be used for weighting the samples. By default, it it uses the 'weight property stored in the data-frame, see df-set-default-weight-series. Use #f for no weighting, in this case, each sample will have a weight of 1.

bucket-width specifies the width of each histogram slot. Samples in the data series are grouped together into slots, which are from 0 to bucket-width, than from bucket-width to (* 2 bucket-width) and so on. The bucket-width value can be less than 1.0.

trim-outliers specifies to remove slots from both ends of the histogram that contain less than the specified percentage of values. When #f on slots are trimmed.

include-zeroes? specifies whether samples with a slot of 0 are included in the histogram or not. Note that slot 0 contains samples from 0 to bucket-width.

as-percentage? determines if the data in the histogram represents a percentage (all ranks add up to 100) or it is the rank of each slot.

procedure

(histogram-renderer histogram 
  [#:color color] 
  #:skip real? 
  [#:x-min x-min 
  #:label label 
  #:blank-some-labels blank-some-labels? 
  #:x-value-formatter formatter]) 
  (treeof renderer2d?)
  histogram : histogram/c
  color : any/c = #f
  real? : (discrete-histogram-skip)
  x-min : real? = 0
  label : string? = #f
  blank-some-labels? : boolean? = #t
  formatter : (or/c #f (-> number? string?)) = #f
Create a histogram plot renderer from data, which is a histogram created by df-histogram.

color determines the color of the histogram bars.

label specifies the label to use for this plot renderer.

skip and x-min are used to plot dual histograms, see histogram-renderer/dual.

All the above arguments are sent directly to the discrete-histogram

blank-some-labels?, controls if some of the labels are blanked out if the plot contains too many values, this can produce a nicer looking plot.

formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatter. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".

procedure

(combine-histograms h1 h2)  combined-histogram/c

  h1 : histogram/c
  h2 : histogram/c
Combine two histograms produced by df-histogram into a single one. The result of this function is intended to be passed to histogram-renderer/dual.

procedure

(histogram-renderer/dual combined-histogram 
  label1 
  label2 
  [#:color1 color1 
  #:color2 color2 
  #:x-value-formatter formatter]) 
  (treeof renderer2d?)
  combined-histogram : combined-histogram/c
  label1 : string?
  label2 : string?
  color1 : any/c = #f
  color2 : any/c = #f
  formatter : (or/c #f (-> number? string?)) = #f
Create a plot renderer that shows two histograms, with each slot side-by-side. The histograms can be produced by df-histogram and combined by combined-histogram.

label1 and color1 represent the label and colors for the first histogram, label2 and color2 represent the label and colors to use for the second histogram.

formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatter. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".

procedure

(histogram-renderer/factors histogram 
  factor-fn 
  factor-colors 
  [#:x-value-formatter formatter]) 
  (treeof renderer2d?)
  histogram : histogram/c
  factor-fn : (-> real? symbol?)
  factor-colors : (listof (cons/c symbol? color/c))
  formatter : (or/c #f (-> number? string?)) = #f
Create a histogram renderer where histogram is split into sections by factor-fn and each section is colored according to factor-colors.

formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatter. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".