data-frame
1 Creating data frames
data-frame?
series?
make-data-frame
make-series
df-add-series!
df-del-series!
df-rename-series!
df-add-derived!
df-add-lazy!
df-row-count
df-series-names
df-contains?
df-contains/any?
df-duplicate-series
df-set-sorted!
df-is-sorted?
df-set-contract!
df-shallow-copy
for/data-frame
for*/data-frame
2 Loading and Saving Data
df-read/csv
df-write/csv
df-read/sql
3 Inspecting data series
df-describe
4 Working With Properties
df-property-names
df-put-property!
df-get-property
df-del-property!
5 NA Values
df-count-na
df-is-na?
df-has-na?
df-has-non-na?
df-na-value
6 Get and Set Individual Values
df-ref
df-ref*
df-set!
7 Indexing And Row Lookup
df-add-index!
df-add-index*!
df-del-index!
df-index-names
df-index-series
df-index-of
df-index-of*
df-equal-range
df-all-indices-of
df-lookup
df-lookup*
df-lookup/interpolated
8 Extracting Data
df-select
df-select*
df-select/by-index
df-select/by-index*
df-select*/by-index
df-select*/by-index*
valid-only
9 Iterating Over Rows
df-map
df-map/by-index
df-map/by-index*
df-for-each
df-for-each/by-index
df-for-each/by-index*
df-fold
df-fold/by-index
df-fold/by-index*
in-data-frame
in-data-frame/as-list
in-data-frame/as-vector
in-data-frame/by-index
in-data-frame/by-index/as-list
in-data-frame/by-index/as-vector
in-data-frame/by-index*
in-data-frame/by-index*/as-list
in-data-frame/by-index*/as-vector
10 Statistics
df-set-default-weight-series!
df-get-default-weight-series
df-statistics
df-quantile
11 Least Squares Fitting
least-squares-fit
df-least-squares-fit
12 Histograms and histogram plots
df-histogram
histogram-renderer
combine-histograms
histogram-renderer/dual
histogram-renderer/factors
13 GPX Files
df-read/gpx
df-write/gpx
14 TCX Files
df-read/tcx
df-read/tcx/multiple
8.2

data-frame

Alex Harsányi

 (require data-frame) package: data-frame

A data frame is a data structure used to hold data in tables with rows and columns. It is meant for convenient access and manipulation of relatively large data sets that fit in the process memory. The package also provides functions for loading data into data frames and saving data from them in various formats, as well as several utilities and helper functions for statistical calculations, plotting and curve fitting.

The package was initially written to support the data processing needs of ActivityLog2, and since that application mostly deals with time series based data, the package supports data which is naturally ordered (whether by a timestamp or some other data column). However, the package does support more general data which is not necessarily ordered and also supports efficient lookups using secondary indexes, allowing multiple traversal orders to be defined for the data.

With data science a popular topic, a package named data-frame inevitably raises the question of how it compares with other implementations with similar names and descriptions. The package author does not know the answer to that question, except to say that any resemblance to other implementations is purely coincidental.

1 Creating data frames

The functions below allow constructing new data frames. They are mainly intended for writing functions that load data into data frames from different sources, rather than for direct use in other programs. To load data into data frames, see df-read/csv or df-read/sql; to create data frames manually, see for/data-frame.

procedure

(data-frame? df)  boolean?

  df : any/c
Return #t if df is a data frame.

procedure

(series? series)  boolean?

  series : any/c
Return #t if series is a data series.

procedure

(make-data-frame)  data-frame?

Return a new, empty, data frame. The data frame does not contain any series, and its size will be determined by the first series added, see also df-add-series!.

procedure

(make-series name    
  #:data data    
  #:cmpfn cmpfn    
  #:na na    
  #:contract contractfn)  series?
  name : string?
  data : vector?
  cmpfn : (or/c #f (-> any/c any/c boolean?))
  na : any/c
  contractfn : (-> any/c boolean?)
Create a new data series named name with contents from data.

cmpfn specifies an ordering function to use. If present, values can be looked up in this series using df-index-of and df-lookup. The data must be ordered according to this function.

na specifies the "not available" value for this series, by default it is #f.

contractfn is a contract function. If present, all values in the data series, except NA values, must satisfy this contract, that is, the function must return #t for all values in the series.

procedure

(df-add-series! df series)  any/c

  df : data-frame?
  series : series?
Add a new series to the data frame. If the data frame contains no other series, this series can have any number of elements, otherwise it must have the same number of elements as the other series in the data frame.

If the data frame already contains a series with the same name, that series will be replaced.

See also df-row-count and make-series.
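As a minimal sketch of the constructors described above, the following builds a two-series data frame by hand. The series names and data are made up for illustration, and every keyword argument of make-series is passed explicitly so the sketch does not rely on any default values.

```racket
#lang racket
(require data-frame)

;; Start with an empty data frame; its row count is fixed by the
;; first series that is added.
(define df (make-data-frame))

;; A series that is already ordered by <, so it can later be used
;; with df-index-of and df-lookup.
(define time-series
  (make-series "time"
               #:data (vector 0 10 20 30)
               #:cmpfn <
               #:na #f
               #:contract real?))

;; An unordered series with one NA value (#f). NA values are exempt
;; from the contract check.
(define speed-series
  (make-series "speed"
               #:data (vector 2.5 #f 3.1 2.9)
               #:cmpfn #f
               #:na #f
               #:contract real?))

(df-add-series! df time-series)
(df-add-series! df speed-series)   ; must also have 4 elements

(df-row-count df)
(sort (df-series-names df) string<?)
```

Both series have four elements, so the second df-add-series! call satisfies the equal-length requirement.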

procedure

(df-del-series! df name)  any/c

  df : data-frame?
  name : string?
Remove the series named name from the data frame df. Does nothing if the series does not exist.

procedure

(df-rename-series! df old-name new-name)  any/c

  df : data-frame?
  old-name : string?
  new-name : string?
Rename a series in the data frame df from old-name to new-name. A series named old-name must exist, otherwise an error is signaled.

This operation will materialize all lazy series (see df-add-lazy!), making it a possibly costly operation if you have lazy series. Indices using old-name will also be updated to use new-name, but this operation is fast, and no re-indexing will be done.

procedure

(df-add-derived! df    
  name    
  base-series    
  value-fn)  any/c
  df : data-frame?
  name : string?
  base-series : (listof string?)
  value-fn : mapfn/c
Add a new series named name to the data frame df with values that are computed from existing series. The data for the series is created using df-map by applying value-fn on base-series and the resulting data is added to the data frame. See df-map for notes on the value-fn.

If a series named name already exists in the data frame, it will be replaced.
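The following sketch adds a derived series computed from two existing ones. The series names are invented for the example, and row->list is a hypothetical helper (not part of the package) that lets the value function accept a row whether it arrives as a vector or as a list.

```racket
#lang racket
(require data-frame)

;; Hypothetical helper: normalize a row to a list, so the sketch does
;; not depend on the exact row representation.
(define (row->list row)
  (if (vector? row) (vector->list row) row))

(define df
  (for/data-frame (width height)
                  ([w (in-list '(1 2 3))]
                   [h (in-list '(10 20 30))])
    (values w h)))

;; value-fn takes one argument: the values of the base series for the
;; current row, in the order given by base-series.
(df-add-derived! df "area" '("width" "height")
                 (lambda (row)
                   (match-define (list w h) (row->list row))
                   (* w h)))

(df-select df "area")
```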

procedure

(df-add-lazy! df name base-series value-fn)  any/c

  df : data-frame?
  name : string?
  base-series : (listof string?)
  value-fn : mapfn/c
Add a new series to the data frame, same as df-add-derived!, but delay creating it until it is referenced. This function allows adding many series to a data frame, with the expectation that the cost to create those series is paid when (and if) they are used.

See df-add-derived! for the parameter names.

procedure

(df-row-count df)  exact-nonnegative-integer?

  df : data-frame?
Return the number of rows in the data frame df. All series inside the data frame have the same number of rows.

procedure

(df-series-names df)  (listof string?)

  df : data-frame?
Return a list of the names of all series in the data frame df. The names are returned in an unspecified order.

procedure

(df-contains? df series ...)  boolean?

  df : (data-frame?)
  series : string?

procedure

(df-contains/any? df series ...)  boolean?

  df : (data-frame?)
  series : string?
Return #t if the data frame df contains the series specified as arguments. df-contains? returns #t only if all the series are present, while df-contains/any? returns #t if any of the specified series are present.

procedure

(df-duplicate-series df name)  series?

  df : data-frame?
  name : string?
Return a new series by copying the series name from the data frame df. All data and properties (like sort order, NA value and contract) are copied. The new series is intended to be added to another data frame using df-add-series!.

If the series with name name is delayed, see df-add-lazy!, the series will be materialized first.

procedure

(df-set-sorted! df name cmpfn)  any/c

  df : data-frame?
  name : string?
  cmpfn : (or/c #f (-> any/c any/c boolean?))
Mark an already sorted series name in the data frame df as sorted according to cmpfn. Marking a series as sorted allows it to be used for index lookup by df-index-of and df-lookup. An error is raised if the series is not actually sorted or if it contains NA values.

If the data in the series is not already sorted, and you want to lookup values using df-index-of or df-lookup, consider adding a secondary index using df-add-index!.

procedure

(df-is-sorted? df series)  boolean?

  df : data-frame?
  series : string?
Return #t if the given series in the data-frame df is sorted, that is df-set-sorted! has been called and succeeded for this series.

procedure

(df-set-contract! df name contractfn)  any/c

  df : data-frame?
  name : string?
  contractfn : (or/c #f (-> any/c boolean?))
Set the contract for values in the data frame df series name to contractfn. An exception is thrown if not all values in the series match contractfn or are NA. The contractfn need not return #t for the NA value.

procedure

(df-shallow-copy df)  data-frame

  df : data-frame?
Create a copy of df. The returned copy references the same data series objects (and properties) as the original. Add and delete operations, for both series and properties, only affect the copy. Operations on the shared series themselves (like setting a contract) affect both data frames.
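A short sketch of the copy semantics described above, using made-up series and property names: deleting a series or adding a property on the copy leaves the original untouched.

```racket
#lang racket
(require data-frame)

(define df
  (for/data-frame (x y)
                  ([i (in-range 3)])
    (values i (* 2 i))))

(define copy (df-shallow-copy df))

;; Structural changes are made to the copy only.
(df-del-series! copy "y")
(df-put-property! copy 'note "copy only")

(df-contains? df "y")        ; the original still has the series
(df-contains? copy "y")      ; the copy does not
(df-get-property df 'note)   ; the original has no such property
```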

syntax

(for/data-frame (series-name ...) (for-clause ...) body-or-break ... body)

syntax

(for*/data-frame (series-name ...) (for-clause ...) body-or-break ... body)

Construct a new data-frame with the given series-names, by producing values in the for loop, row by row. The constructs iterate like for or for*, and each body must evaluate to a set of values that corresponds to each series-name, in order.

Examples:
> (define df
    (for/data-frame (ints strs)
                    ([int (in-range 5)]
                     [str (in-list (list "a" "b" "c" "d" "e"))])
      (values int str)))
> (df-select df "ints")

'#(0 1 2 3 4)

> (df-select df "strs")

'#("a" "b" "c" "d" "e")

Examples:
> (define df
    (for*/data-frame (ints strs)
                     ([int (in-range 2)]
                      [str (in-list (list "a" "b"))])
      (values int str)))
> (df-select df "ints")

'#(0 0 1 1)

> (df-select df "strs")

'#("a" "b" "a" "b")

2 Loading and Saving Data

The functions below construct data frames by loading data from CSV files or by running an SQL query, and write data frames out to CSV files.

procedure

(df-read/csv input    
  [#:headers? headers?]    
  #:na na    
  [#:quoted-numbers? quoted-numbers?])  data-frame?
  input : (or/c path-string? input-port?)
  headers? : boolean? = #t
  na : (or/c string? (-> string? boolean?)) = ""
  quoted-numbers? : boolean? = #f
Read CSV data into a data frame from input, which is either a port or a string, in which case it is assumed to be a file name. If headers? is true, the first row in input becomes the names of the columns, otherwise the columns will be named "col1", "col2", etc. The first row defines the number of columns: if subsequent rows have fewer cells, they are padded with #f; if they have more, they are silently truncated.

na is the value in the CSV file that represents the "not available" value in the data frame. Strings equal? to this value will be replaced by #f. Alternatively, this can be a function which tests a string and returns #t if the string represents an NA value.

When quoted-numbers? is #t, all quoted values in the CSV file will be converted to numbers, if possible. E.g. a value like "123" will be converted to the number 123 if quoted-numbers? is #t, but will remain the string "123" if the parameter is #f.

procedure

(df-write/csv df    
  output    
  #:start start    
  #:stop stop    
  series ...)  any/c
  df : data-frame?
  output : (or/c path-string? output-port?)
  start : exact-nonnegative-integer?
  stop : exact-nonnegative-integer?
  series : string?
Write the data frame df to output which is either an output port or a string, in which case it is assumed to be a file name. The series to be written out can be specified as the series list. If no series are specified, all series in the data frame are written out as columns in an unspecified order.

start and stop denote the beginning and end rows to be written out, by default all rows are written out.
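As a rough sketch of a CSV round trip, the example below reads from a string port and writes to another string port, so no files are involved. The CSV text is made up, and all keyword arguments are passed explicitly rather than relying on defaults.

```racket
#lang racket
(require data-frame)

;; Hypothetical CSV content with a header row and two data rows.
(define csv-text "a,b\n1,2\n3,4\n")

(define df
  (df-read/csv (open-input-string csv-text)
               #:headers? #t
               #:na ""
               #:quoted-numbers? #f))

;; Write both series back out, covering all rows.
(define out (open-output-string))
(df-write/csv df out
              #:start 0
              #:stop (df-row-count df)
              "a" "b")

(display (get-output-string out))
```

Naming the series explicitly in df-write/csv fixes the column order, which is otherwise unspecified.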

procedure

(df-read/sql db query param ...)  data-frame?

  db : connection?
  query : (or/c string? virtual-statement?)
  param : any/c
Create a data frame from the result of running query on the database db with the supplied list of parameters. Each column from the result set will become a series in the data frame; sql-null values will be converted to #f.

3 Inspecting data series

procedure

(df-describe df)  any/c

  df : data-frame?
Print a nice description of df to the current-output-port. This function is useful in interactive mode to quickly check the series and properties available in a data frame.

4 Working With Properties

A data frame can have arbitrary data attached to it in the form of key-value pairs, where the keys are symbols. This is useful for attaching additional meta-data to data frames. The functions below allow working with properties.

procedure

(df-property-names df)  (listof symbol?)

  df : data-frame?
Return the property names in the data frame df, as a list of symbols. The names are returned in an unspecified order.

procedure

(df-put-property! df key value)  any/c

  df : data-frame?
  key : symbol?
  value : any/c
Set the property key to value inside the data frame df. If there is already a value for the property key, it is replaced.

procedure

(df-get-property df key [default])  any/c

  df : data-frame?
  key : symbol?
  default : any/c = (lambda () #f)
Return the value for the property key in the data frame df. If there is no value for key, the default function is called to return a value (the default just returns #f).

procedure

(df-del-property! df key)  any/c

  df : data-frame?
  key : symbol?
Delete the value for the property key from the data frame df. Does nothing if there is no value for the property key.
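The property functions can be exercised as in the sketch below. The property keys and values are invented for illustration. Note that the default argument to df-get-property is a function of no arguments, not a plain value.

```racket
#lang racket
(require data-frame)

(define df (make-data-frame))

(df-put-property! df 'sport "running")
(df-put-property! df 'sport "cycling")   ; replaces the previous value

(df-get-property df 'sport)

;; The default is a thunk, called only when the key is missing.
(df-get-property df 'missing (lambda () 'none))

(df-del-property! df 'sport)
(df-del-property! df 'sport)             ; no effect the second time
(df-property-names df)
```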

5 NA Values

Data series support the concept of a value being "not available". This is done using a special value, usually #f, which can be different for each data series. The functions below allow working with “NA” values. The NA value is specified when the series is created using make-series.

procedure

(df-count-na df series)  exact-nonnegative-integer?

  df : data-frame?
  series : string?
Return the number of “NA” values in the series.

procedure

(df-is-na? df series value)  boolean?

  df : data-frame?
  series : string?
  value : any/c
Return #t if value is equal? to the “NA” value in the series. Each series in a data frame can have a different “Not available” value, but this value usually defaults to #f.

procedure

(df-has-na? df series)  boolean?

  df : data-frame?
  series : string?
Return #t if series has any “NA” values.

procedure

(df-has-non-na? df series)  boolean?

  df : data-frame?
  series : string?
Return #t if series has any values outside the “NA” values.

procedure

(df-na-value df series)  any/c

  df : data-frame?
  series : string?
Return the “NA” value for the series in the data frame df.
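A small sketch of the NA helpers, assuming a series built with for/data-frame uses the usual #f as its NA value. The series name and data are made up.

```racket
#lang racket
(require data-frame)

;; Two of the four heart rate samples are missing.
(define df
  (for/data-frame (hr)
                  ([v (in-list '(120 #f 125 #f))])
    v))

(df-count-na df "hr")      ; number of NA values
(df-has-na? df "hr")       ; at least one NA value?
(df-has-non-na? df "hr")   ; at least one actual value?
(df-is-na? df "hr" #f)     ; is #f the NA value of this series?
(df-na-value df "hr")
```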

6 Get and Set Individual Values

procedure

(df-ref df position series)  any/c

  df : data-frame?
  position : index/c
  series : string?

procedure

(df-ref* df position series ...)  vector?

  df : data-frame?
  position : index/c
  series : string?
Return the value at position for series in the data frame df. The second form allows referencing values from multiple series, and a vector containing the values is returned in this case.

procedure

(df-set! df position value series)  any/c

  df : data-frame?
  position : index/c
  value : any/c
  series : string?
Update the value at position in the series to value. The new value must keep the series sorted, if the series is sorted, and match the series contract, if a contract has been set for the series.
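The accessors above can be sketched as follows, with invented series names. Note the argument order of df-set!, where the value comes before the series name.

```racket
#lang racket
(require data-frame)

(define df
  (for/data-frame (x y)
                  ([i (in-range 3)])
    (values i (* 10 i))))

(df-ref df 1 "y")          ; a single value
(df-ref* df 2 "x" "y")     ; a vector with one value per series

;; Arguments are position, value, then series name.
(df-set! df 1 99 "y")
(df-ref df 1 "y")
```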

7 Indexing And Row Lookup

procedure

(df-add-index! df    
  name    
  series    
  lt    
  [#:na-in-front? na-in-front?])  any/c
  df : data-frame?
  name : string?
  series : string?
  lt : (-> any/c any/c boolean?)
  na-in-front? : boolean? = #f

procedure

(df-add-index*! df    
  name    
  series    
  lt    
  [#:na-in-front? na-in-front?])  any/c
  df : data-frame?
  name : string?
  series : (listof string?)
  lt : (listof (-> any/c any/c boolean?))
  na-in-front? : boolean? = #f
Add a secondary index to the data frame df named name – if an index by that name already exists, it will be replaced. A secondary index will allow fast lookups (see df-index-of and df-lookup) and iteration (see in-data-frame/by-index) in the order defined by the ordering function. Multiple indexes can be defined for a data frame for one or more columns and they will be used as needed.

df-add-index! will create an index on a single series and use the lt function for comparing elements. This function must provide a strict less-than ordering: suitable values would be < for numbers and string<? for strings, although any function can be defined.

The na-in-front? determines where the “NA” values in the series are placed. If it is #t, they are placed before all other values, otherwise they are placed at the end.

df-add-index*! will define a multi-column index on all the series specified as a list. The lt parameter for this function is a list of comparison functions, one for each column. Such an index will sort by the first column, then, for all equal values in the first column, it will sort on the second column, and so on.

A multi-column index is mostly used for defining a multi-column iteration order, however, such an index can still be used for fast lookup of elements in the first indexed series.

procedure

(df-del-index! df name)  any/c

  df : data-frame?
  name : string?
Delete the index named name from the data frame df. Does nothing if an index by this name does not exist.

procedure

(df-index-names df)  (listof string?)

  df : data-frame?
Return the list of index names defined for the data frame df. The order of the index names is undefined.

procedure

(df-index-series df name)  (listof string?)

  df : data-frame?
  name : string?
Return the list of series names indexed by the index name in the data frame df. The series names are returned in the order in which they are indexed.

procedure

(df-index-of df    
  series    
  value    
  #:exact-match? exact-match?)  index/c
  df : data-frame?
  series : string?
  value : any/c
  exact-match? : #f

procedure

(df-index-of* df    
  series    
  #:exact-match? exact-match?    
  value ...)  (listof index/c)
  df : data-frame?
  series : string?
  exact-match? : #f
  value : any/c
Find the position of a value or list of values in a series of the data frame df. Returns either a single value or a list of values.

The series must either be sorted, see df-set-sorted!, or an index must be defined for it, see df-add-index! and df-add-index*!, otherwise the calls will raise an error.

exact-match? defines what to do when the value(s) are not found in the data series. If it is #t and the value is not found, the functions return #f.

If exact-match? is #f, the value need not be present in the series. In that case, the returned index is the position of the first element which comes after the value, according to the sort function. This is the position where value could be inserted and still keep the series sorted. A value of 0 is returned if value is less than or equal to the first value of the series, and a value of (df-row-count df) is returned if the value is greater than all the values in series.
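The lookup rules above can be illustrated with a small sorted series. This is a sketch with made-up data. The #:exact-match? keyword is always passed explicitly, and the commented results follow from the description above rather than from a recorded run.

```racket
#lang racket
(require data-frame)

(define df
  (for/data-frame (timestamp)
                  ([v (in-list '(10 20 30 40))])
    v))

;; The data is already in ascending order, so it can be marked sorted.
(df-set-sorted! df "timestamp" <)

(df-index-of df "timestamp" 25 #:exact-match? #f)   ; insertion point: 2
(df-index-of df "timestamp" 5 #:exact-match? #f)    ; before everything: 0
(df-index-of df "timestamp" 50 #:exact-match? #f)   ; past the end: 4
(df-index-of df "timestamp" 25 #:exact-match? #t)   ; not present: #f

;; Several values at once; the result is a list of positions.
(df-index-of* df "timestamp" #:exact-match? #f 5 35)
```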

procedure

(df-equal-range df series value)  
index/c index/c
  df : data-frame?
  series : string?
  value : any/c
Find the lower bound (inclusive) and upper bound (exclusive) of the positions where value appears in series of the data frame df, and return them as two values. This is useful when a series contains multiple occurrences of value and you want to find all of them. Because the series must be sorted, the occurrences form a contiguous range rather than a scattered collection of indices.

The series must be sorted (see df-set-sorted!), otherwise an error is raised.

The given value need not be present in the series. In that case, the lower bound and upper bound are the same and represent the position of the first element which comes after value, according to the sort function. This is the position where value could be inserted while keeping the series sorted.

procedure

(df-all-indices-of df series value)  (listof index/c)

  df : data-frame?
  series : string?
  value : any/c
Return the list of positions where value is found in the series of the data frame df. Returns an empty list if value does not exist.

The series must be either sorted, see df-set-sorted!, or have an index defined for it, see df-add-index! and df-add-index*!, otherwise an error is reported.
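As a sketch of looking up repeated values through a secondary index, the example below indexes an unsorted series that contains duplicates. The series and index names are made up. The sketch assumes the returned positions are row positions in the data frame, and sorts them because no order is promised.

```racket
#lang racket
(require data-frame)

;; The "lap" series is not sorted and contains repeated values.
(define df
  (for/data-frame (lap)
                  ([v (in-list '(2 1 2 3 2))])
    v))

;; A secondary index makes the series searchable without sorting it.
(df-add-index! df "by-lap" "lap" <)

(define positions (sort (df-all-indices-of df "lap" 2) <))
positions

(df-all-indices-of df "lap" 7)   ; value not present: empty list
```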

procedure

(df-lookup df    
  base-series    
  series    
  value    
  #:exact-match? exact-match?)  any/c
  df : data-frame?
  base-series : string?
  series : (or/c string? (listof string?))
  value : any/c
  exact-match? : #f

procedure

(df-lookup* df    
  base-series    
  series    
  #:exact-match? exact-match?    
  value ...)  list?
  df : data-frame?
  base-series : string?
  series : (or/c string? (listof string?))
  exact-match? : #f
  value : any/c
Look up the index for value in base-series and return the corresponding value in series. If series is a single string, a single value is returned; if it is a list of names, a list of values is returned.

df-lookup* allows looking up multiple values and will return a list of the corresponding values.

These functions combine df-index-of and df-ref into a single function and have the same restrictions as df-index-of: the series must either be sorted or have an index defined for it.

exact-match? has the same meaning as for df-index-of.
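A brief sketch of df-lookup and df-lookup*, using a sorted base series and a single target series. The names and data are invented for the example.

```racket
#lang racket
(require data-frame)

(define df
  (for/data-frame (timestamp hr)
                  ([t (in-list '(0 10 20 30))]
                   [h (in-list '(100 110 120 130))])
    (values t h)))

;; The base series must be sorted (or indexed) before it can be searched.
(df-set-sorted! df "timestamp" <)

;; Find the row where "timestamp" is 20 and return its "hr" value.
(df-lookup df "timestamp" "hr" 20 #:exact-match? #t)

;; Look up several timestamps at once; the result is a list.
(df-lookup* df "timestamp" "hr" #:exact-match? #t 0 30)
```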

procedure

(df-lookup/interpolated df    
  base-series    
  series    
  value    
  [#:interpolate interpolate])  any/c
  df : data-frame?
  base-series : string?
  series : (or/c string? (listof string?))
  value : any/c
  interpolate : (-> real? any/c any/c any/c)
              = (lambda (t v1 v2) (+ (* t v1) (* (- 1 t) v2)))
Perform an interpolated lookup: same as df-lookup, but if value is not found exactly in base-series, its relative position is determined and used to interpolate values from the corresponding series. This only works for sorted series, see df-set-sorted!.

An interpolation function can be specified if the default one is not sufficient. This function is called once for each resulting value (i.e. it interpolates values one by one).

8 Extracting Data

procedure

(df-select df    
  series    
  [#:filter filter    
  #:start start    
  #:stop stop])  vector?
  df : data-frame?
  series : string?
  filter : (or/c #f (-> any/c any/c)) = #f
  start : index/c = 0
  stop : index/c = (df-row-count df)

procedure

(df-select* df    
  [#:filter filter    
  #:start start    
  #:stop stop]    
  series ...)  vector?
  df : data-frame?
  filter : (or/c #f (-> any/c any/c)) = #f
  start : index/c = 0
  stop : index/c = (df-row-count df)
  series : string?
df-select returns a vector with the values in the series series from the data frame df, while df-select* returns a vector where each element is a vector containing values from one or more series specified as arguments.

start and stop indicate the first row and one past the last row to be selected. filter, when present, will filter the values selected: only values for which the function returns #t will be added to the resulting vector.

If there is no filter specified, the resulting vector will have (- stop start) elements. If there is a filter, the number of elements depends on how many are filtered out by this function.

procedure

(df-select/by-index df    
  series    
  #:index index-name    
  [#:from from    
  #:to to    
  #:filter filter])  vector?
  df : data-frame?
  series : string?
  index-name : string?
  from : any/c = #f
  to : any/c = #f
  filter : (or/c #f (-> any/c any/c)) = #f

procedure

(df-select/by-index* df    
  series    
  #:index index-name    
  [#:from from    
  #:to to    
  #:filter filter])  vector?
  df : data-frame?
  series : string?
  index-name : string?
  from : (listof any/c) = #f
  to : (listof any/c) = #f
  filter : (or/c #f (-> any/c any/c)) = #f

procedure

(df-select*/by-index df    
  #:index index-name    
  [#:from from    
  #:to to    
  #:filter filter]    
  series ...)  vector?
  df : data-frame?
  index-name : string?
  from : any/c = #f
  to : any/c = #f
  filter : (or/c #f (-> any/c any/c)) = #f
  series : string?

procedure

(df-select*/by-index* df    
  #:index index-name    
  [#:from from    
  #:to to    
  #:filter filter]    
  series ...)  vector?
  df : data-frame?
  index-name : string?
  from : (listof any/c) = #f
  to : (listof any/c) = #f
  filter : (or/c #f (-> any/c any/c)) = #f
  series : string?
Same as df-select and df-select*, but these functions return elements in the order defined by the index index-name, and elements are returned between the rows defined by the first occurrence of from and the last occurrence of to in the indexed series (the indexed series can contain duplicates).

HINT: to select data from all rows where a column has a specified value, define an index for the column and select using the same value for both from and to.

A multi-column index can also be iterated using a single value, in which case only the first column of the index is used. For example, if you have a data frame with Country, City and CityPopulation series, you can define an index on Country and CityPopulation, then select City and CityPopulation on that index with a specific country as the from and to arguments. This will return all cities in that country ordered by their population.

The by-index* versions of these functions allow specifying a multi-value key for a multi-column index.
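The country and city scenario described above might look like the sketch below. The data, series names and index names are all made up for illustration, and the commented results follow from the description rather than from a recorded run.

```racket
#lang racket
(require data-frame)

(define df
  (for/data-frame (country city population)
                  ([c (in-list '("NZ" "AU" "NZ" "AU"))]
                   [n (in-list '("Auckland" "Perth" "Wellington" "Sydney"))]
                   [p (in-list '(1600000 2100000 210000 5300000))])
    (values c n p)))

;; Multi-column index: order by country first, then by population.
(df-add-index*! df "by-country-pop"
                '("country" "population")
                (list string<? <))

;; Same value for #:from and #:to selects every row for that country,
;; ordered by the second index column.
(define nz-cities
  (df-select/by-index df "city"
                      #:index "by-country-pop"
                      #:from "NZ"
                      #:to "NZ"))
nz-cities   ; Wellington first, then Auckland

;; A single-column index gives a plain ordering over all rows.
(df-add-index! df "by-pop" "population" <)
(define by-size (df-select/by-index df "city" #:index "by-pop"))
by-size
```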

procedure

(valid-only item)  boolean?

  item : any/c
A small utility function that can be used as an argument to the filter argument to the select functions, to select only rows which have actual data (i.e. not NA values). This function assumes that the NA value is always #f.

The function returns #t if all elements of item (which can be a vector or a list) are not #f.
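A minimal sketch of valid-only used as a filter, with invented coordinate data. It uses df-select* so that each item passed to the filter is a vector holding one row.

```racket
#lang racket
(require data-frame)

;; The second and third rows each have one missing coordinate.
(define df
  (for/data-frame (lat lon)
                  ([a (in-list '(1.0 #f 3.0))]
                   [b (in-list '(10.0 20.0 #f))])
    (values a b)))

;; Keep only the rows where every selected value is present.
(define complete-rows
  (df-select* df "lat" "lon" #:filter valid-only))

complete-rows
```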

9 Iterating Over Rows

procedure

(df-map df    
  series    
  fn    
  [#:start start    
  #:stop stop])  vector?
  df : data-frame?
  series : (or/c string? (listof string?))
  fn : mapfn/c
  start : index/c = 0
  stop : index/c = (df-row-count df)

procedure

(df-map/by-index df    
  series    
  fn    
  #:index index-name    
  [#:from from    
  #:to to])  vector?
  df : data-frame?
  series : (or/c string? (listof string?))
  fn : mapfn/c
  index-name : string?
  from : any/c = #f
  to : any/c = #f

procedure

(df-map/by-index* df    
  series    
  fn    
  #:index index-name    
  [#:from from    
  #:to to])  vector?
  df : data-frame?
  series : (or/c string? (listof string?))
  fn : mapfn/c
  index-name : string?
  from : (listof any/c) = #f
  to : (listof any/c) = #f
Apply the function fn over rows in the specified series and return a vector of the values that fn returns.

fn is a function of either one or two arguments. If fn is a function of one argument, it is called with the values from all series as a single vector. If fn is a function of two arguments, it is called with the previous and current set of values, as vectors (this allows calculating "delta" values), i.e. fn is invoked as (fn prev current). If fn accepts two arguments, it will be invoked as (fn #f current) for the first element of the iteration.

df-map will iterate over rows in the data frame between start and stop positions, while df-map/by-index and df-map/by-index* will iterate in the order defined by the index index-name between from and to values in the indexed series.

See df-select/by-index* for a discussion on the differences between by-index and by-index* versions.
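The two-argument form of fn can be sketched as a "delta" computation over a cumulative distance series. The data is made up, and row->list is a hypothetical helper (not part of the package) so that the sketch works whether rows arrive as vectors or as lists.

```racket
#lang racket
(require data-frame)

;; Hypothetical helper: normalize a row to a list.
(define (row->list row)
  (if (vector? row) (vector->list row) row))

(define df
  (for/data-frame (dist)
                  ([d (in-list '(0 5 12 20))])
    d))

;; prev is #f for the first row, so that row gets a delta of 0.
(define deltas
  (df-map df '("dist")
          (lambda (prev current)
            (if prev
                (- (first (row->list current))
                   (first (row->list prev)))
                0))))

deltas
```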

procedure

(df-for-each df    
  series    
  fn    
  [#:start start    
  #:stop stop])  void
  df : data-frame?
  series : (or/c string? (listof string?))
  fn : mapfn/c
  start : index/c = 0
  stop : index/c = (df-row-count df)

procedure

(df-for-each/by-index df    
  series    
  fn    
  #:index index-name    
  [#:from from    
  #:to to])  void
  df : data-frame?
  series : (or/c string? (listof string?))
  fn : mapfn/c
  index-name : string?
  from : any/c = #f
  to : any/c = #f

procedure

(df-for-each/by-index* df    
  series    
  fn    
  #:index index-name    
  [#:from from    
  #:to to])  void
  df : data-frame?
  series : (or/c string? (listof string?))
  fn : mapfn/c
  index-name : string?
  from : (listof any/c) = #f
  to : (listof any/c) = #f
Same as df-map and its variants, but the result of calling fn is discarded and the function returns nothing.

procedure

(df-fold df    
  series    
  init-value    
  fn    
  [#:start start    
  #:stop stop])  any/c
  df : data-frame?
  series : (or/c string? (listof string?))
  init-value : any/c
  fn : foldfn/c
  start : index/c = 0
  stop : index/c = (df-row-count df)

procedure

(df-fold/by-index df    
  series    
  init-value    
  fn    
  #:index index-name    
  [#:from from    
  #:to to])  any/c
  df : data-frame?
  series : (or/c string? (listof string?))
  init-value : any/c
  fn : foldfn/c
  index-name : string?
  from : any/c = #f
  to : any/c = #f

procedure

(df-fold/by-index* df    
  series    
  init-value    
  fn    
  #:index index-name    
  [#:from from    
  #:to to])  any/c
  df : data-frame?
  series : (or/c string? (listof string?))
  init-value : any/c
  fn : foldfn/c
  index-name : string?
  from : (listof any/c) = #f
  to : (listof any/c) = #f
Fold the function fn over rows in the specified series. init-value is the initial value for the fold operation. The last value returned by fn is returned by the folding function.

fn is a function of either two or three arguments. If fn is a function of two arguments, it is called with the fold value plus the values from all series, passed in as a single vector. If fn is a function of three arguments, it is called with the fold value plus the previous and current set of values, as vectors (this allows calculating "delta" values), i.e. fn is invoked as (fn val prev current). If fn accepts three arguments, it will be invoked as (fn init-value #f current) for the first element of the iteration.

df-fold will iterate over rows in the data frame between start and stop positions, while df-fold/by-index and df-fold/by-index* will iterate in the order defined by the index index-name between from and to values in the indexed series.

See df-select/by-index* for a discussion on the differences between by-index and by-index* versions.
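Both forms of the fold function are sketched below on made-up data: a two-argument fn that sums a series, and a three-argument fn that tracks the largest step between consecutive rows. row->list is a hypothetical helper (not part of the package) that accepts a row as either a vector or a list.

```racket
#lang racket
(require data-frame)

;; Hypothetical helper: normalize a row to a list.
(define (row->list row)
  (if (vector? row) (vector->list row) row))

(define df
  (for/data-frame (dist)
                  ([d (in-list '(0 5 12 20))])
    d))

;; Two-argument form: accumulator plus the current row.
(define total
  (df-fold df '("dist") 0
           (lambda (acc row)
             (+ acc (first (row->list row))))))

;; Three-argument form: accumulator, previous row (#f at the start)
;; and current row.
(define largest-step
  (df-fold df '("dist") 0
           (lambda (best prev current)
             (if prev
                 (max best (- (first (row->list current))
                              (first (row->list prev))))
                 best))))

total
largest-step
```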

procedure

(in-data-frame df    
  [#:start start    
  #:stop stop]    
  series ...)  sequence?
  df : data-frame?
  start : index/c = 0
  stop : index/c = (df-row-count df)
  series : string?

procedure

(in-data-frame/as-list df    
  [#:start start    
  #:stop stop]    
  series ...)  sequence?
  df : data-frame?
  start : index/c = 0
  stop : index/c = (df-row-count df)
  series : string?

procedure

(in-data-frame/as-vector df    
  [#:start start    
  #:stop stop]    
  series ...)  sequence?
  df : data-frame?
  start : index/c = 0
  stop : index/c = (df-row-count df)
  series : string?
Return a sequence that produces values from a list of series between start and stop rows. The sequence produces values, each one corresponding to one of the series.

This is intended to be used in for and related constructs to iterate over elements in the data frame:

(for (([lat lon] (in-data-frame df "lat" "lon")))
  (printf "lat = ~a, lon = ~a~%" lat lon))

The in-data-frame/as-list and in-data-frame/as-vector variants work the same, but they produce a single value, a list or a vector containing a row of values from the series:

(for ((coord (in-data-frame/as-list df "lat" "lon")))
   (match-define (list lat lon) coord)
   (printf "lat = ~a, lon = ~a~%" lat lon))
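The #:start and #:stop arguments and the vector variant can be sketched as below. The data is made up, and the example assumes that stop is exclusive, as its default of (df-row-count df) suggests.

```racket
#lang racket
(require data-frame)

;; Made-up coordinates, used only for illustration.
(define df (make-data-frame))
(df-add-series! df (make-series "lat" #:data (vector 10.0 10.1 10.2 10.3)))
(df-add-series! df (make-series "lon" #:data (vector 20.0 20.1 20.2 20.3)))

;; Iterate over a sub-range of rows, binding one variable per series.
(define middle
  (for/list ([(lat lon) (in-data-frame df #:start 1 #:stop 3 "lat" "lon")])
    (cons lat lon)))

;; Iterate over all rows, receiving each row as a single vector.
(define rows
  (for/list ([row (in-data-frame/as-vector df "lat" "lon")])
    row))

(printf "middle rows: ~a\n" middle)
(printf "all rows: ~a\n" rows)
```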

procedure

(in-data-frame/by-index df    
  #:index index-name    
  [#:from from    
  #:to to]    
  series ...)  sequence?
  df : data-frame?
  index-name : string?
  from : any/c = #f
  to : any/c = #f
  series : string?

procedure

(in-data-frame/by-index/as-list df    
  #:index index-name    
  [#:from from    
  #:to to]    
  series ...)  sequence?
  df : data-frame?
  index-name : string?
  from : any/c = #f
  to : any/c = #f
  series : string?

procedure

(in-data-frame/by-index/as-vector df 
  #:index index-name 
  [#:from from 
  #:to to] 
  series ...) 
  sequence?
  df : data-frame?
  index-name : string?
  from : any/c = #f
  to : any/c = #f
  series : string?

procedure

(in-data-frame/by-index* df    
  #:index index-name    
  [#:from from    
  #:to to]    
  series ...)  sequence?
  df : data-frame?
  index-name : string?
  from : any/c = #f
  to : any/c = #f
  series : string?

procedure

(in-data-frame/by-index*/as-list df 
  #:index index-name 
  [#:from from 
  #:to to] 
  series ...) 
  sequence?
  df : data-frame?
  index-name : string?
  from : any/c = #f
  to : any/c = #f
  series : string?

procedure

(in-data-frame/by-index*/as-vector df 
  #:index index-name 
  [#:from from 
  #:to to] 
  series ...) 
  sequence?
  df : data-frame?
  index-name : string?
  from : any/c = #f
  to : any/c = #f
  series : string?
Same as the in-data-frame constructs, but these iterate over the index index-name between from and to values in the index.

See df-select/by-index* for a discussion on the differences between by-index and by-index* versions and on using from and to values.
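A minimal sketch of index-ordered iteration follows. It assumes that df-add-index! takes the index name, the series name and a comparison function in that order; the series and index names are made up.

```racket
#lang racket
(require data-frame)

;; Rows are stored in arbitrary order.
(define df (make-data-frame))
(df-add-series! df (make-series "name" #:data (vector "c" "a" "d" "b")))
(df-add-series! df (make-series "score" #:data (vector 3 1 4 2)))

;; Assumed argument order: index name, indexed series, ordering function.
(df-add-index! df "by-score" "score" <)

;; Visit the rows in the order defined by the index rather than in
;; storage order.  #:from and #:to could be added to restrict the range.
(define names-by-score
  (for/list ([(name score)
              (in-data-frame/by-index df #:index "by-score" "name" "score")])
    name))

(printf "names ordered by score: ~a\n" names-by-score)
```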

10 Statistics

The following functions allow calculating statistics on data frame series. They build on top of the math/statistics module.

procedure

(df-set-default-weight-series! df series)  any/c

  df : data-frame?
  series : (or/c #f string?)

procedure

(df-get-default-weight-series df)  (or/c #f string?)

  df : data-frame?
Set or return the default weight series for statistics operations. This series will be used as a weight series if none is specified for df-statistics or df-quantile. Set it to #f for no weight series to be used for statistics.

A weight series needs to be used when samples in the data frame don’t have equal weight. For example, if a parameter (e.g. heart rate) is recorded at variable intervals, simply averaging the values will not produce an accurate average. If a timer series is also present, it can be used as a weight series to produce a better average.

procedure

(df-statistics df 
  series 
  [#:weight-series weight-series 
  #:start start 
  #:stop stop]) 
  (or/c #f statistics?)
  df : data-frame?
  series : string?
  weight-series : string? = (df-get-default-weight-series df)
  start : exact-nonnegative-integer? = 0
  stop : exact-nonnegative-integer? = (df-row-count df)
Compute statistics for series in the data frame df. This calls update-statistics for the values in the series. The statistics computation will use weighting if a weight series is defined for the data frame, see df-set-default-weight-series!.
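The effect of a weight series can be sketched by computing the same statistics with and without one. The "timer" and "hr" data below is made up, and statistics-mean comes from math/statistics. No expected values are shown for the weighted case, since how the weights are derived from the weight series is not described here.

```racket
#lang racket
(require data-frame math/statistics)

;; Heart rate samples recorded at irregular times (made-up data).
(define df (make-data-frame))
(df-add-series! df (make-series "timer" #:data (vector 0.0 1.0 2.0 10.0 20.0)))
(df-add-series! df (make-series "hr" #:data (vector 100.0 110.0 120.0 150.0 150.0)))

;; No weight series: every sample counts equally.
(df-set-default-weight-series! df #f)
(define plain (df-statistics df "hr"))

;; Use "timer" as the default weight series for subsequent calls.
(df-set-default-weight-series! df "timer")
(define weighted (df-statistics df "hr"))

;; df-statistics may return #f, so check before reading the mean.
(when (and plain weighted)
  (printf "plain mean: ~a\n" (statistics-mean plain))
  (printf "weighted mean: ~a\n" (statistics-mean weighted)))
```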

procedure

(df-quantile df 
  series 
  [#:weight-series weight-series 
  #:less-than less-than] 
  qvalue ...) 
  (or/c #f (listof real?))
  df : data-frame?
  series : string?
  weight-series : string? = (df-get-default-weight-series df)
  less-than : (-> any/c any/c boolean?) = <
  qvalue : (between/c 0 1)
Return the quantiles for the series in the data frame df. The result is a list with one quantile for each qvalue. If no quantiles are specified, the list (0 0.25 0.5 0.75 1) is used. #:weight-series has the same meaning as for df-statistics, and less-than is the ordering function passed to the quantile function.
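A short sketch of both calling styles, using made-up data and no weight series:

```racket
#lang racket
(require data-frame)

(define df (make-data-frame))
(df-add-series! df (make-series "hr" #:data (vector 100.0 105.0 110.0 120.0 150.0 155.0)))

;; Make sure no weight series is used for this example.
(df-set-default-weight-series! df #f)

;; No qvalue arguments: the default set of five quantiles is returned.
(define defaults (df-quantile df "hr"))

;; Explicit quantiles: one result per requested value.
(define median+p90 (df-quantile df "hr" 0.5 0.9))

(printf "default quantiles: ~a\n" defaults)
(printf "median and 90th percentile: ~a\n" median+p90)
```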

11 Least Squares Fitting

struct

(struct least-squares-fit (type coefficients residual fn)
    #:extra-constructor-name make-least-squares-fit)
  type : (or/c 'linear 'polynomial 'power 'exponential 'logarithmic)
  coefficients : (listof real?)
  residual : (or/c #f real?)
  fn : (-> real? real?)
Return value for the df-least-squares-fit function, containing the fitting mode and coefficients for the function. The structure can be applied directly as a procedure and acts as the fit function.

procedure

(df-least-squares-fit df 
  xseries 
  yseries 
  [#:start start 
  #:stop stop 
  #:mode mode 
  #:polynomial-degree degree 
  #:residual? residual? 
  #:annealing? annealing? 
  #:annealing-iterations iterations]) 
  least-squares-fit?
  df : data-frame?
  xseries : string?
  yseries : string?
  start : exact-nonnegative-integer? = 0
  stop : exact-nonnegative-integer? = (df-row-count df)
  mode : (or/c 'linear 'polynomial 'poly 'power 'exponential 'exp 'logarithmic 'log)
   = 'linear
  degree : exact-nonnegative-integer? = 2
  residual? : boolean? = #f
  annealing? : boolean? = #f
  iterations : exact-nonnegative-integer? = 500
Return a best fit function for the xseries and yseries in the data frame df. This function returns a least-squares-fit structure instance. The instance can be applied directly as a function, being the best fit function for the input data.

start and stop specify the start and end position in the series. By default, all values are considered for the fit.

mode determines the type of the function being fitted and can have one of the following values: 'linear, 'polynomial (or 'poly), 'power, 'exponential (or 'exp), or 'logarithmic (or 'log).

residual? when #t indicates that the residual value is also returned in the least-squares-fit structure. Setting it to #f will avoid some unnecessary computations.

annealing? when #t indicates that the fit coefficients should be further refined using the annealing function. This is only used for 'exponential or 'power fit functions, as these do not produce "best fit" coefficients (I don’t know why, I am not a mathematician, I only used the formulas). Using annealing will significantly improve the fit for these functions, but will still not determine the best one. Note that the annealing algorithm is probabilistic, so applying it a second time on the same arguments will produce a slightly different result.

iterations represents the number of annealing iterations, see the #:iterations parameter to the annealing function.
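As a sketch, a linear fit on made-up data that lies exactly on y = 2x + 1 should give a fit function that predicts values close to that line. The order of the returned coefficients is not assumed here, they are only printed.

```racket
#lang racket
(require data-frame)

;; Made-up data lying exactly on the line y = 2x + 1.
(define df (make-data-frame))
(df-add-series! df (make-series "x" #:data (vector 0.0 1.0 2.0 3.0 4.0)))
(df-add-series! df (make-series "y" #:data (vector 1.0 3.0 5.0 7.0 9.0)))

;; Ask for the residual as well, which is skipped by default.
(define fit (df-least-squares-fit df "x" "y" #:mode 'linear #:residual? #t))

;; The struct fields are available through the usual accessors.
(printf "type: ~a\n" (least-squares-fit-type fit))
(printf "coefficients: ~a\n" (least-squares-fit-coefficients fit))
(printf "residual: ~a\n" (least-squares-fit-residual fit))

;; The structure itself can be applied as the fit function.
(define predicted (fit 10.0))
(printf "prediction at x = 10: ~a\n" predicted)
```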

12 Histograms and histogram plots

procedure

(df-histogram df 
  series 
  [#:weight-series weight-series 
  #:bucket-width bucket-width 
  #:trim-outliers trim-outliers 
  #:include-zeroes? include-zeroes? 
  #:as-percentage? as-percentage? 
  #:start start 
  #:stop stop]) 
  (or/c #f histogram/c)
  df : data-frame?
  series : string?
  weight-series : (or/c #f string?)
   = (df-get-default-weight-series df)
  bucket-width : real? = 1
  trim-outliers : (or/c #f (between/c 0 1)) = #f
  include-zeroes? : boolean? = #t
  as-percentage? : boolean? = #f
  start : exact-nonnegative-integer? = 0
  stop : exact-nonnegative-integer? = (df-row-count df)
Create a histogram for series from the data frame df between rows start and stop (which default to all the rows). The returned value is a vector in which each element is a vector of two values: the sample and the rank of that sample.

weight-series specifies the series to be used for weighting the samples. By default, it uses the 'weight property stored in the data-frame, see df-set-default-weight-series!. Use #f for no weighting. In this case, each sample will have a weight of 1.

bucket-width specifies the width of each histogram slot. Samples in the data series are grouped together into slots, which are from 0 to bucket-width, then from bucket-width to (* 2 bucket-width) and so on. The bucket-width value can be less than 1.0.

trim-outliers specifies that slots containing less than the specified percentage of values should be removed from both ends of the histogram. When #f, no slots are trimmed.

include-zeroes? specifies whether samples with a slot of 0 are included in the histogram or not. Note that slot 0 contains samples from 0 to bucket-width.

as-percentage? determines whether the data in the histogram represents a percentage (all ranks add up to 100) or the rank of each slot.

In the resulting histogram, samples that are numbers or strings will be sorted. In addition, if the samples are numbers, empty slots will be created so that the buckets are also consecutive.
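A minimal sketch of building and printing a histogram from made-up heart rate data is shown below. #:start and #:stop are passed explicitly so the call does not rely on their defaults, and no output is shown because the exact sample values reported for each slot are not assumed here.

```racket
#lang racket
(require data-frame)

(define df (make-data-frame))
(df-add-series! df (make-series "hr" #:data (vector 101 104 112 118 119 127 135)))

;; Unweighted histogram with slots that are 10 units wide.
(define h
  (df-histogram df "hr"
                #:weight-series #f
                #:bucket-width 10
                #:start 0
                #:stop (df-row-count df)))

;; Each element is a two-element vector: the sample and its rank.
;; df-histogram may return #f, so check before iterating.
(when h
  (for ([slot (in-vector h)])
    (printf "sample ~a: rank ~a\n" (vector-ref slot 0) (vector-ref slot 1))))
```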

procedure

(histogram-renderer histogram 
  [#:color color 
  #:skip skip 
  #:x-min x-min 
  #:label label 
  #:blank-some-labels blank-some-labels? 
  #:x-value-formatter formatter]) 
  (treeof renderer2d?)
  histogram : histogram/c
  color : any/c = #f
  skip : real? = (discrete-histogram-skip)
  x-min : real? = 0
  label : string? = #f
  blank-some-labels? : boolean? = #t
  formatter : (or/c #f (-> number? string?)) = #f
Create a histogram plot renderer from histogram, which is a histogram created by df-histogram.

color determines the color of the histogram bars.

label specifies the label to use for this plot renderer.

skip and x-min are used to plot dual histograms, see histogram-renderer/dual.

All the above arguments are sent directly to the discrete-histogram renderer.

blank-some-labels? controls whether some of the labels are blanked out if the plot contains too many values. This can produce a nicer looking plot.

formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatting. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".
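The pace example can be sketched as follows. The pace->string helper and the "pace" data are made up for illustration, and plot-bitmap from plot/no-gui is used so that the example does not need a GUI.

```racket
#lang racket
(require data-frame plot/no-gui)

;; Hypothetical formatter: turn a pace in seconds into a "M:SS" label.
(define (pace->string seconds)
  (define-values (m s) (quotient/remainder (exact-round seconds) 60))
  (format "~a:~a" m (~r s #:min-width 2 #:pad-string "0")))

;; Made-up pace samples, in seconds per kilometer.
(define df (make-data-frame))
(df-add-series! df (make-series "pace" #:data (vector 295 300 300 310 330 330 330)))

(define h
  (df-histogram df "pace"
                #:weight-series #f
                #:bucket-width 10
                #:start 0
                #:stop (df-row-count df)))

;; Render the histogram to a bitmap, labeling the slots with the formatter.
(define bitmap
  (and h
       (plot-bitmap
        (histogram-renderer h
                            #:label "pace"
                            #:x-value-formatter pace->string))))
```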

procedure

(combine-histograms h1 h2)  combined-histogram/c

  h1 : histogram/c
  h2 : histogram/c
Combine two histograms produced by df-histogram into a single one. The result of this function is intended to be passed to histogram-renderer/dual.

procedure

(histogram-renderer/dual combined-histogram 
  label1 
  label2 
  [#:color1 color1 
  #:color2 color2 
  #:x-value-formatter formatter]) 
  (treeof renderer2d?)
  combined-histogram : combined-histogram/c
  label1 : string?
  label2 : string?
  color1 : any/c = #f
  color2 : any/c = #f
  formatter : (or/c #f (-> number? string?)) = #f
Create a plot renderer that shows two histograms, with each slot side-by-side. The histograms can be produced by df-histogram and combined by combine-histograms.

label1 and color1 represent the label and colors for the first histogram, label2 and color2 represent the label and colors to use for the second histogram.

formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatting. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".
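Putting df-histogram, combine-histograms and histogram-renderer/dual together might look like the sketch below. The series names and data are made up, and plot-bitmap from plot/no-gui is used so that no GUI is needed.

```racket
#lang racket
(require data-frame plot/no-gui)

;; Two made-up heart rate series to compare.
(define df (make-data-frame))
(df-add-series! df (make-series "hr-run" #:data (vector 140 145 150 150 160 165)))
(df-add-series! df (make-series "hr-bike" #:data (vector 120 125 140 145 150 150)))

;; Hypothetical helper: an unweighted histogram with 10-unit slots.
(define (hr-histogram series)
  (df-histogram df series
                #:weight-series #f
                #:bucket-width 10
                #:start 0
                #:stop (df-row-count df)))

(define h1 (hr-histogram "hr-run"))
(define h2 (hr-histogram "hr-bike"))

;; Combine the two histograms, then render them side by side.
(define bitmap
  (and h1 h2
       (plot-bitmap
        (histogram-renderer/dual (combine-histograms h1 h2) "Run" "Bike"))))
```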

procedure

(histogram-renderer/factors histogram 
  factor-fn 
  factor-colors 
  [#:x-value-formatter formatter]) 
  (treeof renderer2d?)
  histogram : histogram/c
  factor-fn : (-> real? symbol?)
  factor-colors : (listof (cons/c symbol? color/c))
  formatter : (or/c #f (-> number? string?)) = #f
Create a histogram renderer where histogram is split into sections by factor-fn and each section is colored according to factor-colors.

formatter controls how the histogram values are displayed. By default, labels for the values are displayed with ~a, but this function can be used for custom formatting. For example, if the values in the histogram represent running pace, the formatter can transform a value of 300 into the label "5:00".

13 GPX Files

 (require data-frame/gpx) package: data-frame

This module provides functions for reading and writing data frames using the GPS Exchange Format (GPX).

procedure

(df-read/gpx input)  data-frame?

  input : (or/c path-string? input-port?)
Construct a data frame from the GPX document specified in input, which is either an input port or a string, in which case it denotes an input file. The data frame will have one or more of the following series:

The data frame will also have the following properties:

All the track segments in the GPX file will be concatenated.

procedure

(df-write/gpx df    
  output    
  [#:name name    
  #:extra-series extra-series    
  #:start start    
  #:stop stop])  any/c
  df : data-frame?
  output : (or/c path-string? output-port?)
  name : (or/c #f string?) = #f
  extra-series : (listof string?)
   = '("hr" "cad" "pwr" "spd" "dst")
  start : exact-nonnegative-integer? = 0
  stop : exact-nonnegative-integer? = (df-row-count df)
Export the GPS track from the data frame df to output, which is either an output port or a string, in which case it denotes a file name.

The data frame is expected to contain the "timestamp", "lat" and "lon" series, and optionally an "alt" or "calt" (corrected altitude) series. In addition to these series, optional heart rate, cadence, speed, power and distance data can also be written out by specifying a list of series names in extra-series. Series which don’t exist are silently discarded, as are series which exist but which the writer does not know how to write out (e.g. there is no "gpxdata:" tag for them).

The entire GPS track is exported as a single track segment, unless start and stop positions are specified, in which case only data between these positions is exported (this can be used to export a subset of the data).

The laps property, if present, is assumed to contain a list of timestamps and the positions corresponding to these timestamps are exported as waypoints.

The name of the segment can be specified as the name parameter. If this is #f, the 'name property in the data frame is consulted. If that one is also missing, a default track name is used.
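A round trip through string ports can be sketched as below, using made-up track points. An "alt" series is included in case the writer needs altitude data, and the series names of the data frame that is read back are printed rather than assumed.

```racket
#lang racket
(require data-frame data-frame/gpx)

;; A tiny made-up track: timestamps are seconds since the UNIX epoch.
(define df (make-data-frame))
(df-add-series! df (make-series "timestamp"
                                #:data (vector 1600000000 1600000010 1600000020)))
(df-add-series! df (make-series "lat" #:data (vector -31.9500 -31.9501 -31.9502)))
(df-add-series! df (make-series "lon" #:data (vector 115.8600 115.8601 115.8602)))
(df-add-series! df (make-series "alt" #:data (vector 10.0 11.0 12.0)))

;; Write the track to a string instead of a file.
(define gpx-text
  (call-with-output-string
   (lambda (out)
     (df-write/gpx df out #:name "Example track"))))

;; Read it back from the string.
(define df2 (df-read/gpx (open-input-string gpx-text)))

(printf "series read back: ~a\n" (df-series-names df2))
(printf "rows read back: ~a\n" (df-row-count df2))
```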

14 TCX Files

 (require data-frame/tcx) package: data-frame

This module provides functions for reading Training Center XML (TCX) files into data frames.

procedure

(df-read/tcx input)  data-frame?

  input : (or/c path-string? input-port?)
Construct a data frame from the first activity in the TCX document specified in input, which is either an input port or a string, in which case it denotes an input file. The data frame will have one or more of the following series:

The data frame may also have the following properties (if they are present in the TCX document):

procedure

(df-read/tcx/multiple input)  (listof data-frame?)

  input : (or/c path-string? input-port?)
Construct a list of data frames, one for each activity in the TCX document specified in input, which is either an input port or a string, in which case it denotes an input file. See df-read/tcx for the contents of each data frame object.
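A sketch of reading every activity from a TCX file follows. The file name "activities.tcx" is hypothetical, so the code checks that the file exists before reading it.

```racket
#lang racket
(require data-frame data-frame/tcx)

;; Hypothetical file name, replace it with a real TCX file.
(define tcx-file "activities.tcx")

;; Print a one-line summary for each activity in the file.
(when (file-exists? tcx-file)
  (for ([df (in-list (df-read/tcx/multiple tcx-file))]
        [n (in-naturals 1)])
    (printf "activity ~a: ~a rows, series: ~a\n"
            n
            (df-row-count df)
            (df-series-names df))))
```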