|(require tabular-asa)||package: tabular-asa|
A fast, efficient, immutable, dataframes implementation.
The source code can be found at https://github.com/massung/tabular-asa.
This is a brief example of loading a table from a CSV, filtering, grouping, aggregating, and plotting the data. Note: This example uses ~> from the threading module for clarity, but Tabular Asa does not require it.
(define books (call-with-input-file "books.csv" table-read/csv)) (let ([df (~> books (table-drop-na '(Publisher)) (table-cut '(Genre Title)) (table-groupby '(Genre)) (group-count))]) (parameterize ([plot-x-tick-label-angle 30] [plot-x-tick-label-anchor 'top-right]) (plot (discrete-histogram (for/list ([x (table-column df 'Genre)] [y (table-column df 'Title)]) (list x y))) #:x-label "Genre" #:y-label "Number of Titles Published")))
Tabular Asa is intended to fulfill the following goals:
Be as efficient as possible sharing memory.
Be as lazy as possible; use sequences and streams.
Be usable for everyday tabular data tasks.
Be extremely simple to understand and extend.
Be flexible by exposing low-level functionality.
Be a learning example for students who want to learn how to implement databases.
Tabular Asa does this by starting with a couple very simple concepts and building on them. In order, those concepts are:
When thinking about tabular data, it’s very common to think of each row (or record) as a thing to be grouped together. However, this is extremely inefficient for most operations; it requires extracting data from a larger collection into a smaller collection for many operations. It is also an inefficient use of cache. For this reason Tabular Asa is column-major.
A simple example of this difference in implementation would be cutting or inserting columns (a SELECT operation in SQL) to a table. Consider the following table of data:
When rows are stored as a sequence or hash, removing or adding a column requires duplicating every single row of data and copying it into a new sequence or hash, essentially doubling the memory usage and increasing the time it takes to perform the operation. However, if the table is stored as 3 columns, then creating a new table with a column added only adds the memory cost of the new column. Selecting a subset of columns is even easier.
Additionally, tables contain an a vector which is the index of which rows it sees. This allows for tables that are filters to simply reference the existing column data, but with a new index. In the above example table, the index would be the vector #(0 1 2). If a new table was generated by filtering the original, keeping only the girls, then the new table would contain all the same column data, but the index would be #(2).
Likewise, the symbol k is used in place of a column name (a symbol?) and the symbol ks is used for a list of column names.
It is important to note that - when reading tables - columns that don’t already exist will be generated on-demand. The column names will be equivelant to calling (gensym "col") to ensure they are unique symbols.
If jsexpr is a JSON object (hash-eq?), then it is assumed to be a hash of columns, where each column contains the values for it.
(table-read/csv port [ #:header? header #:drop-index? drop-index #:separator-char sep #:quote-char quote #:double-quote? double-quote #:comment-char comment #:strip? strip #:na na #:na-values na-values]) → table? port : input-port? header : boolean? = #t drop-index : boolean? = #f sep : char? = #\, quote : char? = #\" double-quote : char? = #t comment : char? = #\# strip : boolean? = #f na : any/c = #f
na-values : (listof string?) = (list "" "." "na" "n/a" "nan" "null")
The header argument - if #t - indicates that the first non-comment row of the CSV should be treated as the list of column names. If false then the column names will be generated as needed.
The drop-index arugment - if #t - assumes that the first column of the CSV is the row index (i.e., an auto-incrementing integer) and shouldn’t be kept. If there is a row index column, and it is not dropped, it’s important to note that it will be treated just like any other column and is NOT used as the table’s index.
The na-values argument is a list of strings that - when parsed as the value for a given cell - are replaced with the na value to indicate N/A (not available). The values in this list are case-insensitive.
(new table-builder% [ [initial-size initial-size] [columns columns] [sort-columns sort-columns]]) → (is-a?/c table-builder%) initial-size : exact-nonnegative-integer? = 5000 columns : (listof symbol?) = '() sort-columns : boolean? = #fCreates a new table-builder% with an initial shape.
The initial-size is how many rows are initially reserved for each column.
The columns is the initial list (and order) of column names. Columns may be added dynamically as rows and records are appended to the table.
The sort-columns parameter makes it so that - upon building the table - the columns are sorted alphabetically. This is useful when building a table from records and you want to ensure a consistent ordering.
(send a-table-builder add-column name [ backfill]) → void? name : symbol? backfill : any/c = #fAppends a new column to the table being built. Any rows already added to the table will be be backfilled with backfill.
Typically, this method need not be called manually, as it will automatically be called as-needed by add-row and add-record.Appends a new row of values to the table. If ks is #f (the default), then the current set of columns (in order) of the table is assumed. If the row contains more values than there are columns then additional columns will be added to the table with generated names.Appends a new row of values to the table. The record is assumed to be a hash-eq? of (k . value) pairings. If the record contains a column name not yet present in the table, a new column is created for it.Builds a new index for the table, truncates the column data and returns it.
This method may be called more than once, and each time it will return a new table. This allows you to do things like add rows, build a table, add more rows and columns, build another table, etc.
(for/table (init-forms ...) (for-clause ...) body-or-break ... body)
The init-forms are the optional, initial forms used to create a table-builder% instance.
Each iteration of body should return either a list? or hash-eq?, which will automatically be sent to the table-builder% using add-row or add-record. It’s also possible to mix and match (i.e., return a list for one iteration and a hash for another).
When the for-clause terminates, the table is built and returned.
(for/table ([initial-size 3] [columns '(hero universe)]) ([i 3]) (case i ((0) '("Superman" "DC")) ((1) '("Wolverine" "Marvel")) ((2) '("Batman" "DC"))))
#:extra-constructor-name make-table) index : (vectorof exact-nonnegative-integer?) data : (listof (cons/c symbol? (vectorof any/c)))
All tables are also sequences and can be iterated using for, where each iteration returns the next index and row (list). For example:
(define df (table #(0 1 2) '((hero . #("Superman" "Batman" "Wonder Woman")) (gender . #(m m f))))) (for ([(i row) df]) (displayln row))
(table-preview proc) → void? proc : (table? output-port? -> void?)
#<table [359 rows x 8 cols]>
If the data sequence contains fewer elements than there are rows in the table, then the extra rows will be filled with #f for the new column. Likewise, if data contains more values than there are rows, the extra values will be dropped.
It’s important to note that the vector for the column generated will be as large as necessary for it to be indexed properly by the table! For example, let’s say you begin with a table that has 1M rows and filter it, which returns a new table with a single row. If the index of that single row is 999999 (i.e., the last row), then adding a new column will create a vector with 1M entries, all but one of which will contain #f.
(table-with-columns-renamed df #hasheq((person . name)))
For a more concrete example of this, imagine a brand new table with a single column of 3 value: #(a b c); it has an index of #(0 1 2). Now, reverse the table; the index is now #(2 1 0). Calling (table-irow df 2) will return '(c), because that’s the value at index 2. However, calling (table-row df 2) will return '(a), because that’s the value of the third row; the table has been reversed.
This can be seen in action easily with the following code:
(let ([df (table-reverse (table-with-column empty-table '(a b c)))]) (for ([(i row) df] [n (in-naturals)]) (displayln (format " for ~a = ~v" i row)) (displayln (format "irow ~a = ~v" i (table-irow df i))) (displayln (format " row ~a = ~v" n (table-row df n)))))
The for loop follows the index, so index 2 should be output first, which is reference row 0. The last index output is 0, which is the last reference row (2).
See the comment for table-irow for more details.
df : table? proc : procedure? ks : (or/c (non-empty-listof symbol?) #f) = #f
df : table? proc : procedure? ks : (or/c (non-empty-listof symbol?) #f) = #f
(table-update df k proc [ #:ignore-na? ignore-na]) → table? df : table? k : symbol? proc : procedure? ignore-na : boolean? = #t
If ignore-na is #t (the default), then all #f values are returned as #f instead of being updated.
df : table? proc : (any/c any/c -> any/c) i : any/c final : (any/c -> any/c) = identity
(table-groupby df ks [less-than?])
→ (sequence/c (listof (list/c symbol? any/c)) table?) df : table? ks : (non-empty-listof symbol?)
less-than? : (or/c (any/c any/c -> boolean?) #f) = sort-ascending
df : table? ks : (or/c (non-empty-listof symbol?) #f) = #f less-than? : (any/c any/c -> boolean?) = sort-ascending
(table-join/inner df other on [ less-than? #:with with]) → table? df : table? other : table? on : (non-empty-listof symbol?) less-than? : (any/c any/c -> boolean?) = sort-ascending with : (non-empty-listof symbol?) = on
(table-join/outer df other on [ less-than? #:with with]) → table? df : table? other : table? on : (non-empty-listof symbol?) less-than? : (any/c any/c -> boolean?) = sort-ascending with : (non-empty-listof symbol?) = on
(pretty-print-rows) → (or/c exact-nonnegative-integer? #f)
(pretty-print-rows rows) → void? rows : (or/c exact-nonnegative-integer? #f)
(write-table df [ port mode #:keep-index? keep-index]) → void? df : table? port : output-port? = (current-output-port) mode : boolean? = #t keep-index : boolean? = #t
(print-table df [ port #:keep-index? keep-index]) → void? df : table? port : output-port? = (current-output-port) keep-index : boolean? = #t
(display-table df [ port #:keep-index? keep-index]) → void? df : table? port : output-port? = (current-output-port) keep-index : boolean? = #t
df : table? port : output-port? = (current-output-port)
(table-write/csv df [ port #:keep-index? keep-index #:header? header #:separator-char sep #:quote-char quote #:escape-char escape #:list-char list-sep #:double-quote? double-quote #:na-rep na #:na-values na-values]) → void? df : table? port : output-port? = (current-output-port) keep-index : boolean? = #t header : boolean? = #t sep : char? = #\, quote : char? = #\" escape : char? = #\\ list-sep : char? = #\| double-quote : boolean? = #t na : string? = "" na-values : (listof any/c) = (quote(#f))
(table-write/json df [ port #:orient orient #:lines? lines #:na-rep na]) → void? df : table? port : output-port? = (current-output-port) orient : (or/c 'records 'columns) = 'records lines : boolean? = #t na : any/c = (json-null)
If orient is 'records (the default) then every row of the table is written as an array of JSON objects. If lines is #t (and orient is 'records) then instead of an array, then each row is written as a record on each line.
If orient is 'columns then the table is written as a single JSON object, where each key/value pairing is the column name and an array of values.
The na determines what Racket value in written out as a JSON null.
#:extra-constructor-name make-column) name : symbol? index : (vectorof exact-nonnegative-integer?) data : (vectorof any/c)
empty-column : column?
col : column? less-than? : ((any/c any/c) -> boolean?) = sort-ascending
group : (sequence/c (listof (list/c symbol? any/c)) table?) less-than? : (any/c any/c -> boolean?) = sort-ascending
group : (sequence/c (listof (list/c symbol? any/c)) table?) greater-than? : (any/c any/c -> boolean?) = sort-descending
#:extra-constructor-name make-index) keys : (vectorof (cons/c any/c (vectorof exact-nonnegative-integer?))) less-than? : (or/c ((any/c any/c) -> boolean?) #f)
data : (sequenceof any/c)
less-than? : (or/c ((any/c any/c) -> boolean?) #f) = sort-ascending
ix : index? from : any/c = #f to : any/c = #f
If the index is not sorted then the order of the sequence returned is undefined.
ix : index? from : any/c = #f to : any/c = #f
If key is found, then the reference index to the keys is returned. When not found, if exact is #t then #f is returned. Otherwise, the next higher index for the next key is returned.
(index-member ix key)
→ (or/c (list/c any/c exact-nonnegative-integer? ...) #f) ix : index? key : any/c
ix : index? n : exact-nonnegative-integer?
ix : index? v : (vectorof any/c) from : any/c = #f to : any/c = #f
All functions that allow for sorting (e.g. table-sort) or indexing/grouping take an optional less-than? compare function. Tabular Asa comes with a generic orderable interface with sort-ascending and sort-descending functions for the following basic types:
Both generic functions will always sort #f values last regardless of sort direction.