On this page:
1.1 Types and Predicates
data-set?
data-set-field?
partition-id?
1.2 Construction
load-data-set
supported-formats
make-feature
make-classifier
1.3 Accessors
classifiers
classifier-product
features
data-count
feature-vector
partition-count
partition
default-partition
test-partition
training-partition
1.4 Transformations
partition-equally
partition-for-test
minimum-partition-data-total
minimum-partition-data
1.5 Snapshots
write-snapshot
read-snapshot

1 Module rml/data

 (require rml/data) package: rml-core

This module deals with two opaque structure types, data-set and data-set-field. These are not available to clients directly, although certain accessors are exported by this module. Conceptually a data-set is a table of data; its columns represent fields that are either features, which describe properties of an instance, or classifiers (labels), which are used to train and match instances.

Examples:
> (require rml/data)
> (define dataset
    (load-data-set "test/iris_training_data.csv"
                   'csv
                   (list
                    (make-feature "sepal-length" #:index 0)
                    (make-feature "sepal-width" #:index 1)
                    (make-feature "petal-length" #:index 2)
                    (make-feature "petal-width" #:index 3)
                    (make-classifier "classification" #:index 4))))
> (displayln (data-set? dataset))

#t

> (displayln (features dataset))

(sepal-length sepal-width petal-length petal-width)

> (displayln (classifiers dataset))

(classification)

> (displayln (partition-count dataset))

1

> (displayln (data-count dataset))

135

> (displayln (classifier-product dataset))

(Iris-versicolor Iris-virginica Iris-setosa)

In this code block a training data set is loaded and the columns within the CSV data are described.

1.1 Types and Predicates

predicate

(data-set? a)  boolean?

  a : any
Determines whether the value a is a data-set structure, primarily used as a contract predicate.

predicate

(data-set-field? a)  boolean?

  a : any
Determines whether the value a is a data-set-field structure, primarily used as a contract predicate.

predicate

(partition-id? a)  boolean?

  a : any
Determines whether the value a is a partition identifier, primarily used as a contract predicate.
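These predicates are most useful in contracts and input checks. A minimal sketch, assuming the package is installed:

```racket
#lang racket
(require rml/data)

;; A helper that insists on being given a data-set; the predicate
;; doubles as a contract on the dataset argument.
(define/contract (row-count dataset)
  (-> data-set? exact-nonnegative-integer?)
  (data-count dataset))

(define field (make-feature "sepal-length" #:index 0))
(data-set-field? field)   ; => #t
(data-set? field)         ; => #f
```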

1.2 Construction

procedure

(load-data-set file-name format fields)  data-set?

  file-name : string?
  format : symbol?
  fields : (listof data-set-field?)
Returns a new data-set, with the specified features and classifiers, from the specified file.

value

supported-formats : (listof symbol?)

A list of the file formats supported by the load-data-set function.
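Because supported-formats is a plain list of symbols, a caller can guard a load with a membership test. A sketch; load-if-supported is a hypothetical helper, not part of the module:

```racket
#lang racket
(require rml/data)

;; Fail early, before calling load-data-set, if the requested format
;; symbol is not one this build of rml-core understands.
(define (load-if-supported file-name format fields)
  (unless (member format supported-formats)
    (error 'load-if-supported "unsupported format: ~a" format))
  (load-data-set file-name format fields))
```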

constructor

(make-feature name [#:index index])  data-set-field?

  name : string?
  index : exact-nonnegative-integer? = 0
Create a new data-set-field representing a feature, with the name name and the source column index index. The index value is important for formats, such as CSV, that do not support name mapping.

constructor

(make-classifier name [#:index index])  data-set-field?

  name : string?
  index : exact-nonnegative-integer? = 0
Create a new data-set-field representing a classifier, with the name name and the source column index index. The index value is important for formats, such as CSV, that do not support name mapping.

1.3 Accessors

accessor

(classifiers dataset)  (listof string?)

  dataset : data-set?
The names of all classifier fields in the data set.

accessor

(classifier-product dataset)  (listof string?)

  dataset : data-set?
Returns a list in which each element is one combination from the Cartesian product of the unique values of the classifier fields. All classifier values are treated as strings, and the components of each combination are joined with the Unicode times character "⨉".
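With the single classifier in the iris example above, the product is simply the list of unique classification values. The "⨉" separator only appears once there are two or more classifiers; the fields and values below are hypothetical:

```racket
;; Hypothetical data set with two classifier fields:
;;   "color" with unique values {red, blue}
;;   "size"  with unique values {small, large}
;; (classifier-product dataset) would then contain the four combinations:
;;   (red⨉small red⨉large blue⨉small blue⨉large)
```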

accessor

(features dataset)  (listof string?)

  dataset : data-set?
The names of all features in the data set.

accessor

(data-count dataset)  exact-nonnegative-integer?

  dataset : data-set?
The number of data rows in the data set, across all partitions.

accessor

(feature-vector dataset    
  partition-id    
  feature-name)  (vectorof number?)
  dataset : data-set?
  partition-id : exact-nonnegative-integer?
  feature-name : string?
The vector of underlying data, in the given partition, for the feature feature-name.
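The returned vector is ordinary Racket data, so standard vector operations apply; for example, computing a feature's mean over the default partition. A sketch, assuming the iris CSV file from the introduction is available:

```racket
#lang racket
(require rml/data)

(define dataset
  (load-data-set "test/iris_training_data.csv" 'csv
                 (list (make-feature "sepal-length" #:index 0)
                       (make-feature "sepal-width" #:index 1)
                       (make-feature "petal-length" #:index 2)
                       (make-feature "petal-width" #:index 3)
                       (make-classifier "classification" #:index 4))))

;; Mean of one feature across the default partition.
(define sepal-lengths
  (feature-vector dataset default-partition "sepal-length"))
(/ (for/sum ([x sepal-lengths]) x)
   (vector-length sepal-lengths))
```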

accessor

(partition-count dataset)  exact-nonnegative-integer?

  dataset : data-set?
The number of partitions in the data set; when initially created this is usually 1.

accessor

(partition dataset partition-id)  (vectorof vector?)

  dataset : data-set?
  partition-id : exact-nonnegative-integer?
The partition data itself (a vector of feature vectors).

value

default-partition : exact-nonnegative-integer?

The identifier for the default partition created by load-data-set.

value

test-partition : exact-nonnegative-integer?

The identifier for the default test data partition created by partition-for-test.

value

training-partition : exact-nonnegative-integer?

The identifier for the default training data partition created by partition-for-test.

1.4 Transformations

The following procedures perform transformations on one or more data-set structures and return a new data-set. These are typically concerned with partitioning a data set or optimizing the feature vectors.

procedure

(partition-equally partition-count    
  [entropy-features])  data-set?
  partition-count : exact-positive-integer?
  entropy-features : (listof string?) = '()
Return a new data-set that attempts to partition the original data into partition-count equal groups (equal in number of rows in their feature vectors). If specified, the entropy-features list denotes the names of features, or classifiers, that should be randomly spread across partitions.
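Equal partitioning is useful for cross-validation-style workflows. A sketch, assuming dataset was loaded as in the introductory example and assuming the source data set is passed as the first argument (the signature above omits it):

```racket
;; Split into 5 roughly equal partitions, spreading the classification
;; values randomly so that each partition sees every class.
;; (The dataset-first argument order is an assumption.)
(define folds (partition-equally dataset 5 (list "classification")))
(partition-count folds)
```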

procedure

(partition-for-test test-percentage    
  [entropy-features])  data-set?
  test-percentage : (real-in 1.0 50.0)
  entropy-features : (listof string?) = '()
Return a new data-set that attempts to partition the original data into two new partitions, with test-percentage of rows separated out to act as test data and the remainder as training data.

If specified, the entropy-features list denotes the names of features, or classifiers, that should be randomly spread across partitions.
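After the split, the two partitions can be addressed with the test-partition and training-partition identifiers documented above. A sketch, assuming dataset was loaded as in the introductory example and assuming the source data set is passed as the first argument (the signature above omits it):

```racket
;; Hold out 20% of the rows as test data.
;; (The dataset-first argument order is an assumption.)
(define split (partition-for-test dataset 20.0))

;; Fetch the same feature from each side of the split.
(feature-vector split training-partition "sepal-length")
(feature-vector split test-partition "sepal-length")
```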

parameter

(minimum-partition-data-total)  exact-positive-integer?

(minimum-partition-data-total partition-data-count)  void?
  partition-data-count : exact-positive-integer?
 = 100
This parameter controls the partition-equally and partition-for-test functions; it denotes the minimum number of rows a source partition must contain before it makes sense to sub-divide it.

parameter

(minimum-partition-data)  exact-positive-integer?

(minimum-partition-data partition-data-count)  void?
  partition-data-count : exact-positive-integer?
 = 100
This parameter controls the partition-equally and partition-for-test functions; it denotes the minimum number of rows that must result in each constructed partition.
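As standard Racket parameters, both can be adjusted with parameterize around a single partitioning call. A sketch; dataset, and the assumption that partition-for-test takes it as a first argument, are carried over from the introductory example:

```racket
;; Temporarily relax both minimums so a small data set can be split.
;; (The dataset-first argument order is an assumption.)
(parameterize ([minimum-partition-data-total 50]
               [minimum-partition-data 10])
  (partition-for-test dataset 20.0))
```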

1.5 Snapshots

Loading and manipulating data sets from source files may not always be efficient, so the parsed in-memory form can be saved and loaded externally. These saved forms are termed snapshots; they are serialized forms of the data-set structure.

io

(write-snapshot dataset out)  void?

  dataset : data-set?
  out : output-port?
Write a snapshot of the data set dataset to the output port out. The snapshot also contains a version number representing the data set structure; this ensures that the snapshot can be read correctly in the future.

io

(read-snapshot dataset in)  data-set?

  dataset : data-set?
  in : input-port?
Read a snapshot from the input port in, returning a data-set structure. Reading raises an exception if the data set version number is incompatible.
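A round trip through a snapshot file might look like the following sketch; the file name is illustrative, and dataset is assumed to have been loaded as in the introductory example:

```racket
;; Write the parsed data set to a snapshot file...
(call-with-output-file "iris.snapshot"
  (lambda (out) (write-snapshot dataset out))
  #:exists 'replace)

;; ...and later restore it without re-parsing the CSV source.
(define restored
  (call-with-input-file "iris.snapshot"
    (lambda (in) (read-snapshot dataset in))))
(data-count restored)
```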