 (require rml/data)    package: rml-core
This module deals with two opaque structure types, data-set and data-set-field. Neither is available to clients directly, although this module exports certain accessors for them. Conceptually, a data-set is a table of data whose columns represent fields. Each field is either a feature, which represents a property of an instance, or a classifier (label), which is used to train and match instances.
> (require rml/data)
> (define dataset
    (load-data-set "test/iris_training_data.csv"
                   'csv
                   (list
                     (make-feature "sepal-length" #:index 0)
                     (make-feature "sepal-width" #:index 1)
                     (make-feature "petal-length" #:index 2)
                     (make-feature "petal-width" #:index 3)
                     (make-classifier "classification" #:index 4))))
> (displayln (data-set? dataset))
> (displayln (features dataset))
(sepal-length sepal-width petal-length petal-width)
> (displayln (classifiers dataset))
> (displayln (partition-count dataset))
> (displayln (data-count dataset))
> (displayln (classifier-product dataset))
(Iris-versicolor Iris-virginica Iris-setosa)
The example above loads a training data set from a CSV file and describes each column of the CSV data, by index, as either a feature or a classifier.
(load-data-set file-name format fields) → data-set?
  file-name : string?
  format : symbol?
  fields : (listof data-set-field?)
(feature-vector dataset partition-id feature-name) → (vectorof number?)
  dataset : data-set?
  partition-id : exact-nonnegative-integer?
  feature-name : string?
dataset : data-set? partition-id : exact-nonnegative-integer?
default-partition : exact-nonnegative-integer?
test-partition : exact-nonnegative-integer?
training-partition : exact-nonnegative-integer?
The following procedures perform transformations on one or more data-set structures and return a new data-set. These are typically concerned with partitioning a data set or optimizing the feature vectors.
(partition-equally partition-count [entropy-features]) → data-set?
  partition-count : exact-positive-integer?
  entropy-features : (listof string?) = '()
(partition-for-test test-percentage [entropy-features]) → data-set?
  test-percentage : (real-in 1.0 50.0)
  entropy-features : (listof string?) = '()
If specified, the entropy-features list denotes the names of features, or classifiers, that should be randomly spread across partitions.
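To make the idea of a test split concrete, the following is a minimal sketch in plain Racket of dividing a list of rows by a test percentage. It is not the implementation of partition-for-test: split-for-test is a hypothetical helper, the rows are ordinary list elements rather than a data-set, and a simple shuffle stands in for the entropy-features handling.

```racket
#lang racket/base
(require racket/list)

;; Hypothetical helper, not part of rml/data.
;; Shuffles the rows, then takes test-percentage percent of them as the
;; test rows and leaves the remainder as training rows.
(define (split-for-test rows test-percentage)
  (define test-size
    (inexact->exact (round (* (length rows) (/ test-percentage 100.0)))))
  (define-values (test-rows training-rows)
    (split-at (shuffle rows) test-size))
  (values training-rows test-rows))

;; Ten rows with a 20% test split: eight training rows, two test rows.
(define-values (training-rows test-rows)
  (split-for-test (range 10) 20.0))
(length training-rows) ; → 8
(length test-rows)     ; → 2
```

Because the percentage is at most 50.0 in the contract above, the test split in this sketch never exceeds the training split.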
(minimum-partition-data-total) → exact-positive-integer?
(minimum-partition-data-total partition-data-count) → void?
  partition-data-count : exact-positive-integer?
(minimum-partition-data) → exact-positive-integer?
(minimum-partition-data partition-data-count) → void?
  partition-data-count : exact-positive-integer?
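The paired getter and setter signatures above have the shape of Racket parameters. Assuming that is what they are, the sketch below shows how such a value is read, temporarily overridden, and set. The parameter minimum-rows and its default of 100 are hypothetical stand-ins, not values taken from rml/data.

```racket
#lang racket/base

;; Hypothetical stand-in for a parameter such as minimum-partition-data.
;; The guard mirrors the exact-positive-integer? contract shown above.
(define minimum-rows
  (make-parameter
   100
   (lambda (v)
     (unless (exact-positive-integer? v)
       (raise-argument-error 'minimum-rows "exact-positive-integer?" v))
     v)))

(minimum-rows)                    ; read the current value → 100

(parameterize ([minimum-rows 10]) ; override only within this body
  (minimum-rows))                 ; → 10

(minimum-rows)                    ; the override has ended → 100

(minimum-rows 50)                 ; set the value for later reads
(minimum-rows)                    ; → 50
```

Using parameterize keeps a change local to one computation, which is usually preferable to setting the value directly.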
Loading and manipulating data sets from source files may not always be efficient, so the parsed in-memory format can be saved to and loaded from external storage. These saved forms are termed snapshots; they are serialized forms of the data-set structure.
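As an illustration of the snapshot idea only, the sketch below round-trips a small table through a port using a prefab structure with write and read. The table structure and the save-table and load-table helpers are hypothetical; the format rml/data actually uses for snapshots is not shown here.

```racket
#lang racket/base

;; Hypothetical stand-in for a parsed data set; this is NOT the rml/data
;; data-set structure. Prefab structures can be written and read back as-is.
(struct table (names rows) #:prefab)

;; Write the in-memory form to an output port.
(define (save-table t out)
  (write t out))

;; Read a previously saved form back from an input port.
(define (load-table in)
  (read in))

(define original
  (table '("sepal-length" "sepal-width") '((5.1 3.5) (4.9 3.0))))

;; Round trip through a string port; a file port would work the same way.
(define out (open-output-string))
(save-table original out)
(define restored
  (load-table (open-input-string (get-output-string out))))

(equal? original restored) ; the restored table has the same contents
```

Reading the saved form back skips any parsing of the original source file, which is the cost a snapshot is meant to avoid.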