4 Module rml/statistics

This module computes summary statistics over the values of the features in a data set. It assumes that the features are numeric and delegates the actual calculations to the math/statistics module.

Examples:

> (require rml/data rml/statistics)
> (define iris-data-set
    (load-data-set "test/iris_training_data.csv"
                   'csv
                   (list
                    (make-feature "sepal-length" #:index 0)
                    (make-feature "sepal-width" #:index 1)
                    (make-feature "petal-length" #:index 2)
                    (make-feature "petal-width" #:index 3)
                    (make-classifier "classification" #:index 4))))
> (define stats (compute-statistics iris-data-set))
> stats
'#hash(("petal-length" . #<future>)
       ("petal-width" . #<future>)
       ("sepal-length" . #<future>)
       ("sepal-width" . #<future>))
> (feature-statistics stats "sepal-length")
(statistics 4.4 7.9 135.0 ...)
> (standardize-statistics iris-data-set stats)
#<data-set>

predicate
(statistics-hash? a) → boolean?
a : any?

Returns #t if the value a is a hash of strings to statistics computations.

procedure
(compute-statistics dataset [feature-names]) → statistics-hash?
dataset : data-set?
feature-names : (or/c #f (listof string?)) = #f

Initiates the calculation of statistics for each feature named in feature-names or all features in the passed data-set if feature-names is #f.

The calculations run concurrently. The result is a hash from feature names (strings) to statistics structures, or to a future where the computation has not yet completed. For any feature present in the hash, the feature-statistics accessor returns a statistics structure, never a future.
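The following is a minimal sketch of the general pattern described above, not the module's actual implementation. It assumes the hash values are ordinary racket/future futures (as the `#<future>` output in the example suggests); the names `columns`, `pending`, and `resolve` are hypothetical and exist only for illustration.

```racket
#lang racket
(require racket/future math/statistics)

;; Hypothetical per-feature values; not taken from the iris data.
(define columns
  (hash "a" '(1.0 2.0 3.0)
        "b" '(10.0 20.0 30.0)))

;; Start one future per feature. Each future folds its values
;; into a math/statistics `statistics` value.
(define pending
  (for/hash ([(name xs) (in-hash columns)])
    (values name
            (future (λ () (update-statistics* empty-statistics xs))))))

;; Hypothetical helper: `touch` waits for a future and yields its
;; result; a value that is not a future is returned unchanged.
;; A missing name yields #f.
(define (resolve h name)
  (define v (hash-ref h name #f))
  (if (future? v) (touch v) v))

(statistics-mean (resolve pending "a"))
```

Printing `pending` directly would show futures, much like the `stats` value in the example session, whereas going through a helper such as `resolve` always yields the computed result.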

accessor
(feature-statistics stats-hash
feature-name) → (or/c #f statistics?)
stats-hash : statistics-hash?
feature-name : string?

Returns the statistics structure for the feature named feature-name. If the provided name is not a key in the underlying hash, #f is returned.
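Since the module states that it relies on math/statistics, the returned structure can presumably be inspected with that library's accessors. The sketch below builds a statistics value directly from a few illustrative numbers (not the iris data) rather than calling feature-statistics, so it runs without the rml packages.

```racket
#lang racket
(require math/statistics)

;; Illustrative values only; `s` stands in for the structure that
;; feature-statistics would return for one feature.
(define s (update-statistics* empty-statistics '(4.4 5.1 6.3 7.9)))

(statistics-min s)     ; smallest value seen
(statistics-max s)     ; largest value seen
(statistics-count s)   ; number of values, as a flonum
(statistics-mean s)    ; arithmetic mean
(statistics-stddev s)  ; standard deviation
```

The first three fields correspond to the values printed in the example session, `(statistics 4.4 7.9 135.0 ...)`.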

transform
(standardize-statistics dataset
statistics-hash) → data-set?
dataset : data-set?
statistics-hash : statistics-hash?

Standardization requires that statistics have been computed for every feature included in statistics-hash. It normalizes the feature values to reduce the effect of large outlier values and to enable more efficient distance measures.

From Scholarpedia:

… removes scale effects caused by use of features with different measurement scales. For example, if one feature is based on patient weight in units of kg and another feature is based on blood protein values in units of ng/dL in the range [-3,3], then patient weight will have a much greater influence on the distance between samples and may bias the performance of the classifier. Standardization transforms raw feature values into z-scores using the mean and standard deviation of a feature values over all input samples
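The z-score transformation mentioned in the quotation maps each value x to (x - mean) / stddev. The sketch below shows that calculation for a single feature using math/statistics. It is an illustration under assumptions, not the body of standardize-statistics: the sample values and the names `xs`, `mu`, `sigma`, and `z-score` are hypothetical.

```racket
#lang racket
(require math/statistics)

;; Hypothetical raw values for one feature.
(define xs '(2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0))

;; Summarize the feature once, then reuse the mean and the
;; standard deviation for every value.
(define s (update-statistics* empty-statistics xs))
(define mu (statistics-mean s))
(define sigma (statistics-stddev s))

;; z = (x - mean) / stddev
(define (z-score x)
  (/ (- x mu) sigma))

(define zs (map z-score xs))
zs
```

After the transformation the values have a mean of approximately 0 and a standard deviation of approximately 1, so features measured on very different scales contribute comparably to a distance measure.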
