0.5.91

3 High-level Corpus Functionality

The bindings documented in this section are provided by ricoeur/tei, but not by ricoeur/tei/base.

Many applications work with entire collections of TEI documents at least as often as with individual documents. This library provides corpus objects (instances of corpus% or a subclass) to bundle collections of TEI documents with related functionality. The corpus object system is also the primary hook for tools to integrate with the larger Digital Ricœur application architecture.

3.1 Working with Corpus Objects

parameter

(current-corpus) → (is-a?/c corpus%)

(current-corpus corpus) → void?
  corpus : (is-a?/c corpus%)
 = empty-corpus
Contains a corpus object for use by high-level functions like get-instance-info-set and get-checksum-table.

In practice, this parameter should usually be initialized with a directory-corpus% instance.

procedure

(get-instance-info-set) → (instance-set/c)
Returns an instance set containing an instance info value for each TEI document encapsulated by (current-corpus).

Note that the returned instance set does not contain the TEI document values with which the corpus was created. Corpus objects generally avoid retaining their encapsulated TEI document values after initialization. Currently, the result of (get-instance-info-set) always satisfies (instance-set/c plain-instance-info?), but that is not guaranteed to be true in future versions of this library.

procedure

(get-checksum-table) → checksum-table/c

value

checksum-table/c : flat-contract?

 = 
(hash/c symbol?
        symbol?
        #:immutable #t)
Returns an immutable hash table summarizing the identity of the TEI documents encapsulated by (current-corpus).

For each TEI document doc, the returned hash table will have a key of (instance-title/symbol doc) mapped to the value (tei-document-checksum doc). Thus, any two corpus objects that return equal? hash tables, even across runs of the program, are guaranteed to encapsulate the very same TEI documents.

procedure

(corpus-get-instance-info-set corpus) → (instance-set/c)

  corpus : (is-a?/c corpus%)

procedure

(corpus-get-checksum-table corpus) → checksum-table/c

  corpus : (is-a?/c corpus%)
Like get-instance-info-set and get-checksum-table, respectively, but using corpus instead of (current-corpus).
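
As a rough sketch of how the parameter-based and explicit-argument functions relate, the following assumes a hypothetical directory (texts-dir is an illustrative name, not part of the library) that exists and contains TEI XML files:

```racket
#lang racket/base
(require racket/class
         ricoeur/tei)

;; Hypothetical location, not part of the library: substitute a real
;; directory, since the constructor's contract requires that it exist.
(define texts-dir "/path/to/tei-documents")

(define my-corpus
  (new directory-corpus% [path texts-dir]))

;; Within the body, the parameter-based functions consult my-corpus.
(parameterize ([current-corpus my-corpus])
  (equal? (get-checksum-table)
          (corpus-get-checksum-table my-corpus)))
```

Going by the descriptions above, the final expression would be expected to produce #t, since both calls summarize the same corpus object.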

3.2 Creating Corpus Objects

class

corpus% : class?

  superclass: object%

A corpus object is an instance of corpus% or of a subclass of corpus%.

For many purposes, directory-corpus% offers more convenient initialization than corpus%.

Note that creating a new instance of corpus% often involves a fair amount of overhead, so creating redundant values should be avoided. Reusing corpus objects may also improve search performance through caching, for example.

constructor

(new corpus% [[docs docs]]) → (is-a?/c corpus%)

  docs : (instance-set/c tei-document?) = (instance-set)
Constructs a corpus object encapsulating docs.

method

(send a-corpus get-instance-info-set) → (instance-set/c)

This method is final, so it cannot be overridden.

method

(send a-corpus get-checksum-table) → checksum-table/c

This method is final, so it cannot be overridden.

value

empty-corpus : (is-a?/c corpus%)
An empty corpus object used as the default value of the current-corpus parameter.

With empty-corpus, get-instance-info-set always returns (instance-set) and get-checksum-table always returns #hasheq().
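
A minimal sketch of the default behavior, assuming nothing else in the program has changed current-corpus:

```racket
#lang racket/base
(require ricoeur/tei)

;; With no enclosing parameterize, the parameter-based function
;; consults the default value, empty-corpus.
(hash-count (get-checksum-table))

;; The explicit-argument variant can be given empty-corpus directly.
(hash-count (corpus-get-checksum-table empty-corpus))
```

Per the description above, both tables should have no entries.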

class

directory-corpus% : class?

  superclass: corpus%

Extends corpus% for the common case of using TEI documents from some directory in the filesystem.

constructor

(new directory-corpus%
    [path path]
    [[search-backend search-backend]])
 → (is-a?/c directory-corpus%)
  path : 
(and/c path-string-immutable/c
       directory-exists?)
  search-backend : search-backend/c = '(eager noop)
Constructs a corpus object from every file in path, including recursive subdirectories, that is recognized by xml-path?. If any such file is not a valid and well-formed TEI XML file satisfying Digital Ricœur’s specification, it will be silently ignored. If more than one of the resulting TEI document values correspond to the same instance, one will be chosen in an unspecified manner and the others will be silently ignored.

If path is a relative path, it is resolved relative to (current-directory).

The search-backend argument determines the search backend as with corpus%.

3.3 Deriving New Corpus Classes

Clients of this library will want to extend the corpus object system to support additional features by implementing new classes derived from corpus%. There are two main points where derived classes will want to interpose on corpus%’s initialization:
  1. A few classes, like directory-corpus%, will want to supply an alternate means of constructing the full instance set of TEI documents to be encapsulated by the corpus object. This is easily done using standard features of the racket/class object system, such as init and super-new, to control the initialization of the base class.

  2. More often, derived classes will want to use the complete instance set of TEI documents to initialize some extended functionality: for example, corpus% itself extends a primitive, unexported class this way to initialize a searchable document set. The ricoeur/tei library provides special support for these kinds of extensions through three syntactic forms: corpus-mixin, corpus-mixin+interface, and define-corpus-mixin+interface. Most clients should use define-corpus-mixin+interface, but it is best understood as an extension of the simpler forms.
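
The first kind of extension can be sketched with ordinary racket/class features. The class name thunk-corpus% and the init argument make-docs below are hypothetical, chosen only for illustration:

```racket
#lang racket/base
(require racket/class
         ricoeur/tei)

;; Hypothetical class and init argument names, for illustration only.
(define thunk-corpus%
  (class corpus%
    ;; make-docs: a thunk returning the instance set to encapsulate.
    ;; By default it produces an empty instance set.
    (init [make-docs (λ () (instance-set))])
    ;; Hand the computed instance set to corpus% as its docs argument.
    (super-new [docs (make-docs)])))

(define c (new thunk-corpus%))
```

Because make-docs is declared with init rather than init-field, the thunk is not stored in a field of the object.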

syntax

(corpus-mixin [from<%> ...] [to<%> ...]
  mixin-clause ...+)
 
  from<%> : interface?
  to<%> : interface?
Like mixin, but cooperates with corpus% and the super-docs and super-docs-evt forms to provide access to the encapsulated instance set of TEI documents as a “virtual” initialization variable. The corpus<%> interface is implicitly added to corpus-mixin’s from<%> interfaces.

Most clients should use the higher-level corpus-mixin+interface or define-corpus-mixin+interface, rather than using corpus-mixin directly.

A key design consideration is that a corpus% instance does not keep its TEI documents reachable after its initialization, as TEI document values can be rather large. Derived classes are urged to follow this practice: they should initialize whatever state they need for their extended functionality, but they should allow the TEI documents to be garbage-collected as soon as possible.

Concretely, this means that corpus% does not store the instance set of TEI documents in a field (neither public nor private), as objects’ fields are reachable after initialization.

Instead, derived classes can access the instance set of TEI documents during initialization using super-docs or super-docs-evt:

syntax

(super-docs)

Within corpus-mixin and related forms, evaluates to the full instance set of TEI documents to be encapsulated by the corpus object as a “virtual” initialization variable: using (super-docs) anywhere that an initialization variable is not allowed is a syntax error.

The instance set of TEI documents is created by the corpus% constructor: evaluating (super-docs) before the superclass constructor has been called (e.g. via super-new) will raise an exception, analogous to accessing an uninitialized field.

syntax

(super-docs-evt)

Within corpus-mixin and related forms, similar to super-docs, but evaluates to a synchronizable event whose synchronization result is the instance set of TEI documents.

Unlike (super-docs), (super-docs-evt) may be evaluated before the superclass constructor is called and may immediately be used with sync in a background thread (e.g. via delay/thread). The event will become ready for synchronization when the corpus% constructor is called. Note that (begin (sync (super-docs-evt)) (super-new)) will block forever.

The events produced by (super-docs-evt) can be recognized by the predicate super-docs-evt? and satisfy the contract (evt/c (instance-set/c tei-document?)).

Examples:
> (define printing-corpus-mixin
    (corpus-mixin [] []
      (super-new)
      (printf "These are the docs!\n  ~v\n"
              (set->list (super-docs)))))
> (new (printing-corpus-mixin corpus%))
These are the docs!
  '()
(wrapper-object:printing-corpus-mixin ...)

procedure

(super-docs-evt? v) → any/c

  v : any/c
Recognizes values produced by super-docs-evt.
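
The background-thread pattern described for super-docs-evt might look roughly like the following sketch. The names counting-corpus-mixin and get-doc-count are hypothetical, and the sketch assumes that corpus-mixin accepts a define/public clause as mixin does:

```racket
#lang racket/base
(require racket/class
         racket/promise
         racket/set
         ricoeur/tei)

;; Hypothetical mixin: counts the documents in a background thread.
(define counting-corpus-mixin
  (corpus-mixin [] []
    ;; Obtain the event before super-new and wait on it in another
    ;; thread; the initializing thread itself must not block on it.
    (define count-promise
      (let ([evt (super-docs-evt)])
        (delay/thread
         (length (set->list (sync evt))))))
    ;; Calling the superclass constructor makes the event ready.
    (super-new)
    ;; Only the count is retained, not the documents themselves.
    (define/public (get-doc-count)
      (force count-promise))))

(define c (new (counting-corpus-mixin corpus%)))
(send c get-doc-count)
```

The event is obtained in the initializing thread and only the sync happens in the background, which keeps the use of (super-docs-evt) within the initialization expressions of the class body.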

syntax

(corpus-mixin+interface [from<%> ...] [to<%> ...]
  interface-decl
  mixin-clause ...+)
 
interface-decl = 
(interface (super<%> ...)
  interface-method-clause ...)
  | 
(interface* (super<%> ...)
            ([prop-expr val-expr] ...)
  interface-method-clause ...)
     
interface-method-clause = method-id
  | [method-id contract-expr]
 
  from<%> : interface?
  to<%> : interface?
  super<%> : interface?
  prop-expr : struct-type-property?
  contract-expr : contract?
Like corpus-mixin, but evaluates to two values: a mixin and an associated interface.

...

Most clients should use the higher-level define-corpus-mixin+interface, rather than using corpus-mixin+interface directly.

syntax

(define-corpus-mixin+interface name-spec
  [from<%> ...] [to<%> ...]
  interface-decl*
  mixin-clause ...+)
 
name-spec = base-id
  | [id-mixin id<%>]
     
interface-decl* = 
(interface (super<%> ...)
  interface-method-clause* ...)
  | 
(interface* (super<%> ...)
            ([prop-expr val-expr] ...)
  interface-method-clause* ...)
     
interface-method-clause* = interface-method-clause
  | ext-method-clause
     
interface-method-clause = method-id
  | [method-id contract-expr]
     
ext-method-clause = [ext-clause-part ...]
     
ext-clause-part = method-definition-form ; required
  | #:contract contract-expr
  | #:proc proc-id
  | with-current-decl
     
method-definition-form = 
(define/method (method-id kw-formal ...)
  body ...+)
     
define/method = define/public
  | define/pubment
  | define/public-final
     
with-current-decl = 
#:with-current with-current-id
#:else [else-body ...+]
  | 
#:with-current/infer
#:else [else-body ...+]
 
  from<%> : interface?
  to<%> : interface?
  super<%> : interface?
  prop-expr : struct-type-property?
  contract-expr : contract?
If no ext-method-clause appears, equivalent to:
(define-values [id-mixin id<%>]
  (corpus-mixin+interface [from<%> ...] [to<%> ...]
    interface-decl*
    mixin-clause ...+))
except that define-corpus-mixin+interface can often produce better inferred value names. If name-spec is given as a single base-id, identifiers are synthesized with the suffixes -mixin and <%> using the lexical context of base-id.
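
A minimal sketch of the form without any ext-method-clause, using the single base-id variant of name-spec. The names doc-count and get-doc-count are hypothetical, and the sketch assumes define/public is accepted as a mixin-clause:

```racket
#lang racket/base
(require racket/class
         racket/set
         ricoeur/tei)

;; Given the base-id doc-count, this should define doc-count-mixin
;; and doc-count<%> (hypothetical names for illustration).
(define-corpus-mixin+interface doc-count
  [] []
  (interface ()
    [get-doc-count (->m exact-nonnegative-integer?)])
  (super-new)
  ;; Keep only the count so the documents can be garbage-collected.
  (define n (length (set->list (super-docs))))
  (define/public (get-doc-count) n))

(define counting-corpus% (doc-count-mixin corpus%))
(send (new counting-corpus%) get-doc-count)
```

Here (super-docs) is evaluated after (super-new), as required, and only the derived count is stored in a field.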

The ext-method-clause variant extends the grammar of interface and interface* to support defining functions related to one of the interface’s methods:
  • ...

interface

corpus<%> : interface?

  implements: corpus%
Equivalent to (class->interface corpus%). Note that corpus% implements lexically-protected methods (see define-local-member-name), so corpus<%> can only be implemented by inheriting from corpus%: corpus<%> is provided primarily as a convenience for writing derived interfaces, mixins, and contracts.