0.5.91

10.2 Search Implementation

 (require (submod ricoeur/term-search/backend/common private))

This section documents the common utilities used to implement ricoeur/term-search’s search feature (which is used through functions like term-search), including everything necessary to implement new kinds of search backends.

The function tei-document->excerpt-max-allow-chars returns the maximum total number of characters allowed in search result excerpts in a document search result value for the TEI document doc. This maximum applies per search query; it is defined internally as a percentage of the total length of doc.

The result of tei-document->excerpt-max-allow-chars is cached to amortize the cost of calling it multiple times on the same TEI document.
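The caching behaviour can be pictured with a minimal sketch. Everything here is an assumption for illustration: fake-doc stands in for a TEI document, assumed-fraction stands in for the internal percentage (whose real value is not documented here), and the weak identity-keyed hash table is only one plausible way to cache per document.

```racket
#lang racket/base

;; Hypothetical stand-in for a TEI document: a struct holding its text.
(struct fake-doc (text))

;; Assumed fraction; the real percentage is internal to the library.
(define assumed-fraction 1/10)

;; Weak, identity-keyed cache: an entry can be reclaimed once its
;; document is no longer reachable.
(define cache (make-weak-hasheq))

;; Counts how often the underlying computation actually runs.
(define compute-count 0)

(define (fake-doc->excerpt-max-allow-chars doc)
  ;; hash-ref! runs the thunk only when doc is not yet in the cache.
  (hash-ref! cache
             doc
             (λ ()
               (set! compute-count (add1 compute-count))
               (floor (* assumed-fraction
                         (string-length (fake-doc-text doc)))))))
```

Calling fake-doc->excerpt-max-allow-chars repeatedly on the same fake-doc value computes the limit once and afterwards only performs a table lookup.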

Internally, a searchable document set is an object (in the sense of racket/class) that satisfies (is-a?/c searchable-document-set<%>).
Adding a new type of search backend typically means defining a new class that implements this interface.

method

(send a-searchable-document-set do-term-search 
  norm-term 
  #:ricoeur-only? ricoeur-only? 
  #:languages languages 
  #:book/article book/article 
  #:exact? exact?) 
  (instance-set/c document-search-results?)
  norm-term : normalized-term?
  ricoeur-only? : any/c
  languages : (set/c language-symbol/c #:cmp 'eq #:kind 'immutable)
  book/article : (or/c 'any 'book 'article)
  exact? : any/c
The method used to implement searchable-document-set-do-term-search.

There are a few notable differences between do-term-search and all of the higher-level search functions or methods:
  1. All of the keyword arguments are mandatory. As there are several different classes that implement searchable-document-set<%>, copying the default values correctly to every definition would be unpleasant and error-prone.

  2. The search term is passed as a normalized term value, rather than as a string satisfying term/c. This prevents do-term-search from being called except through searchable-document-set-do-term-search, so the implementation of searchable-document-set-do-term-search can rely on being able to interpose on every call. The implementation does in fact perform some normalization when constructing a normalized term value, and this design guarantees that it always has the chance to do so.

  3. The languages argument is normalized: rather than being passed as a search-languages/c value, which is designed for the convenience of clients, it is given as an immutable set of language-symbol/c symbols. This allows searchable-document-set-do-term-search to take sole responsibility for handling 'any and lists with duplicate symbols, rather than placing that burden on every class that implements searchable-document-set<%>.
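The normalization of the languages argument described in the last point might look roughly like the following sketch. The helper normalize-languages, the list all-languages, and the exact shapes accepted (the symbol 'any, a single symbol, or a list of symbols) are assumptions made for illustration; they are not the library's actual definitions.

```racket
#lang racket/base
(require racket/set)

;; Assumed universe of language symbols, for illustration only.
(define all-languages '(en fr de))

;; Hypothetical helper: accept 'any, one symbol, or a list of symbols
;; (possibly containing duplicates) and return an immutable seteq.
(define (normalize-languages langs)
  (cond
    [(eq? langs 'any) (list->seteq all-languages)]
    [(symbol? langs) (seteq langs)]
    ;; Building a set drops duplicate symbols automatically.
    [else (list->seteq langs)]))
```

With the normalization done once at the boundary, every implementing class can treat the argument as a plain set and use operations such as set-member? without special cases.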

procedure

(normalized-term? v)  any/c

  v : any/c

procedure

(normalized-term-string norm-term)

  (and/c term/c trimmed-string-px)
  norm-term : normalized-term?

procedure

(pregexp-quote-normalized-term norm-term 
  #:exact? exact?) 
  string-immutable/c
  norm-term : normalized-term?
  exact? : any/c
A normalized term value, recognized by the predicate normalized-term?, is used by searchable-document-set-do-term-search to wrap the search term it passes to the do-term-search method of searchable-document-set<%>. The function normalized-term-string extracts the string from a normalized term.

The function pregexp-quote-normalized-term produces a string suitable to be passed to pregexp to construct a regular expression recognizing the encapsulated term. (Some backend implementations combine the resulting string with additional regular expression syntax.) When exact? is non-false, the resulting string will produce a regular expression that matches only exact occurrences of the term delimited by word boundaries. (The precise definition of a word boundary is unspecified and specific to pregexp-quote-normalized-term.)

Because the constructor for normalized term values is not exported, the wrapper can serve as a guarantee of some invariants: for example, that the argument to pregexp-quote-normalized-term will always have been normalized. This is particularly important as certain properties of search strings can have security implications, especially with less sophisticated backends.
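As a rough sketch of the pattern, the following stand-alone code wraps a string in an opaque struct and quotes it for use with pregexp. All names here (norm-term, string->norm-term, quote-norm-term) are hypothetical, the whitespace normalization is only an example of an invariant, and using \b as the delimiter for exact matches is an assumption, since the real definition of a word boundary is left unspecified.

```racket
#lang racket/base
(require racket/string)

;; Hypothetical stand-in for the wrapper. In a real module only the
;; predicate and accessor would be provided, never the constructor.
(struct norm-term (string))

;; Hypothetical normalization: trim and collapse runs of whitespace.
(define (string->norm-term str)
  (norm-term (string->immutable-string (string-normalize-spaces str))))

;; Quote the wrapped string so regexp metacharacters match literally.
;; The \b delimiters for exact matches are an assumption of this sketch.
(define (quote-norm-term t #:exact? exact?)
  (define quoted (regexp-quote (norm-term-string t)))
  (string->immutable-string
   (if exact?
       (string-append "\\b" quoted "\\b")
       quoted)))
```

Because callers outside the defining module could not build a norm-term directly, quote-norm-term could assume its input had already passed through string->norm-term.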

10.2.1 Constructing Search Results

procedure

(segment-make-search-results seg excerpts)

  (listof search-result?)
  seg : segment?
  excerpts : 
(listof (maybe/c (and/c string-immutable/c
                        #px"[^\\s]")))
Returns a search result value for each element of the excerpts list. The excerpts should be given in the order in which they occur within the segment.

procedure

(search-result-nullify-excerpt result)  search-result?

  result : search-result?
Returns a search result like result, but which will return (nothing) as its search-result-excerpt.

procedure

(make-document-search-results info results)

  document-search-results?
  info : instance-info?
  results : (non-empty-listof search-result?)
Constructs a document search results value encapsulating the results.

All of the results must be from the same TEI document and must be consistent with the instance info value info. Otherwise, an exception is raised.

10.2.2 Implementing Search Backend Types

signature

search^ : signature

Adding support for a new type of search backend means defining a unit that exports search^. The units for each basic type of search backend are then knit together using define-compound-search-unit and define-lazy-search-unit to create a composite unit which binds search-backend/c and initialize-search-backend via define-values/invoke-unit/infer.

A search^ unit should define search-backend/c as a contract recognizing the new type of search backend value it wants to support.

A search^ unit’s search backend implementation need only provide a basic contract and initialize it eagerly in initialize-search-backend. The additional variants permitted by the final, public search-backend/c (see lazy+eager-search-backend/c) are added using define-lazy-search-unit.
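The unit mechanics can be illustrated with a self-contained sketch. The signature demo-search^ is a stand-in for search^ (it omits anything beyond the two members named above), demo@ is a hypothetical backend unit, and a plain list stands in for the instance set of TEI documents.

```racket
#lang racket/base
(require racket/unit)

;; Stand-in for search^, assumed here to have just these two members.
(define-signature demo-search^
  (search-backend/c
   initialize-search-backend))

;; Hypothetical backend unit: its backend value is the symbol 'demo,
;; which also serves as a contract accepting exactly that symbol.
(define-unit demo@
  (import)
  (export demo-search^)
  (define search-backend/c 'demo)
  (define (initialize-search-backend backend docs)
    ;; A real unit would return a searchable document set object.
    (list 'demo-document-set (length docs))))

;; Invoke the unit and bind its exports as ordinary definitions.
(define-values/invoke-unit/infer demo@)
```

After the last form, search-backend/c and initialize-search-backend are available as ordinary bindings in the enclosing module.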

procedure

(initialize-search-backend backend docs)

  searchable-document-set?
  backend : search-backend/c
  docs : (instance-set/c tei-document?)
The search^ unit’s initialize-search-backend will be called with a backend search backend value satisfying the unit’s specific definition of search-backend/c. The unit’s implementation of initialize-search-backend is responsible for returning a searchable document set: that is, an instance of a class that implements searchable-document-set<%>.

Typically, initialize-search-backend will be a wrapper around a constructor for a unit-specific searchable-document-set<%> class, and the unit’s notion of a search-backend/c value will encapsulate all of the other data needed to initialize the class.

However, this is not mandatory. The implementation of initialize-search-backend from noop@, for example, ignores its arguments and always returns the singleton object noop-searchable-document-set.
The search^ signature uses define-values-for-export to define initialize-search-backend/c as the contract for that unit’s implementation of initialize-search-backend.
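The typical arrangement, in which initialize-search-backend wraps a class constructor, might be sketched as follows. Nothing here is the library's code: demo-document-set<%> stands in for searchable-document-set<%>, demo-backend is a hypothetical backend value, documents are plain strings, and the search term is a plain string rather than a normalized term value.

```racket
#lang racket/base
(require racket/class racket/string)

;; Stand-in for searchable-document-set<%>.
(define demo-document-set<%>
  (interface () do-term-search))

;; Hypothetical backend value carrying the extra data the class needs.
(struct demo-backend (case-sensitive?))

(define demo-document-set%
  (class* object% (demo-document-set<%>)
    (super-new)
    (init-field docs case-sensitive?)
    ;; All keyword arguments are mandatory, mirroring do-term-search.
    (define/public (do-term-search term
                                   #:ricoeur-only? ricoeur-only?
                                   #:languages languages
                                   #:book/article book/article
                                   #:exact? exact?)
      ;; Toy behaviour: return the documents that contain the term.
      (define (fold s)
        (if case-sensitive? s (string-foldcase s)))
      (filter (λ (d) (string-contains? (fold d) (fold term)))
              docs))))

;; The wrapper unpacks the backend value into constructor arguments.
(define (initialize-demo-backend backend docs)
  (new demo-document-set%
       [docs docs]
       [case-sensitive? (demo-backend-case-sensitive? backend)]))
```

The toy method ignores its keyword arguments; a real backend would use them to filter the documents it searches.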

syntax

(define-compound-search-unit compound-search-unit-id
  member-search-unit-id ...+)
Defines compound-search-unit-id as a unit exporting the signature search^.

The new unit’s implementation of search-backend/c applies or/c to the implementations from each of the member-search-unit-id units. Likewise, the new unit’s implementation of initialize-search-backend inspects the given search backend value and dispatches to the implementation of initialize-search-backend from the corresponding member-search-unit-id unit.
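Outside of the unit machinery, the combination described above amounts to something like the following sketch. The two member backends and all names are hypothetical, and dispatching with contract-first-order-passes? is an assumption about how the backend value could be inspected, not a description of what the macro actually generates.

```racket
#lang racket/base
(require racket/contract)

;; Two hypothetical member backends, each with a contract and an
;; initializer. Lists stand in for instance sets of documents.
(define a-backend/c 'noop)
(define (initialize-a backend docs) 'noop-set)

(define b-backend/c string?)
(define (initialize-b backend docs)
  (list 'regexp-set backend (length docs)))

;; Compound contract: accepts whatever any member accepts.
(define compound-backend/c (or/c a-backend/c b-backend/c))

;; Dispatch to the first member whose contract accepts the value.
(define (initialize-compound backend docs)
  (cond
    [(contract-first-order-passes? a-backend/c backend)
     (initialize-a backend docs)]
    [(contract-first-order-passes? b-backend/c backend)
     (initialize-b backend docs)]
    [else
     (raise-argument-error 'initialize-compound
                           "compound-backend/c"
                           backend)]))
```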

syntax

(define-lazy-search-unit lazy-search-unit-id
  eager-search-unit-id)
Defines lazy-search-unit-id as a unit exporting the signature search^. The new lazy-search-unit-id will define search-backend/c as (lazy+eager-search-backend/c base/c), where base/c is the search-backend/c implemented by eager-search-unit-id.

In the 'eager case, lazy-search-unit-id will simply dispatch to eager-search-unit-id’s implementation of initialize-search-backend. Otherwise, lazy-search-unit-id will return a proxy searchable document set which calls eager-search-unit-id’s initialize-search-backend in a background thread.
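One way to picture the proxy is the sketch below. It is an illustration under stated assumptions rather than the library's implementation: eager-set% and lazy-proxy-set% are hypothetical classes, the search method takes only a term, and blocking with thread-wait until initialization finishes is just one possible strategy.

```racket
#lang racket/base
(require racket/class)

;; Hypothetical eager document set with a toy search method.
(define eager-set%
  (class object%
    (super-new)
    (init-field docs)
    (define/public (do-term-search term)
      (filter (λ (d) (regexp-match? (regexp-quote term) d))
              docs))))

;; Proxy: starts initialization in a background thread when created
;; and forwards searches to the real set once it is available.
(define lazy-proxy-set%
  (class object%
    (super-new)
    ;; initialize is a thunk that builds the eager set.
    (init-field initialize)
    (define real-set #f)
    (define worker
      (thread (λ () (set! real-set (initialize)))))
    (define/public (do-term-search term)
      ;; Returns immediately if the worker has already finished.
      (thread-wait worker)
      (send real-set do-term-search term))))
```

Creating the proxy returns right away, and the cost of initialization is paid in the background. This sketch does not handle an exception raised by the initialization thunk, which a real implementation would need to consider.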

10.2.2.1 Basic search^ Units

value

noop@ : 
(unit/c (import)
        (export search^))

value

regexp@ : 
(unit/c (import)
        (export search^))

value

postgresql@ : 
(unit/c (import)
        (export search^))