6 Explanation

Discursive background — why Stone is shaped the way it is, what trade-offs the design makes, where constraints come from. Read these when you want to understand a design choice or when a how-to links here for the why.

6.1 Ashlars

In Stone, everything is an ashlar.

That sentence is the whole design. A Stone pipeline has exactly one kind of moving part, and every part — the deterministic glue, the LLM call, the multi-turn agent, the prompt that waits on a human, the parser, the gate that approves a change — is the same kind of thing. There is one layer. There are ashlars, and there is the DAG they share.

This document explains what an ashlar is, why the framework is built around that one unit, and what the shape buys for pipelines that mix language-model work with ordinary code.

6.1.1 The atomic unit

Informally, an ashlar is a self-contained piece of work. It reads whatever it needs from a growing DAG of prior results, does something, and produces a new node. It does not take arguments from its callers. It does not return values to them. Its only input is the DAG, and its only output is the node it appends.

Formally, an ashlar is a function (DAG -> DAG) wrapped in an ashlar-meta struct that carries the metadata that composition and validation need. The full field list is in the reference for stone/edge; the one detail worth surfacing here is the line that makes composition possible:

#:property prop:procedure 0

An ashlar-meta is callable. When you apply one to a DAG, it runs the wrapped function and hands back a DAG. The metadata stays visible to anything that wants to inspect the value — sequence primitives, validators, the logging machinery — while the ashlar behaves like a plain procedure to whoever is running it.

That wrapping is the reason composition can be more than function application. A sequence of ashlars can be statically checked for a missing query before it ever runs, because the queries are visible on the struct. A loop can name its body in logs because the body carries a name. A validator can walk the children of a composite ashlar and reason about the whole topology, dispatching through each ashlar’s own walk rule to ask how it wants its children read. Metadata that would be lost to a bare lambda is preserved here, and every primitive in the framework assumes it’s there.
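
The callable-struct idea can be tried in isolation. The following is a minimal sketch in plain Racket, not Stone's actual definition: toy-ashlar, its field layout, and the list-based stand-in for a DAG are all assumptions made for illustration. Only the `#:property prop:procedure 0` line is taken from the text above.

```racket
#lang racket/base

;; Toy stand-in for ashlar-meta; NOT Stone's real struct definition.
;; Field 0 holds the wrapped (DAG -> DAG) procedure, and
;; `#:property prop:procedure 0` tells Racket to call that field
;; whenever an instance is applied.
(struct toy-ashlar (proc name queries produces)
  #:property prop:procedure 0)

;; In this sketch a "DAG" is only a list of (type . content) pairs,
;; newest first. Stone's real DAG is a richer structure.
(define analyzer
  (toy-ashlar (lambda (dag)
                (cons (cons 'behaviors '("login" "logout")) dag))
              'analyzer
              '(project-config)
              'behaviors))

(define dag0 (list (cons 'project-config "stone.toml")))

;; The struct is applied like a plain procedure...
(define dag1 (analyzer dag0))

;; ...while the metadata stays inspectable on the value itself.
(define analyzer-queries (toy-ashlar-queries analyzer))
```

A bare lambda would run just as well, but nothing outside it could read analyzer-queries without calling it, which is the difference the paragraph above describes.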

6.1.2 The DAG as shared state

Ashlars don’t pass arguments to each other. A sequence of ashlars is not a pipeline of functions feeding each other’s return values. It’s a pipeline of functions taking turns writing into a shared, append-only DAG — the single source of truth for the run.

Every node in the DAG has a node type (a symbol). That symbol is how ashlars find each other’s work. An ashlar declares, as part of its metadata, the node type it produces and the node types it queries:

(make-ashlar analyze-requirements
  #:queries  '(project-config)
  #:produces 'behaviors
  #:name     'analyzer)

At run time, the ashlar reaches into the DAG with dag-nearest-ancestor or dag-query-all, pulls the singleton on its current lineage (or every node) of the type it needs, and returns a typed node tagged with the type it promised to produce. The runtime appends that node, heads advance, the next ashlar sees it. Ashlars typically build their output with typed-node, a convenience that defaults the parents to (dag-heads dag) — the DAG frontier the ashlar was handed — so the common case reads as (typed-node dag 'my-type content) without any make-typed-node boilerplate.

Nothing about this coupling is nominal. Ashlars never import each other, never call each other, never know each other’s names. They agree on node types, and the DAG mediates. The effect is that pipelines are recombinable in ways that argument threading would forbid: any ashlar that produces 'behaviors can feed any ashlar that queries 'behaviors, and swapping implementations is a matter of replacing one producer with another.

The cost is that coordination lives in the type vocabulary. Two ashlars that mean different things by 'config will quietly disagree. The payoff is that with a disciplined vocabulary, every ashlar is independently testable: hand it a DAG with the right nodes, and it runs.

6.1.3 Why everything is an ashlar

Once the atomic unit exists, the temptation is to reserve it for the interesting work — the LLM calls, the agent loops — and fall back to ordinary Racket for the boring parts. Stone doesn’t do that. Parsing a file, copying a value, asking a human a question, running a build step, gating on an approval: all of it is expressed as an ashlar.

The reason is uniformity, in three registers.

Uniform composition. Every ashlar can be sequenced with ~>, wrapped in an ashlar-loop, fanned out with ashlar-map or ashlar-parallel, or branched on with ashlar-match. There is no second calling convention for "the pieces that aren’t ashlars yet." A deterministic parser and a six-turn tool-using agent slot into the same sequence with the same primitive.

Uniform observation. Every ashlar emits the same lifecycle events — 'ashlar-start, 'ashlar-end, and the typed events its own implementation raises. Tracing a run means reading one stream with one vocabulary.

Uniform validation. Because the metadata is present on every ashlar, a pipeline can be walked before it runs: every queries declaration can be checked against the set of available produces declarations upstream, and a pipeline with a broken coupling can be refused without a single LLM call.
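
The pre-run check described under uniform validation can be sketched without any of Stone's machinery. This is a simplified illustration for a flat sequence only; toy-meta and unsatisfied-queries are hypothetical names, and Stone's real validator dispatches through walk rules rather than scanning a list.

```racket
#lang racket/base

;; Hypothetical metadata record: just the fields the check needs.
(struct toy-meta (name queries produces))

;; Walk the steps in order. A query is satisfied only if some EARLIER
;; step (or the initial `available` list) produces that node type.
;; Returns a list of (step-name . missing-type) pairs.
(define (unsatisfied-queries steps [available '()])
  (let loop ([steps steps] [available available] [missing '()])
    (cond
      [(null? steps) (reverse missing)]
      [else
       (define s (car steps))
       (define gaps
         (for/list ([q (in-list (toy-meta-queries s))]
                    #:unless (memq q available))
           (cons (toy-meta-name s) q)))
       (loop (cdr steps)
             (cons (toy-meta-produces s) available)
             (append (reverse gaps) missing))])))

(define pipeline
  (list (toy-meta 'loader   '()                        'project-config)
        (toy-meta 'analyzer '(project-config)          'behaviors)
        (toy-meta 'tester   '(behaviors build-output)  'test-report)))

;; No step has run, yet the broken coupling is already visible.
(define gaps (unsatisfied-queries pipeline))
```

Here gaps is '((tester . build-output)): the tester asks for a node type nothing upstream promises to produce, and that is known before any step executes.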

6.1.4 The constructors

There are two user-facing constructors, both built on a single internal factory.

make-ashlar wraps a (DAG -> node) function. It’s the ergonomic form for atomic work: the user writes "given this DAG, here is the node I want to add," and the adapter appends that node to the DAG on the user’s behalf. Use it for deterministic work — reading configuration, parsing files, running subprocesses, producing failure nodes when preconditions fail. Anything that can look at a DAG and return a single node is a candidate.

make-agent-ashlar runs a multi-turn agent loop as a single ashlar. It is itself built on make-ashlar: the loop, the middleware onion, and the final node construction all happen inside one (DAG -> node) body, so the result composes like any other atomic ashlar. It accepts a caller, middleware, and a decision function, drives the conversation up to #:max-turns, and produces a single node at the end. For single-shot prompts, pass #:max-turns 1 with an empty middleware list and let the default continue-on-tool-use run one pass.

Underneath both sits make-scoped-ashlar, the single factory that manufactures every ashlar in the framework — atomic ashlars and composites alike. It takes a (DAG -> DAG) body plus a walk rule and returns an ashlar-meta. The composition primitives (~>, ashlar-loop, ashlar-match, ashlar-map, ashlar-parallel, ashlar-reduce) each call it with their own walk rule, which is what lets the validator classify a composite by walking the tree and asking each node how it wants to be read rather than by reading a tag. Most users never call make-scoped-ashlar directly — it’s there for building genuinely new composition primitives or for embedding an ashlar in an unusual scoping arrangement.

All three return an ashlar-meta. All three are ashlars. The distinction is how they do their work, not how they compose.

6.1.5 Failure as a value

When an ashlar can’t complete its job, it doesn’t raise. It returns a failure node, built by make-failure-node, which is a typed node with node-type equal to 'failure and a payload carrying kind and reason. That node becomes the latest head of the DAG the ashlar returns, and from outside the ashlar, failure is simply a DAG state: (dag-failed? dag) is true when the most recent head is a failure node.

The composition primitives short-circuit on that state. ~> checks dag-failed? after each child and stops at the first failure, threading the failing DAG outward. ashlar-loop exits. ashlar-map drops the failed lane from its result set. The contract is uniform because the detection site is uniform: every primitive invokes a child as DAG -> DAG and asks a single question of the result.

Authors using make-ashlar still return a node; the adapter appends it and the resulting DAG carries the failure through. make-agent-ashlar produces a failure node when the loop exhausts its turn budget, when the adversary rejects and healing runs out, or when a middleware refuses. Either way, the shape the outside sees is a DAG whose latest head is a failure.

This isn’t exception handling dressed up as values — it’s in-band control flow. A failure node is a real node in the DAG, with parents pointing to the context where the failure happened, and everything downstream can inspect it, log it, branch on it, or try to heal from it. make-agent-ashlar’s #:adversary and #:heal-with pair is exactly this pattern at the conversation level: the adversary inspects the agent’s draft and returns either a pass or a failure node, and on rejection the healer feeds extra context into the agent’s next turn. Between-iteration repair at the pipeline level is folded into the body of an ashlar-loop (a sequence or match inside the loop) so the repair ashlar is an ordinary sibling, not a special slot.
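
The short-circuit contract is small enough to model directly. The sketch below is an assumption-laden toy, not Stone's ~>: a DAG is a list of (type . content) nodes with the newest first, and toy-seq, toy-failure, and toy-failed? are hypothetical stand-ins. It follows this section's description, in which the failure node is the latest head of the returned DAG.

```racket
#lang racket/base

;; Toy model: a DAG is a list of (type . content) nodes, newest first.
(define (toy-failure kind reason)
  (cons 'failure (hasheq 'kind kind 'reason reason)))

;; The single question every primitive asks of a child's result.
(define (toy-failed? dag)
  (and (pair? dag) (eq? (car (car dag)) 'failure)))

;; Run steps in order, stopping at the first DAG whose head is a failure.
(define (toy-seq . steps)
  (lambda (dag)
    (let loop ([dag dag] [steps steps])
      (if (or (null? steps) (toy-failed? dag))
          dag
          (loop ((car steps) dag) (cdr steps))))))

;; Record which steps actually ran, to make the short-circuit visible.
(define ran '())
(define (step name make-node)
  (lambda (dag)
    (set! ran (cons name ran))
    (cons (make-node dag) dag)))

(define pipeline
  (toy-seq (step 'load   (lambda (dag) (cons 'project-config "ok")))
           (step 'build  (lambda (dag) (toy-failure 'build-error "exit code 1")))
           (step 'report (lambda (dag) (cons 'summary "never built")))))

(define result (pipeline '()))
```

After the run, (reverse ran) is '(load build): the report step never executes, no exception is raised, and the failure is an ordinary node that a caller can inspect at the head of result.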

6.1.6 What this shape buys

Pull the threads together. The unit is an ashlar. The medium is a typed DAG. The coupling is by node type, not by function reference. Composition primitives work on every ashlar uniformly. Metadata lets the framework reason about a pipeline before running it. Failure is a value the primitives understand.

For pipelines that mix language-model work with ordinary code, this matters because the expensive, uncertain parts — the LLM calls, the agent loops — need to be embedded in a scaffold that can be inspected, retried, validated, and traced. Making the scaffold and the LLM step the same kind of object means the scaffold never has to grow a second set of rules to accommodate the non-deterministic parts. There is one layer, and the hard cases live in it.

6.2 The DAG as Pipeline State

The DAG is the pipeline’s memory.

Every ashlar in a running Stone pipeline reads from one DAG and writes to the same DAG. There is no other channel. No global variables thread between ashlars, no return values flow from one call to the next, no hidden context follows the work around. When an ashlar needs information, it queries the DAG. When an ashlar produces a result, it appends a node. The DAG is the single source of truth for the run.

This doc explains what kind of object that DAG is, how ashlars find each other’s work inside it, and why the framework’s coordination model is a typed, append-only graph instead of something simpler.

6.2.1 Why a DAG instead of a pipeline of arguments

The simplest alternative is the one most function pipelines reach for: each stage takes the previous stage’s output as an argument, and composition is function composition. That shape is fine when the work is a straight line and every stage consumes exactly what the previous stage produced.

Stone pipelines are rarely that shape. A later ashlar often needs a result from three ashlars back, not one. A branch forks, and the two halves still want to see the configuration loaded before the fork. A loop produces several candidate outputs, and a downstream reducer wants every one. A human-approval ashlar needs to display the whole context the pipeline has built so far.

All of that is awkward with argument threading — the call tree becomes load-bearing and ashlars grow long argument lists just to relay data they don’t use. A shared DAG removes the problem. Any ashlar can reach backward for any node it needs. The coupling between ashlars is not "who called whom with what" but "who produced what type, and who queries for that type." Ashlars become independently testable, recombinable, and trivially swappable as long as they agree on their type vocabulary.

6.2.2 Nodes are typed

Every node in the DAG carries a node type symbol. That symbol is the primary coordination vocabulary — the public interface between ashlars.

(make-typed-node (dag-heads dag) 'requirement
                 (hasheq 'text "A Fibonacci class with memoization"))

An ashlar that produces requirements declares #:produces 'requirement in its metadata and, at the end of its work, hands back a node tagged with that type. An ashlar that wants to read requirements declares #:queries '(requirement) and calls dag-nearest-ancestor on the type. Nothing else in the framework needs to know either ashlar’s name.

The content of a typed node is typically a hash — a hasheq carrying the fields the consumer will want. The shape of that hash is a contract between producer and consumer, but it’s not enforced by the DAG itself; the DAG only cares about the type symbol. Keeping type symbols meaningful is the discipline that holds a pipeline together: 'test-proposal is a better type than 'output, because the next ashlar’s query reads like a question the designer actually asked.

make-typed-node is the explicit constructor; the usual form is typed-node, which defaults the parents to the DAG’s current heads. Day-to-day ashlar authoring uses typed-node.

6.2.3 Querying by type, answered structurally

The two normal reads from the DAG are dag-nearest-ancestor and dag-query-all. Both are type-based — neither asks the caller to know which specific ashlar produced which node. They differ in how they navigate the graph to answer the question.

dag-nearest-ancestor walks first-parent pointers from the DAG’s most recently appended head until it finds a node of the requested type. The answer is "the node of this type on my current execution lineage." For sequential pipelines that’s the most recent producer’s node, same as a timestamp scan would give. For branched work the structural answer is the one you want. Two lanes of an ashlar-map or ashlar-parallel produce sibling nodes that don’t share a first-parent line; the walk stays on the current lane and doesn’t cross into a sibling’s output. Loop iterations accumulate multiple nodes of the same type as peers in the outer DAG, and the walk follows the lineage through this iteration’s node back toward the common ancestor, not to whichever peer happens to have the latest timestamp.

dag-query-all returns every node of a type, sorted oldest-first by timestamp. It’s the right tool for reductions, for counting, for collecting sibling-lane outputs after a fan-out, and for any situation where "the full set" is the question.

Prefer dag-nearest-ancestor for singleton lookups; prefer dag-query-all for set-shaped questions. The two cover the reads framework callers actually perform.
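
The difference between the two reads shows up as soon as the graph branches. The sketch below models it with a hypothetical toy-node struct and a hash from id to node; nearest-ancestor and query-all here are illustrative re-implementations, not Stone's functions, and unlike dag-nearest-ancestor the toy walk takes its starting node explicitly instead of using the DAG's latest head.

```racket
#lang racket/base

;; Hypothetical node record; `parents` lists parent ids, first parent first.
(struct toy-node (id type parents content stamp))

;; Follow first-parent pointers from `start-id` until a node of `type`
;; is found. Sibling lanes are never visited.
(define (nearest-ancestor nodes start-id type)
  (let walk ([id start-id])
    (define n (hash-ref nodes id #f))
    (cond
      [(not n) #f]
      [(eq? (toy-node-type n) type) n]
      [(null? (toy-node-parents n)) #f]
      [else (walk (car (toy-node-parents n)))])))

;; Every node of `type`, oldest first by timestamp.
(define (query-all nodes type)
  (sort (for/list ([n (in-hash-values nodes)]
                   #:when (eq? (toy-node-type n) type))
          n)
        <
        #:key toy-node-stamp))

;; cfg fans out into two lanes; lane B finishes later than lane A.
(define nodes
  (for/hasheq ([n (list (toy-node 'cfg 'project-config '()    "cfg"     1)
                        (toy-node 'a1  'result         '(cfg) "lane A"  2)
                        (toy-node 'b1  'result         '(cfg) "lane B"  3)
                        (toy-node 'a2  'note           '(a1)  "after A" 4))])
    (values (toy-node-id n) n)))

(define on-lineage (toy-node-content (nearest-ancestor nodes 'a2 'result)))
(define full-set   (map toy-node-content (query-all nodes 'result)))
```

From a2, the structural walk answers "lane A" even though lane B's result carries a later timestamp, while the set-shaped read returns both results in timestamp order.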

Composites pass through the outer frontier transparently. Every composition primitive (ashlar-loop, ashlar-match, ashlar-map, ashlar-parallel, ashlar-reduce, and the sequence form ~>) runs its body inside a labelled sub-DAG that is seeded with the outer scope’s heads and nodes. A body ashlar calling dag-nearest-ancestor sees the same frontier the surrounding pipeline does from its first instruction; dag-query-all returns every visible node regardless of which scope produced it. The historical #:scope argument on dag-query-all is retained as a no-op for source compatibility — there is only one search space now.

6.2.3.1 Projection helpers

Once you have the node, three helpers keep the read concise. node-get pulls one field out of a node’s content hash with a default for the missing-node and missing-key cases. node-get* walks a nested path without hash-ref chaining. node-text extracts the text representation — a raw string, a hash’s 'text field, or a formatted fallback — from LLM-ashlar outputs whose content may be either shape.

(define req (dag-nearest-ancestor dag 'requirement))
(define text (node-get req 'text ""))

(define cfg (dag-nearest-ancestor dag 'project-config))
(define region (node-get* cfg 'deployment 'region))

(define reply (dag-nearest-ancestor dag 'summary))
(displayln (node-text reply))

The helpers are shape-safe: node-get on a missing cfg returns the default, node-get* short-circuits to #f at any level that isn’t a hash, and node-text returns "" when the node is absent. That means your ashlar body doesn’t need to guard the dag-nearest-ancestor call separately — you can treat "no node of this type upstream" and "no field in the content" the same way through a default.
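
To make "shape-safe" concrete, here is one way such helpers could behave. This is a sketch under assumptions: toy-node-get and toy-node-get* are hypothetical re-implementations over a (type . content) pair, with #f standing for a missing node. Stone's own helpers may differ in detail.

```racket
#lang racket/base

;; Toy node: (type . content). A missing node is represented by #f.
(define (toy-node-get node key [default #f])
  (define content (and (pair? node) (cdr node)))
  (if (hash? content)
      (hash-ref content key default)
      default))

;; Walk a nested path, yielding #f as soon as a level is not a hash.
(define (toy-node-get* node . path)
  (for/fold ([value (and (pair? node) (cdr node))])
            ([key (in-list path)])
    (and (hash? value) (hash-ref value key #f))))

(define cfg
  (cons 'project-config
        (hasheq 'deployment (hasheq 'region "eu-west-1"))))

(define region  (toy-node-get* cfg 'deployment 'region))     ; found
(define missing (toy-node-get* cfg 'deployment 'zone 'name)) ; path runs out
(define text    (toy-node-get #f 'text ""))                  ; no node at all
```

The three reads cover the three cases: a present field, a path that stops at a missing key, and an absent node. None of them needs a guard around the lookup that produced the node.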

6.2.4 Content-addressed and append-only

A node’s identity is a deterministic hash of its structural ingredients — parents, type, and content. Two nodes with the same parents, type, and content have the same ID. Identical deterministic work produces identical identity, which means the second append of the same result is a no-op at the level of identity.

The DAG is append-only. Once a node is in it, it’s never mutated and never removed. dag-append adds a node and rewrites the head set: the incoming node’s parents drop out of the heads and the new node becomes one. Nothing else changes. No ashlar can overwrite another ashlar’s work, because there is no overwrite operation.

These two properties compound. Content-addressing means DAGs can be compared structurally, merged without identity collisions, and reproduced from their node set alone. Append-only means a pipeline run leaves behind a complete record — every result produced, every failure encountered, each with a stable identity and a clear place in the graph. A run is not just a cause of effects; it’s an artifact.
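
Both properties fit in a few lines of plain Racket. The sketch below is illustrative only: toy-dag, toy-node-id, and toy-append are hypothetical, SHA-256 over the printed form is a choice made for this example (the text does not say which hash or serialization Stone uses), and it assumes content with a stable printed form such as strings, symbols, numbers, and lists.

```racket
#lang racket/base
(require file/sha1) ; for bytes->hex-string

;; Identity is a hash of the structural ingredients: parents, type, content.
;; ASSUMPTION: serialization via `write` syntax and SHA-256 are choices
;; made for this sketch, not documented Stone behavior.
(define (toy-node-id parents type content)
  (bytes->hex-string
   (sha256-bytes
    (string->bytes/utf-8 (format "~s" (list parents type content))))))

;; nodes: immutable hash from id to (list parents type content)
;; heads: list of ids that have no children yet
(struct toy-dag (nodes heads))
(define empty-dag (toy-dag (hash) '()))

;; Append-only: returns a new DAG, never mutates or removes a node.
;; The new node's parents leave the head set and the node joins it.
(define (toy-append d parents type content)
  (define id (toy-node-id parents type content))
  (if (hash-has-key? (toy-dag-nodes d) id)
      d ; same parents, type, and content: same identity, nothing to add
      (toy-dag (hash-set (toy-dag-nodes d) id (list parents type content))
               (cons id (remove* parents (toy-dag-heads d))))))

(define d1 (toy-append empty-dag '() 'project-config "stone.toml"))
(define d2 (toy-append d1 (toy-dag-heads d1) 'behaviors '("login")))
;; Repeating the identical append changes nothing.
(define d3 (toy-append d2 (toy-dag-heads d1) 'behaviors '("login")))
```

In this toy, d3 is the very same value as d2, and d1 is still intact after both appends, which is the sense in which a run leaves a complete record behind.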

6.2.5 Branching and merging

Fan-out in Stone happens through ashlar-map and ashlar-parallel. Both take the DAG at the fan-out point and hand each lane a snapshot of it. Lanes don’t see each other’s work: a behavior being implemented in lane three can’t query a half-finished behavior from lane seven. After every lane has run, the surviving result nodes are appended to the outer DAG.

The DAG’s structure naturally represents that topology. A branch point is a node with multiple children — sibling lanes descending from the same parent. A merge point is a node with multiple parents — a downstream node whose parents field lists every lane’s head. No special primitives, no special node shapes; just the ordinary graph expressing the ordinary fact that parallel work converges.

For cases where an ashlar needs to see a coherent linear history through a branched graph — an LLM ashlar that wants "the conversation so far" as a single thread, say — there’s dag-select. It walks first-parent pointers from a starting node back to the root and returns the resulting list oldest-first. The walk ignores sibling branches; it picks one thread and follows it. That’s enough to reconstruct a linear view even when the underlying DAG has fanned out and merged several times.

6.2.6 Failure nodes

When an ashlar can’t complete its job, it returns a failure node: a typed node whose type is 'failure and whose content is a small hash with a kind symbol and a reason string. Failure is a value, not an exception — the composition primitives recognize the shape and honor it, and downstream ashlars can inspect, branch on, or heal from it exactly as they would any other node.

One subtlety is worth spelling out. An atomic ashlar built with make-ashlar is wrapped by the framework so that non-failure nodes are appended to the DAG and failure nodes are not. A failing atomic ashlar returns the failure node paired with the unchanged DAG: the failure is in flight, but not yet part of the graph’s history. This is consistent with treating failure as a value: an ashlar that didn’t succeed doesn’t record its partial work. Composition primitives that want to record a failure can do so explicitly (a healer running inside an ashlar-loop, for instance, sees the failing node as a value and can decide what to do with it), but the default posture is that failures flow through the interfaces without polluting the DAG.

6.2.7 What this shape buys

Pull the threads together. A typed, content-addressed, append-only DAG gives ashlars a way to coordinate through types instead of function references, a way to reach backward without threading arguments, a structural record that can be traced and replayed, and natural branching and merging without special runtime machinery.

For pipelines that mix language-model work with ordinary code — where the expensive, unpredictable parts need to be embedded in a scaffold that can be inspected, retried, and validated — this shape means every run leaves behind a complete, content-addressed audit trail: every decision, every output, every failure is a node with a stable identity and a clear place in the graph.

The DAG is not a side-effect of running the pipeline. It is what the pipeline is for.

6.3 Edge Primitives

An ashlar is the unit. An edge primitive is how two or more ashlars become one.

Stone’s composition vocabulary is a small, deliberate set: ~>, ashlar-loop, ashlar-match, ashlar-map, ashlar-parallel, ashlar-reduce, and, at the edge of the vocabulary, make-ask-human. Each composition primitive takes ashlars and returns an ashlar. That closure under composition is why a pipeline can be deeply nested and still be one layer of the framework.

This doc is about shapes, not steps. It assumes you have read Ashlars.

6.3.1 Why primitives at all

Racket has lambdas and higher-order functions. Nothing about Stone’s runtime strictly requires a primitive for sequence, or for loop, or for match. So why the vocabulary?

Because ashlars carry metadata, and the framework earns its keep by keeping that metadata visible all the way up. Every composition primitive produces an ashlar-meta with its children enumerated, its aggregate produces computed, its external queries computed (queries not satisfied by earlier siblings), and a walk rule: a function the validator calls to read the children in a shape that matches what happens at runtime. Sequences expose their children in order; match exposes them as sibling branches; loops expose their body as a self-referential reader-and-writer of its own produces; reducers declare that they close a fan-out. The walk rule is not a tag the validator switches on. It is the rule itself, and the validator classifies a composite by whose walk rule is attached. A bare lambda would erase all of that. Primitives are the price Stone pays for refusing a broken pipeline without calling any LLM, and for emitting one vocabulary of events across every kind of composite.

6.3.2 ~> — the sequence

Sequence is the default. It’s what you reach for when the first ashlar’s work should feed the next, with the coupling mediated by node types on the DAG rather than by argument passing.

(~> load-config
    analyze-requirements
    generate-tests
    run-build)

~> runs each ashlar against the DAG threaded from the previous one. A non-failure node has already been appended by make-ashlar’s wrapper, so the next ashlar sees the grown DAG. A failure stops the sequence immediately. The aggregate metadata is the union of children’s produces and the queries that remain unsatisfied after walking the children in order — which is what lets the validator say "this sequence still needs a 'project-config upstream" without running anything.

Most of a pipeline is sequence. The other primitives are for the moments where a straight line is not enough.

6.3.3 ashlar-loop — bounded repetition with a predicate

Sometimes one pass is not enough. A test suite fails and needs to be regenerated against the build output. An agent’s draft answer doesn’t satisfy a check. ashlar-loop turns "try, check, try again" into a composable unit.

(ashlar-loop generate-and-test
  #:until tests-pass?
  #:max 5)

The body runs, the predicate inspects the loop’s accumulated DAG, and if the predicate is satisfied the loop returns. Otherwise it repeats — up to #:max times. #:max is non-negotiable: loops are always bounded, and an exhausted loop produces a 'loop-exhausted failure.

The #:until predicate’s signature is (dag? -> boolean) — it receives the loop’s accumulated DAG, not just the latest body output. That widening exists because termination logic often depends on cumulative state, not the most recent node alone. "Have all expected behaviors completed?" is a question about the whole loop history; it wants to walk ancestors with dag-collect-until or dag-nearest-ancestor and fold across them. Constraining the predicate to a single node would force that cumulative work into a side channel — either an accumulator node the body writes on every iteration, or an out-of-band ref-cell the loop closes over — and either dodge sacrifices the property that everything visible to the predicate is also visible to validators, tracers, and later ashlars.

When the predicate genuinely only needs the most recent body output, on-latest makes that intent explicit at the call site: (ashlar-loop body #:until (on-latest tests-pass?) #:max 5) reads as "this is a head-only check," and the surrounding code stays honest about what state the termination depends on.
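
The loop contract (bounded, predicate over the accumulated DAG, failure on exhaustion) can be modelled in a few lines. Everything named toy- below is hypothetical, the DAG is again a newest-first list of (type . content) nodes, and placing the 'loop-exhausted failure at the head of the returned DAG is a simplification made for this sketch.

```racket
#lang racket/base

;; Toy model: a DAG is a list of (type . content) nodes, newest first.
(define (toy-failed? dag)
  (and (pair? dag) (eq? (car (car dag)) 'failure)))

;; Run `body` until `done?` accepts the accumulated DAG, the body fails,
;; or the iteration budget is spent.
(define (toy-loop body #:until done? #:max max-iterations)
  (lambda (dag)
    (let loop ([dag dag] [n 0])
      (cond
        [(= n max-iterations)
         (cons (cons 'failure (hasheq 'kind 'loop-exhausted)) dag)]
        [else
         (define next (body dag))
         (if (or (toy-failed? next) (done? next))
             next
             (loop next (add1 n)))]))))

;; Lift a predicate on one node into a predicate on the whole DAG that
;; looks only at the latest head.
(define ((toy-on-latest node-pred) dag)
  (and (pair? dag) (node-pred (car dag))))

;; Each iteration appends an attempt numbered by the DAG's size.
(define (attempt dag)
  (cons (cons 'attempt (add1 (length dag))) dag))

(define (third-attempt? node)
  (and (eq? (car node) 'attempt) (>= (cdr node) 3)))

(define finished
  ((toy-loop attempt #:until (toy-on-latest third-attempt?) #:max 5) '()))
(define exhausted
  ((toy-loop attempt #:until (toy-on-latest third-attempt?) #:max 2) '()))
```

With a budget of 5 the loop stops after the third attempt; with a budget of 2 it returns a DAG headed by a 'loop-exhausted failure. Because the predicate receives the whole DAG, a cumulative check would be written the same way, just without the toy-on-latest wrapper.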

The cross-iteration property worth naming: iteration N+1’s body runs against a DAG that already contains every node iteration N produced. A leaf inside the body can reach for dag-nearest-ancestor and find last iteration’s work through the parent-scope inheritance, with no extra wiring. Loops compose with themselves for free; a nested ashlar-loop inside an ashlar-loop body works because each iteration of the outer loop sees the inner loop’s full history and each iteration of the inner loop sees the outer’s.

For between-iteration repair work (asking a human, clearing a stale file, fetching missing context), fold the repair into the body. The most common shape is a sequence: (~> try-to-produce heal-if-needed). heal-if-needed can itself be an ashlar-match that branches on the last node’s content, so the healer runs only when the try actually failed. That keeps the repair in the topology as an ordinary sibling ashlar rather than as a privileged slot.

For exact signatures of ashlar-loop and on-latest, see stone/edge. For a worked example of a #:until predicate that folds across iteration history, see Route on a node from earlier in the pipeline.

6.3.4 ashlar-match — conditional branching

When the next thing to do depends on what the pipeline has produced so far, you reach for ashlar-match. It’s a macro, and the macro shape matters: branches are collected at expansion time into a hash-table dispatch.

(ashlar-match (lens-extract classification-lens)
  ['refactor  refactor-pipeline]
  ['feature   feature-pipeline]
  ['bugfix    bugfix-pipeline])

ashlar-match’s purpose is to branch on a chosen projection of DAG state. The extractor decides what to project — the latest head’s classification field, the kind of a 'classification ancestor several steps back, a fold across many nodes — and the branch table routes on the projected value. The selected branch — itself an ashlar — then runs against the same DAG.

There is no default branch. An extracted value with no matching entry produces a 'match-failed failure, as does dispatching with no node to match on. The absence of a default is deliberate: a match enumerates an expected space, and an unexpected value is a real failure worth surfacing.

The macro shape is why the primitive exists. With a function, branches would be runtime data — invisible to a validator. With a macro, the children are a static list, and the validator can walk them, compute their produces and queries, and verify each branch is internally coherent before anything runs.
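
A toy version shows both halves of the argument: the dispatch table built at expansion time, and the children left visible on the resulting value. This is a sketch under assumptions, not Stone's macro. toy-match, make-toy-match, and toy-composite are hypothetical, and a real walk rule is reduced here to a plain list of children.

```racket
#lang racket/base

;; Callable composite that still exposes its children for inspection.
(struct toy-composite (proc children)
  #:property prop:procedure 0)

(define (make-toy-match extract table)
  (toy-composite
   (lambda (dag)
     (define branch (hash-ref table (extract dag) #f))
     (if branch
         (branch dag) ; the selected branch runs against the same DAG
         (cons (cons 'failure (hasheq 'kind 'match-failed)) dag)))
   (hash-values table)))

;; The macro collects the branches into a hash table; there is no
;; default clause, so an unexpected key becomes a failure.
(define-syntax-rule (toy-match extract [key branch] ...)
  (make-toy-match extract
                  (make-immutable-hash (list (cons key branch) ...))))

;; Toy DAG: list of (type . content) nodes, newest first.
(define (latest-content dag) (and (pair? dag) (cdr (car dag))))
(define (tag label) (lambda (dag) (cons (cons 'handled label) dag)))

(define router
  (toy-match latest-content
    ['feature (tag "feature-pipeline")]
    ['bugfix  (tag "bugfix-pipeline")]))

(define routed   (router (list (cons 'classification 'bugfix))))
(define rejected (router (list (cons 'classification 'docs))))
(define branches (toy-composite-children router))
```

The branches list is available without running anything, which is what a validator needs, and the 'docs classification surfaces as a 'match-failed failure instead of falling through silently.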

6.3.4.1 Extractors as projection over DAG state

The extractor API is two-layer, and the layering exists for a specific reason: the common case wants to stay terse, but the constraint that produced the common case isn’t actually intrinsic to what match is for.

A lens? extractor is sugar for the dominant case: "project from the latest head’s content at this path." Use it when match runs immediately after a classifier and the value to route on lives in the classifier’s output.
A procedure extractor receives the work DAG and chooses where to look. Use it when the classification you want to route on lives elsewhere — past intermediate work, across a loop boundary, or folded out of multiple nodes. The signature is (dag? -> any/c), and the body is free to call dag-nearest-ancestor, dag-collect-until, or any other DAG navigator.

Earlier versions of Stone forced classifiers to be immediately upstream of match because the extractor was handed only the latest head. That constraint was an artifact of the API, not a property of what match is for — match has always been about branching on a projection, and "the latest head’s content" is just the most common projection. The wider procedure form lifts that constraint without disturbing the lens sugar: the common case stays terse, the wider case becomes possible, and on-latest is the explicit annotation when a procedure predicate happens to only need the latest head. The signature reads "DAG in, value out" everywhere; the lens form is recognized and unwrapped at construction time so the runtime path is one shape regardless of which form the author wrote.

For exact signatures, see stone/edge. For a worked example of routing on a non-head ancestor, see Route on a node from earlier in the pipeline.

6.3.5 ashlar-map — data-dependent fan-out

ashlar-map is for the case where topology depends on data. An earlier ashlar produced a list of items, each one deserves the same work done to it, and the number of items is not known in advance.

(ashlar-map (lens-extract behaviors-lens)
  implement-behavior)

The extractor pulls a list from the most recent DAG head. For each item, ashlar-map takes a snapshot of the DAG, places a synthetic 'map-item node carrying the item into that snapshot, and runs the body against that lane-local DAG. Lanes don’t see each other’s work: a behavior being implemented in lane 3 can’t query a half-finished behavior from lane 7.

After every lane has run, their non-failure result nodes are appended to the outer DAG. Failing lanes are dropped from the final DAG, but they are not silent — they were logged through the same lifecycle events as any other ashlar. If the extractor returns an empty list, ashlar-map produces a 'map-empty failure: fanning out over nothing is treated as a real failure, because a pipeline that expected work and found none almost always has a bug upstream.
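
The lane mechanics can be sketched with the same list-based toy DAG. Nothing here is Stone's implementation: toy-map is hypothetical, lanes run one after another instead of concurrently, and only each lane's final head is carried back to the outer DAG, which is a simplification.

```racket
#lang racket/base

;; Toy model: a DAG is a list of (type . content) nodes, newest first.
(define (toy-failure kind)
  (cons 'failure (hasheq 'kind kind)))
(define (failure-node? node) (eq? (car node) 'failure))

(define (toy-map extract body)
  (lambda (dag)
    (define items (extract dag))
    (cond
      ;; Fanning out over nothing is a failure in its own right.
      [(null? items) (cons (toy-failure 'map-empty) dag)]
      [else
       ;; Each lane gets the pre-fan-out DAG plus a synthetic map-item
       ;; node. Immutable lists make the snapshot free, and no lane can
       ;; observe another lane's nodes.
       (define lane-heads
         (for/list ([item (in-list items)])
           (car (body (cons (cons 'map-item item) dag)))))
       (define survivors
         (filter (lambda (node) (not (failure-node? node))) lane-heads))
       ;; Surviving results join the outer DAG; failed lanes are dropped.
       (append (reverse survivors) dag)])))

(define (implement dag)
  (define item (cdr (car dag))) ; the lane's head is its map-item node
  (cons (if (string=? item "broken")
            (toy-failure 'implementation-error)
            (cons 'implementation (string-append "impl of " item)))
        dag))

(define (behaviors dag) (cdr (car dag)))

(define after
  ((toy-map behaviors implement)
   (list (cons 'behaviors '("login" "broken" "logout")))))
(define empty-run
  ((toy-map behaviors implement) (list (cons 'behaviors '()))))
```

Three items go in and two 'implementation nodes come out, with the failed lane absent from the result. The empty list produces a 'map-empty failure without running the body at all.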

Every ashlar-map must be paired with an ashlar-reduce to collapse its lane results into a single node. The reducer is typically the next step in the enclosing ~> sequence, calling dag-query-all for the lane output type and producing an aggregate. The validator refuses an ashlar-map whose output could reach a downstream ashlar without an intervening ashlar-reduce.

6.3.6 ashlar-parallel — static fan-out

Where ashlar-map is shaped by data, ashlar-parallel is shaped by topology. The number of lanes is a compile-time decision — you wrote them.

(ashlar-parallel
  lint-code
  type-check
  run-tests)

Each lane runs against the same pre-fan-out DAG, isolated from the others exactly as in ashlar-map. After all lanes have run, their non-failure nodes are appended to the outer DAG together.

The difference between the two primitives is not implementation — both are fan-outs with lane isolation. The difference is the question each one answers. ashlar-map asks how many and lets the DAG decide. ashlar-parallel asks which and writes the answer into the pipeline source.

Every ashlar-parallel must be paired with an ashlar-reduce, by the same rule as ashlar-map and for the same reason. The fan-out’s first return value is otherwise whichever lane was last in source order, which silently couples downstream behavior to an arbitrary choice.

6.3.7 ashlar-reduce — the semantic marker

ashlar-reduce is the thinnest primitive in the set. It takes an ashlar and wraps it with a walk rule that marks it as a reducer. That’s essentially all it does at runtime — the body ashlar runs against the DAG unmodified.

(ashlar-reduce synthesize-report)

It exists because an ashlar that queries dag-query-all for every node of some type and rolls them into a single summary is doing a reduction, and naming it one matters. Validators, log viewers, and humans reading topology code all benefit from seeing ashlar-reduce at the moment a fan-out collapses back into a single node. The primitive is a label on the topology more than a runtime mechanism.

6.3.8 make-ask-human — the topology-level ask

make-ask-human is not a composition primitive; it’s an ashlar constructor. It lives in this document because human interaction, in Stone, is a topology concern. The place where a pipeline waits on a person is an ashlar, composed into sequences and loops and matches like any other.

(define ask-approval
  (make-ask-human channels
    #:format-fn format-approval-question
    #:name 'approval
    #:produces 'approval-decision))

The constructor takes a channel bundle (out, in, cancel) and a #:format-fn that builds the question from the current DAG. At runtime the ashlar writes the question out, blocks on either an answer or a cancel signal, and produces a typed node carrying the response. Upstream, the channels are wired to a TUI, a CLI, a webhook, or whatever the embedding application provides. For channel wiring and embedding details, see Ask Human.

6.3.9 How they compose

Every primitive returns an ashlar. Every ashlar composes with every primitive. That one-sentence property is why a real Stone pipeline looks like nested parentheses going several levels deep without ever leaving the vocabulary:

(~> load-config
    classify-request
    (ashlar-match (lens-extract classification-lens)
      ['feature
       (ashlar-map (lens-extract behaviors-lens)
         (ashlar-loop
           (~> generate-test run-test interpret-result)
           #:until test-passes?
           #:max 3))]
      ['bugfix
       (~> locate-bug write-regression-test apply-fix)])
    summarize-work)

Every node in that tree is an ashlar-meta, and every one can be queried for its produces, its queries, its children, its walk rule, its name. A validator can walk the structure from the root and answer: is every query satisfied by an upstream produces? Are the branches of the match internally coherent? Does the loop body produce a node the predicate can inspect? The walk rule at each node tells the validator how that node wants its children read, so the single driver handles every composite without a type switch. None of that requires running a single ashlar.

6.3.10 Failure propagation

A Stone pipeline never throws when work fails — a failing ashlar returns a DAG whose latest head is a failure node, and the primitives know how to read that.

The detection site is uniform. Every primitive invokes a child as (DAG -> DAG) and then asks (dag-failed? result); if the answer is yes, the failure threads outward. No primitive unpacks a separate "did this succeed" channel, because there is no separate channel. The DAG is the only return value, and the failure state is visible on it.

~> stops at the first child whose returned DAG is failed and threads that DAG out.
ashlar-loop exits immediately when the body’s returned DAG is failed. Exhausting #:max produces a fresh 'loop-exhausted failure.
ashlar-match produces a 'match-failed failure when there is no node to match on or when the extracted value matches no branch. A branch that itself returns a failing DAG passes through unchanged.
ashlar-map and ashlar-parallel drop failing lanes from the rolled-up DAG. ashlar-map produces 'map-empty when the extractor returns no items. Both composites require an immediately-following ashlar-reduce; the validator refuses any position where a fan-out’s last-result could be read without a reducer closing it, raising a 'fanout-not-reduced error at compose time.

Stitch those rules together and the cumulative behavior is that a failure at any depth of nesting propagates outward without exceptions. Every ashlar above it on the composition tree gets a chance to recognize the failure, branch on it, or wrap it with more context, because the failure is a value passing through the same interfaces as a success. The pipeline stops, but the DAG is still inspectable and the trace is still complete.
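A minimal sketch of the sequencing rule, assuming a toy model in which a DAG is a plain list of nodes (newest first) and a failure is a node of type 'failure. The names node, failed?, produce, fail, and seq are illustrative stand-ins, not Stone’s API:

```racket
#lang racket/base

;; Toy model, NOT Stone's API: a DAG is a list of nodes, newest first.
(struct node (type content) #:transparent)

(define (failed? dag)
  (and (pair? dag) (eq? (node-type (car dag)) 'failure)))

;; In this sketch an "ashlar" is any procedure from DAG to DAG.
(define ((produce type content) dag)
  (cons (node type content) dag))

(define ((fail reason) dag)
  (cons (node 'failure reason) dag))

;; Sequence: run children left to right, stop at the first failed DAG,
;; and thread that DAG out unchanged.
(define ((seq . children) dag)
  (let loop ([dag dag] [children children])
    (cond
      [(null? children) dag]
      [else
       (define next ((car children) dag))
       (if (failed? next)
           next
           (loop next (cdr children)))])))

;; A failure two levels down stops both the inner and the outer sequence.
(define pipeline
  (seq (produce 'config "loaded")
       (seq (produce 'parsed "ok")
            (fail 'type-error)
            (produce 'compiled "never runs"))
       (produce 'report "never runs")))

(define result (pipeline '()))
```

Here result still holds the 'config and 'parsed nodes beneath the failure, which is the sense in which the DAG stays inspectable after the pipeline stops.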

6.3.11 Scope and rollup

Every composition primitive produces a labeled sub-DAG at runtime. The children run inside that sub-DAG, and when the composite finishes, its nodes roll up into the parent.

The rollup is transparent. Every node a body places in its own sub-DAG during execution is merged into the outer DAG when the composite returns — for every primitive, including ashlar-loop. There is no per-primitive policy about what gets surfaced and what stays hidden at the DAG level. Scope labels exist during execution so that the validator can reason about locality, but once the composite finishes, those labels have done their job and the nodes are simply part of the outer DAG.

The topological composites (~>, ashlar-loop, ashlar-match, ashlar-map, ashlar-parallel, ashlar-reduce) behave as pass-throughs for DAG queries. A body ashlar that calls dag-nearest-ancestor for a type produced by some earlier sibling, an outer pipeline step, or a prior iteration resolves that query through its sub-DAG’s parent chain and finds what it needs without the caller having to think about which scope the producer lives in. Inside an ashlar-loop, iteration N+1 sees every node iteration N produced; the tail "wins" because the walk starts at the most recent head and follows first-parent through that iteration’s output.

ashlar-map and ashlar-parallel add one extra rule: lanes can’t see each other’s mid-flight output. Each lane runs against a snapshot that captures everything outside the fan-out, but the lane’s own sub-DAG is isolated from its siblings for the duration of the fan-out. The reducer that follows — and only the reducer — sees the union of every lane’s rollup plus the outer state, which is exactly the shape a reduction wants: "collect across lanes, produce one aggregate."
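Lane isolation and rollup can be sketched with the same kind of toy model: a DAG as a list of nodes, newest first. Everything here (node, failed?, fan-out, lane, broken-lane) is a hypothetical stand-in, and real Stone lanes run in labeled sub-DAGs instead of plain lists:

```racket
#lang racket/base
(require racket/list)

;; Toy model, NOT Stone's API: a DAG is a list of nodes, newest first.
(struct node (type content) #:transparent)

(define (failed? dag)
  (and (pair? dag) (eq? (node-type (car dag)) 'failure)))

;; Run every lane against the same snapshot, so no lane can see a sibling's
;; output. Failed lanes are dropped; the rest roll up in source order, which
;; leaves the last lane's nodes at the head of the result.
(define (fan-out lanes snapshot)
  (define rollups
    (for/list ([run-lane (in-list lanes)])
      (define result (run-lane snapshot))
      (if (failed? result)
          '()
          ;; Keep only the nodes this lane added on top of the snapshot.
          (take result (- (length result) (length snapshot))))))
  (append (append* (reverse rollups)) snapshot))

;; Each hypothetical lane records how many nodes it could see.
(define ((lane type) dag)
  (cons (node type (length dag)) dag))

(define (broken-lane dag)
  (cons (node 'failure "lint crashed") dag))

(define outer (list (node 'source "code")))
(define rolled
  (fan-out (list (lane 'lint) broken-lane (lane 'tests)) outer))
```

Both surviving lanes record a visible-node count of 1: each saw only the outer snapshot, never the other lane’s output. A reducer running next would see all of rolled at once.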

Bodies that genuinely need state to persist beyond the primitive’s lifetime build that persistence into typed-node content. Agent ashlars do exactly this: the multi-turn conversation is built up as a list of message records during the agent’s execution, and when the agent finishes, the conversation list is embedded as 'conversation inside the content of the produced typed node. Downstream ashlars can read it off the node’s content if they care about the turns; otherwise it’s dead weight to them and they ignore it.

6.4 Agents and Tools

An agent ashlar is an ashlar.

That sentence is worth pausing on, because agent work in most frameworks lives in a layer of its own — a thing outside the pipeline that the pipeline calls into, with its own lifecycle and its own way of being traced. Stone doesn’t do that. make-agent-ashlar is a constructor like make-ashlar. What it returns composes with ~>, wraps inside ashlar-loop, branches under ashlar-match, and emits the same lifecycle events as every other ashlar. From the topology’s point of view, a twelve-turn tool-using agent and a three-line parser are the same kind of thing.

This doc explains what an agent ashlar actually does when it runs, how middleware gives it capabilities, why tools are also middleware, and how a decision function tells the loop when to stop. It assumes you have read Ashlars and Edge Primitives.

6.4.1 Agent ashlars as ashlars

make-agent-ashlar takes a caller, a node type to produce, a system prompt builder, a user message builder, a list of middleware, and a decision function. It returns an ashlar-meta, exactly what the composition primitives expect:

(make-agent-ashlar caller
  #:produces 'impl-written
  #:queries  '(test-written bounded-context)
  #:middleware (list (read-file  #:allowed-paths (list project-root))
                     (write-file #:allowed-paths impl-paths)
                     (edit-file  #:allowed-paths impl-paths))
  #:decide stop-on-write
  #:system (lambda (dag) ...)
  #:user   (lambda (dag) ...))

Inside that ashlar, a multi-turn LLM conversation runs: the agent calls the model, a tool middleware notices the response asked for a file, the tool executes, the result is folded into the conversation, the model is called again, and so on until a decision function decides the work is done or a turn budget is hit. All of that happens between the topology’s 'ashlar-start and 'ashlar-end for this ashlar. Upstream, an ashlar-loop with #:until impl-passes-tests? wraps it the same way it would wrap a parser, because the thing inside is an ashlar and that’s the only shape the loop primitive knows how to handle.

The payoff is straightforward. A retry predicate around an agent ashlar works like a retry predicate around any other ashlar. A healer running between iterations needs no special vocabulary for "the body was an agent." The validator sees #:queries and #:produces the same way it sees those on a deterministic ashlar, and refuses a broken coupling without ever calling the model.

6.4.2 Inside an agent ashlar

Open the box. When the composition primitive invokes the agent ashlar against the current DAG, the ashlar does five things in order:

Call #:system and #:user against the DAG to build a system prompt and a first user message. Both are closures over the DAG, so the prompts can be shaped by anything prior ashlars produced — a config node, a test result, a human’s answer.
Initialize a private conversation — a list of message records — with the user message as its first entry. This list is the agent’s own memory for the duration of this ashlar, separate from the pipeline’s DAG.
Run the multi-turn loop: build a fresh context with the current conversation, run the middleware onion to the model, apply the decide function, and either return or loop for another turn, up to #:max-turns. Each turn’s assistant response — including any tool call blocks — is appended to the conversation exactly as the LLM returned it, so later turns see the full structured history.
When the loop exits, extract the final draft — the text content of the last assistant response — and inspect it. If #:response-format is set, the content is already a parsed hash; otherwise it is a string.
Produce a typed node of the declared #:produces type, embedding the conversation as 'conversation in the node’s content (not its meta). If #:response-format yielded a hash, the typed node’s content is that hash with 'conversation merged in. If the response was a bare string, the content is wrapped as (hasheq 'text <the string> 'conversation <the message list>). The agent errors at construction time if a user’s response-format schema declares a top-level 'conversation field, because the name is reserved.

That last step is the bridge between the agent’s turn-by-turn conversation and the topology’s ashlar-by-ashlar DAG. Downstream code that wants to inspect the agent’s conversation — which tools it called, in what order, what the model said between them — reaches for (hash-ref (node-content node) 'conversation) and gets back a list of message records. Debugging tools use this constantly. Normal ashlars never look at it: they care only about the typed content the agent produced, not about how the agent got there.
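The two content shapes from the final step can be sketched as a small function. The name embed-conversation is hypothetical, the conversation entries are plain strings here (not message records), and the reserved-key check is modeled as a run-time error purely for illustration:

```racket
#lang racket/base

;; Illustrative only: shape a draft into node content the way the last
;; step describes. A hash draft gets 'conversation merged in; a string
;; draft is wrapped under 'text.
(define (embed-conversation draft conversation)
  (cond
    [(hash? draft)
     (when (hash-has-key? draft 'conversation)
       (error 'embed-conversation "'conversation is a reserved key"))
     (hash-set draft 'conversation conversation)]
    [(string? draft)
     (hasheq 'text draft 'conversation conversation)]
    [else
     (raise-argument-error 'embed-conversation "(or/c hash? string?)" draft)]))

(define from-string (embed-conversation "done" '("user: go" "assistant: done")))
(define from-hash   (embed-conversation (hasheq 'ok #t) '("user: go")))
```

Either way, (hash-ref content 'conversation) works on the result, which is what lets downstream debugging code ignore whether a response format was set.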

6.4.2.1 The conversation is a list of messages, not a DAG

A conversation is plain data: a (listof message) where each message carries a role ('user, 'assistant, 'tool), content, optional tool calls, an optional call-id, and a metadata hash. The message struct lives in the stone/messages module, alongside the message-text accessor that extracts the textual representation of a message regardless of whether the content is a string, a hash, or a future multimodal-block list.

The conversation is severed from the outer pipeline DAG. It is not a dag, it has no parent pointer, and it is not serialized into the structural lineage of any other node. Every turn is a list element; turn ordering is list position.

(require stone/messages)

(define result-node ...)
(define conv (hash-ref (node-content result-node) 'conversation))
(for ([m conv])
  (printf "~a: ~a~n" (message-role m) (message-text m)))

6.4.2.2 The middleware/caller contract

A caller is a procedure given to make-agent-ashlar that takes the current LLM-call shape and returns the model’s response. From a caller’s perspective, the wire shape is unchanged — providers receive the same (listof hash) they always have. The framework converts internally between (listof message) and the wire format at the caller boundary; you do not write that translation yourself unless you are implementing a new caller.

Middleware reads two distinct surfaces on the context:

(context-dag ctx): the outer pipeline DAG. Middleware uses this to query prior ashlars’ outputs (dag-nearest-ancestor, dag-query-all, ...). It is read-only from middleware’s perspective.
(context-messages ctx): the conversation list, i.e. the (listof message) threaded through this agent’s loop. Middleware that injects extra turns or post-processes the model’s response operates on this list.

The split is intentional: the outer DAG is structure shared with the rest of the pipeline; the conversation is private data the agent owns. Conflating the two — as Stone did briefly while the conversation pretended to be a DAG — papered over the seam in a way that misled middleware authors and complicated debugging.

6.4.2.3 Provider-specific bits go in metadata

Each message carries a metadata hash for provider-specific concerns the framework does not need to understand: cache directives, log probabilities, finish reasons, streaming chunk identifiers. The convention is that keys are namespaced symbols — 'anthropic/cache-control, 'openai/logprobs, 'vllm/finish-reason — so that two providers stashing bits on the same message cannot collide.

The framework does not inspect these keys; callers do. When you write a new caller for a provider, namespace your keys, and your caller can be composed with another’s without surgery.

6.4.2.4 One consequence for downstream consumers

Embedding the conversation in content puts it in the same shape as the rest of the node’s payload, which is the right place for it. The hazard is that consumers used to treating the agent’s output as a bare string now see a hash when the agent had no response format. The string lives under 'text: reach for (node-get result 'text) when you want the raw draft, or (node-text result) to get either-shape tolerance for free.

The conversation list is JSON-friendly by default (no dag struct lurking), but if you serialize the node’s content to a report or webhook payload, you may still want to drop the conversation key to keep the payload focused: (hash-remove content 'conversation).

6.4.3 Middleware as capability composition

A plain LLM call takes a prompt and returns text. To do anything more — read files, write files, run commands, remember past turns, inject a schema — you wrap that call with work before and after. Middleware is the unit of that wrapping.

A middleware, in Stone, is a named struct with a guard and a handler:

(make-middleware name
  (lambda (ctx) #t)                         ; guard
  (lambda (ctx inner-call)                  ; handler
    (define ctx-before (do-pre-work ctx))
    (define ctx-after  (inner-call ctx-before))
    (do-post-work ctx-after)))

The handler takes the current context and an inner-call — a function representing the rest of the chain plus, at the deepest layer, the model itself. The handler does its pre-work on the context, calls inner-call, does its post-work on what came back, and returns a new context.

Middleware composes as an onion. (list mw-a mw-b mw-c) means mw-a wraps mw-b wraps mw-c wraps the LLM call. On the way in, pre-work runs outer-to-inner: mw-a, then mw-b, then mw-c, then the model. On the way out, post-work runs inner-to-outer. Symmetric, composable, and — the useful part — each middleware owns both halves of its behavior inside one function. A tool middleware that injects a schema on the way in and dispatches a tool call on the way out does both in the same closure, sharing local variables.

As an everyday ashlar author, you never write run-onion yourself. You pass a middleware list to make-agent-ashlar and the framework handles the threading. The list order is outer-to-inner: the first middleware in the list is the outermost ring.
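To make the ordering concrete, here is a minimal sketch of onion composition in plain Racket. It is not Stone’s run-onion: guards are omitted, the context is reduced to a list of trace entries, and run-onion*, tracing-middleware, and fake-model are hypothetical names:

```racket
#lang racket/base

;; Stand-in model: a context is a list of trace entries, newest first.
;; A handler takes the context and an inner-call, like the handlers above.
(define (tracing-middleware name)
  (lambda (ctx inner-call)
    (define ctx-before (cons (list name 'pre) ctx))   ; pre-work
    (define ctx-after  (inner-call ctx-before))       ; rest of the chain
    (cons (list name 'post) ctx-after)))              ; post-work

;; Hypothetical model call: the innermost layer of the onion.
(define (fake-model ctx)
  (cons '(model call) ctx))

;; Wrap from the innermost handler outward, so the first handler in the
;; list ends up as the outermost ring.
(define (run-onion* handlers model)
  (for/fold ([inner model]) ([h (in-list (reverse handlers))])
    (lambda (ctx) (h ctx inner))))

(define call
  (run-onion* (list (tracing-middleware 'mw-a)
                    (tracing-middleware 'mw-b)
                    (tracing-middleware 'mw-c))
              fake-model))

;; Reverse so the trace reads in execution order, oldest entry first.
(define trace (reverse (call '())))
```

The resulting trace lists the pre entries for mw-a, mw-b, and mw-c, then the model call, then the post entries in the opposite order.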

6.4.4 Tools are middleware

In Stone, a tool is not a registry entry, not a dict of handlers, not a separate abstraction with its own lifecycle. It’s a middleware. make-tool returns one.

(require stone/tools)

(define count-words-tool
  (make-tool 'count-words
    #:schema (hasheq 'name "count_words"
                     'description "Count words in a string"
                     'input_schema (hasheq 'type "object"
                                           'properties (hasheq 'text (hasheq 'type "string"))
                                           'required '("text")))
    #:handler (lambda (input)
                (define text (hash-ref input 'text))
                (define n (length (regexp-split #rx" " text)))
                (values (format "~a words" n)
                        (hasheq 'words n)))))

#:schema is the JSON schema the LLM sees in the tool-use section of the API. #:handler is a function from a parsed input hash to two values: a display-text string the next turn’s conversation will carry, and a structured-meta hash that gets attached to the tool’s entry in the agent’s conversation for later inspection. That’s the entire contract.

Why is this middleware and not its own abstraction? Because a tool’s job is exactly what middleware’s job is. On the way in, it injects its schema into the context’s tools list so the model sees it as an available capability. It calls inner-call, the model decides whether to use the tool, and the response comes back. On the way out, it checks the response for tool-use blocks matching its name, runs the handler for each one, appends a tool-role message to the conversation with the result, and emits a recommendation ("success, continue" or "denied, keep looping"). One function, pre-work, a call, and post-work — the onion was already the right shape.

Treating tools as middleware means three things in practice. A tool composes with other middleware freely — the same onion handles both. A tool’s safety policy (#:allowed-paths, #:confirm?) is local to the tool, so different agent ashlars in the same pipeline can have different policy surfaces for the same tool. And custom tools are written with the same API as any other middleware: no new vocabulary, no second abstraction.

On the next turn, the agent passes the conversation list — which now includes the tool-role message the tool middleware appended — to the caller, which sends the full history to the model. The tool result ends up in the conversation the next pass sends, paired with the tool call that triggered it. That’s how the model "sees" what its tool call produced: as a structured result in the conversation, not as a reply wired directly back into the same turn.

6.4.5 Decisions: when to stop

An agent loop can run forever. Something has to tell it when to stop, when to return, and when to take another turn. That something is the #:decide function on the agent.

A decide function has the signature (context? (listof recommendation?) -> recommendation?). The context is the one the onion produced on the most recent turn. The list of recommendations is everything the middleware chain pushed onto the context during that turn — every tool that ran emitted one. The decide function looks at them and returns a single recommendation: 'halt to stop and surface as a failure, 'continue to exit cleanly with the current draft, or 'loop to take another turn.

Three patterns cover most real use:

continue-on-tool-use — the default. Loops while the recommendation list is non-empty, halts if any middleware halts, continues otherwise. This is what every agent ashlar gets when #:decide is omitted, and it handles the common "think, call tools as needed, then reply" pattern without any extra code.
tool-directed — a stricter variant: loops only when a middleware explicitly recommends 'loop. An empty recommendation list means the agent is done. Use this when you want tool middlewares to deliberately ask for another turn rather than inferring it from tool activity.
A custom decide — e.g. stop-on-write: loop until a successful write_file or edit_file tool call happens, then continue. A three-line function a pipeline author writes alongside their agent ashlar.

continue-on-tool-use and tool-directed ship in stone/decisions. A custom decide is a short function — that’s the whole API, and it’s deliberate that the termination rule is a first-class value and not buried in the framework.
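The three patterns can be sketched over a stand-in recommendation record. This assumes a shape Stone may not use: rec is a hypothetical struct, each function returns a bare action symbol where the real contract returns a recommendation, and the starred names are deliberately not the ones in stone/decisions:

```racket
#lang racket/base

;; Hypothetical stand-in for a recommendation: the suggested action, the
;; tool that emitted it, and whether the tool call succeeded.
(struct rec (action tool ok?) #:transparent)

(define (any-action? recs action)
  (ormap (lambda (r) (eq? (rec-action r) action)) recs))

;; Default pattern: halt if any middleware halts, loop while tools ran,
;; otherwise exit cleanly with the current draft.
(define (continue-on-tool-use* ctx recs)
  (cond
    [(any-action? recs 'halt) 'halt]
    [(pair? recs) 'loop]
    [else 'continue]))

;; Stricter pattern: loop only on an explicit 'loop recommendation.
;; Honoring 'halt first is an assumption carried over from the default.
(define (tool-directed* ctx recs)
  (cond
    [(any-action? recs 'halt) 'halt]
    [(any-action? recs 'loop) 'loop]
    [else 'continue]))

;; Custom pattern: keep looping until a write or edit succeeds.
(define (stop-on-write* ctx recs)
  (if (ormap (lambda (r)
               (and (rec-ok? r)
                    (memq (rec-tool r) '(write_file edit_file))))
             recs)
      'continue
      'loop))

(define after-read  (list (rec 'continue 'read_file #t)))
(define after-write (list (rec 'continue 'write_file #t)))
```

With after-read, the default sketch loops (a tool ran), the tool-directed sketch continues (nobody asked for another turn), and stop-on-write* loops (nothing has been written yet). With after-write, stop-on-write* continues.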

For the single-shot case — one LLM call, schema-constrained, no tools — pass #:max-turns 1 with an empty middleware list and let the default decide run one pass. That’s the principled expression of "one turn, no loop."

6.4.6 The adversary and the healer

A decide function tells the loop when to stop, but it can’t tell the agent whether its own output is good. That’s the adversary’s job.

#:adversary takes an ashlar — the same (DAG -> node) shape as every other ashlar. After the decide function says 'continue, the adversary runs against the pipeline DAG. A non-failure node means the output is acceptable; the agent produces its result and exits. A failure node means the output was rejected; the failure node’s message is appended to the conversation as a user-role message, and the agent runs another turn with that feedback in its conversation.

The DAG passed to the adversary includes the agent’s current draft as a typed node of the produced type — so (dag-nearest-ancestor dag produces) returns what the LLM just emitted, even though the agent hasn’t yet finalized its output.

(define check-has-tests
  (make-ashlar
    (lambda (dag)
      (define latest (dag-nearest-ancestor dag 'impl-written))
      (if (output-contains-tests? (node-content latest))
          (typed-node dag 'check-passed (hasheq))
          (make-failure-node (dag-heads dag) 'check-failed
            "Implementation is missing tests. Add unit tests before finishing.")))
    #:produces 'check-passed))

(define implement
  (make-agent-ashlar caller
    #:produces 'impl-written
    #:adversary check-has-tests
    #:middleware (list (read-file  #:allowed-paths (list project-root))
                       (write-file #:allowed-paths impl-paths))
    ...))

The adversary sees the same DAG the rest of the pipeline sees — it can query any node any prior ashlar produced, not just the agent’s own output. That makes it a general-purpose quality gate, not just a format checker.

#:heal-with is optional. When present and the adversary rejects, the healer ashlar runs before the agent’s next turn and its output is also appended to the conversation as a user-role message — immediately after the adversary’s feedback. The healer provides additional context: retrieved documents, computed values, anything the agent needs to do better that it can’t get on its own.

(define fetch-examples
  (make-ashlar
    (lambda (dag)
      (typed-node dag 'examples
        (retrieve-relevant-examples dag)))
    #:produces 'examples))

(define implement
  (make-agent-ashlar caller
    #:produces 'impl-written
    #:adversary check-has-tests
    #:heal-with fetch-examples
    #:max-healing 3
    ...))

#:max-healing (default 3) bounds heal cycles (reject → heal → retry), not adversary invocations. The adversary always votes at least once on the agent’s first draft, regardless of the healing budget. The budget only gates retries after a rejection.

Two semantic corners to keep in mind:

#:max-healing 0 is the gate idiom: adversary votes exactly once; pass → done, reject → 'healing-exhausted failure with no retry. Use this when the adversary is a yes/no filter and a rejection should be terminal.
#:max-healing N (for N >= 1) allows up to N heal cycles. The adversary can therefore vote up to N+1 times: once on the initial draft, then once per retry. The (+ N 1)th reject returns a 'healing-exhausted failure.

The historical sequencing of this check made #:max-healing 0 silently bypass the adversary entirely (the budget was consumed before the adversary ran). That has been fixed; the rule above is the contract.
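The budget arithmetic is easy to get wrong by one, so here is a small simulation of the rule as stated above. It is a sketch, not Stone’s code: simulate-healing is a hypothetical name, and the adversary is reduced to a pre-scripted list of 'pass and 'reject votes:

```racket
#lang racket/base

;; Simulate the contract: the adversary always votes on the first draft,
;; and each rejection consumes one heal cycle until the budget runs out.
;; Returns two values: the outcome and the number of adversary votes cast.
(define (simulate-healing max-healing verdicts)
  (let loop ([verdicts verdicts] [heals-used 0] [votes 0])
    (cond
      [(null? verdicts)
       (error 'simulate-healing "ran out of scripted verdicts")]
      [(eq? (car verdicts) 'pass)
       (values 'accepted (add1 votes))]
      [(= heals-used max-healing)
       (values 'healing-exhausted (add1 votes))]
      [else
       (loop (cdr verdicts) (add1 heals-used) (add1 votes))])))

;; Gate idiom: budget 0, one vote, rejection is terminal.
(define-values (gate-outcome gate-votes)
  (simulate-healing 0 '(reject)))

;; Budget 2: up to three votes before exhaustion.
(define-values (retry-outcome retry-votes)
  (simulate-healing 2 '(reject reject reject)))
```

With a budget of 0 the adversary votes once and a rejection is final; with a budget of 2 the third rejection is the one that exhausts healing, matching the N+1 rule.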

6.4.7 Conversation-tail invariant

There is one framework rule worth knowing about even if you never write a custom #:decide: the messages array going into an api-call never ends in an un-followed assistant turn.

Concretely, if your #:decide returns 'loop after the model produced a text-only response (no tool calls), the framework injects a synthetic (message 'user "Continue." ...) into the conversation before recursing. The next api-call’s messages array therefore ends in 'user, not 'assistant.

Why: some OpenAI-compatible servers — notably llama.cpp serving Qwen3 with 'enable_thinking — read a messages array ending in role: "assistant" as an assistant response prefill and reject the request with HTTP 400. The synthetic nudge prevents that shape from ever forming.

You will see this in trace payloads as a "Continue." user turn between two assistant turns. It is the framework, not the model, that put it there. The same situation does not arise with the default continue-on-tool-use decide function, because its empty-recs branch finalizes rather than loops — the nudge only kicks in for custom decide functions like stop-on-write that loop while waiting for some condition.
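A simplified sketch of the invariant, assuming a two-field stand-in for the message record (Stone’s real message struct carries more) and a hypothetical helper named ensure-user-tail. It checks only the role of the last message; the framework’s actual trigger is the 'loop decision described above:

```racket
#lang racket/base
(require racket/list)

;; Stand-in message record with just a role and text.
(struct msg (role text) #:transparent)

;; If the conversation (oldest first) ends in an assistant turn, append a
;; synthetic user nudge so the next request never ends in 'assistant.
(define (ensure-user-tail conversation)
  (if (and (pair? conversation)
           (eq? (msg-role (last conversation)) 'assistant))
      (append conversation (list (msg 'user "Continue.")))
      conversation))

(define nudged
  (ensure-user-tail (list (msg 'user "Write the parser.")
                          (msg 'assistant "Thinking about the grammar."))))
```

A conversation that already ends in a 'user or 'tool message passes through unchanged.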

See also Provider constraints for related provider-specific gotchas Stone navigates on your behalf.

6.4.8 Shaping the output: #:finalize

The default agent ashlar produces a typed node of the declared #:produces type. Its content is whatever the model returned — a parsed hash when the response format is set, or (hasheq 'text str) for bare-text replies — and the conversation list is injected into the content under the key 'conversation. For most agents this is exactly the right shape, and you will never think about it.

There is one pattern where it’s not enough. When the agent acts as an adversary (#:produces declares the success type, but a rejection should be a failure node carrying structured feedback), the default path can’t dispatch between the two shapes. #:finalize is the hook for that case. It receives the parsed content (the hash or string the model produced) and returns either a typed node of your chosen type or a failure node built with make-failure-node. The framework post-processes typed-node returns by merging the agent’s conversation under 'conversation and leaves failure-node returns untouched.

(define review-implementation
  (make-agent-ashlar caller
    #:produces 'review-passed
    #:response-format verdict-schema
    #:max-turns 1
    #:finalize
    (lambda (parsed)
      (cond
        [(hash-ref parsed 'ok #f)
         (typed-node dag-unused 'review-passed parsed)]
        [else
         (make-failure-node '() 'review-rejected
           (hash-ref parsed 'reason "no reason given"))]))
    #:system (lambda (dag) "Review the implementation…")
    #:user (lambda (dag) ...)))

Two guardrails worth naming. The finalize function must return a node — the framework doesn’t try to coerce other values — and the content it returns must not already contain the reserved 'conversation key (the agent’s conversation list lives there; a collision would be ambiguous). Both violations raise loudly at runtime rather than silently corrupting the pipeline.

Reach for #:finalize whenever the parsed content has to dispatch between distinct node shapes. Skip it when the only outcome is "one typed node of the declared produces type" — the default path already handles that.

The key property of the adversary-and-healer pattern is conversation continuity. The adversary’s rejection and the healer’s context enter the agent’s actual conversation, the same list of messages the model reads from. The agent sees rejection and remediation as natural conversation turns, not as hidden state or magic context injection. It can refer back to them, reason about what it missed, and try a genuinely different approach.

This is conversation-level healing: the rejection and optional context stay inside the agent’s conversation where the model will see them on its next turn. For pipeline-level repair (changing the world between iterations of an ashlar-loop, not just changing what the agent knows), fold the repair work into the loop body as a sibling ashlar (see Edge Primitives).

6.4.9 Capabilities, not permissions

One conceptual point is worth making explicit. Middleware is not a security boundary. It’s a capability boundary. An ashlar that doesn’t include write-file in its middleware list literally cannot write files: the tool schema never goes into the context, the model never sees the capability, and no tool-use block comes back. An ashlar that includes write-file with #:allowed-paths (list "/tmp/scratch/") can only write under that path because the tool’s handler rejects everything else. Both are enforced by construction, not by permission checks layered on top.

This is how Stone scopes what agents can do: by constructing different agent ashlars with different middleware lists for different parts of the pipeline. A planning agent might have read-file and nothing else. An implementation agent might have read-file, write-file, and edit-file, but only under src/. A documentation agent might have the same three, but only under docs/. Each is a separate ashlar, each carries its own capability surface in its source code where a reviewer can see it, and there is no global permission system because the middleware list is the permission.
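The path restriction a tool handler applies could look roughly like the following. This is a guess at the shape, not Stone’s implementation: path-parts and under-allowed-root? are hypothetical helpers, the check is purely syntactic (symlinks are not resolved), and the example paths assume a Unix-style filesystem:

```racket
#lang racket

;; Split a path into its elements after resolving "." and ".." syntactically.
;; Passing #f to simplify-path keeps it from consulting the filesystem.
(define (path-parts p)
  (explode-path (simplify-path (path->complete-path p) #f)))

;; A target is allowed when some root's elements are a prefix of its own.
;; Comparing whole elements avoids the string-prefix trap where
;; "/tmp/scratchy" would appear to sit under "/tmp/scratch".
(define (under-allowed-root? target allowed-roots)
  (for/or ([root (in-list allowed-roots)])
    (list-prefix? (path-parts root) (path-parts target))))

(define roots (list "/tmp/scratch/"))
(define inside?  (under-allowed-root? "/tmp/scratch/notes.txt" roots))
(define sibling? (under-allowed-root? "/tmp/scratchy/notes.txt" roots))
(define escape?  (under-allowed-root? "/tmp/scratch/../etc/passwd" roots))
```

Only the first target is accepted. The point of the design is that a handler like this is the sole route to the filesystem for that agent, so the check cannot be skipped by a caller further up the pipeline.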

6.4.10 Why this matters for pipelines

Pull the threads together. An agent ashlar encapsulates LLM-with-tools work as a single unit of composition. Its middleware list is its capability surface. Its decide function is its termination rule. Its response format is its output contract with the topology. Its conversation rides along inside the produced node’s content as 'conversation, available to debuggers but irrelevant to ordinary pipeline coordination. From outside, it’s an ashlar — the same ashlar-meta that make-ashlar produces, composing with the same primitives, emitting the same events.

The upshot is that agent ashlars are not a separate design discipline. You think about them the same way you think about any other ashlar. Pick the node type it produces, the node types it queries, the middleware list that gives it the capabilities it needs and nothing more, the decide function that matches its role. Write a system prompt that describes the work, a user message that asks for it, and hand the result to ~> or ashlar-loop like anything else. The framework handles the turns, the tool dispatch, the schema injection, the conversation rebuild, the event emission, and the hand-off. You handle the topology.

One layer, all the way down — even when the layer contains an agent taking twelve turns to write a file.

6.5 Ask Human

In Stone, asking a person a question is an ashlar.

That’s the whole premise of this document. Human interaction is not a hook on an agent, not a special interruption protocol, not a middleware that wraps an LLM call. It’s a constructor, make-ask-human, that takes its place in a topology like any other ashlar. A loop that retries until a human approves is just ashlar-loop. A pipeline step that gates on human input is just a place in ~> where a human ashlar happens to sit.

This doc explains why that shape is the right one, what the pieces are, and how a frontend plugs into a running pipeline without being part of it. It assumes you have read Ashlars and Edge Primitives.

6.5.1 Humans as pipeline participants

The thing an ashlar does is take a turn. It reads the DAG, does some work, appends a node. A make-ashlar around a parser and a make-agent-ashlar around an LLM call both fit that shape, and an ashlar that pauses to ask a human fits it just as well. Its work is to solicit a response; its node is the response it got. That the work involves waiting on a keystroke rather than on a subprocess or an API round trip is a runtime detail the composition layer doesn’t care about.

Making human interaction an ashlar means the framework doesn’t need a second set of rules for pipelines that include people. The same ~> threads a human ashlar the same way it threads an LLM ashlar. The same ashlar-loop retries a body that contains a human ashlar the same way it retries one that doesn’t. The same validator walks a pipeline with human ashlars the same way it walks one without them. One layer, all the way down.

6.5.2 The ashlar: make-ask-human

The constructor takes a channel bundle and a formatter for the question text:

(make-ask-human channels
  #:format-fn (lambda (dag)
                (format "Approve proposal: ~a?"
                        (node-get (dag-nearest-ancestor dag 'proposal)
                                  'summary)))
  #:name 'ask-approve
  #:produces 'human-response
  #:queries '(proposal))

#:format-fn is a function of the current DAG to a string. When the ashlar runs, it calls that function to build the question, which means the question can be shaped by anything the DAG knows: an earlier proposal, a prior failure, a list of choices produced by a classifier. The DAG is the source of truth; the formatter is the view onto it the human will see.

#:name, #:produces, and #:queries are the same metadata every other ashlar carries. #:queries tells the validator what the formatter is allowed to reach for. #:produces is the node type the ashlar will write. #:name is the ashlar’s identifier in logs.

At run time, the ashlar calls the formatter to build a question, puts it on the out channel, and syncs on either the in channel (for an answer) or the cancel channel (for an abort). It then produces a typed node whose content is (hasheq 'response answer-string): a hash with a single 'response key carrying whatever the human supplied. A cancelled ask produces the same shape with an empty string, so downstream code can always reach for 'response without a type check. Because the output is an ordinary typed node, a loop predicate or an ashlar-match extractor can inspect the human’s answer exactly the way it inspects an LLM’s structured output.

6.5.3 Channels are the medium

Why channels rather than callbacks or return values? Because channels are synchronous from the ashlar’s side and asynchronous from the frontend’s side, and that asymmetry is exactly the one human interaction needs. The ashlar blocks in sync until an answer or a cancel arrives, so an ashlar waiting on a human looks the same to the composition primitives as an ashlar waiting on a tool dispatch. Meanwhile the frontend (a stdin reader, a TUI event loop, an HTTP handler) pulls the question off the out channel whenever it’s ready, displays it however it wants, and posts the answer back. Producer and consumer live in different threads and share nothing except the channels.

An ask-human-channel bundles three of them:

out — the ashlar puts questions here; the frontend reads them.
in — the frontend puts answers here; the ashlar reads them.
cancel — either side can signal abort.

The bundle also carries a name field for identification. Every ask-human ashlar has its own channel bundle, so a topology with three ask-human ashlars has three separate, independently addressable bundles.
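The rendezvous described above can be sketched with plain Racket channels. This is a minimal model under stated assumptions, not Stone's implementation: the bundle struct, make-bundle, ask, and start-frontend are hypothetical names standing in for ask-human-channel and the ashlar body.

```racket
#lang racket/base
;; Hypothetical model of an ask-human rendezvous; not Stone's actual code.
(struct bundle (name out in cancel))

(define (make-bundle name)
  (bundle name (make-channel) (make-channel) (make-channel)))

;; Ashlar side: publish the question, then wait for an answer or a cancel.
;; Either way the result has the same shape: a hash with a 'response key.
(define (ask b question)
  (channel-put (bundle-out b) question)
  (sync (handle-evt (bundle-in b)
                    (lambda (answer) (hasheq 'response answer)))
        (handle-evt (bundle-cancel b)
                    (lambda (_) (hasheq 'response "")))))

;; Frontend side: runs in its own thread and shares only the channels.
(define (start-frontend b respond)
  (thread
   (lambda ()
     (define question (channel-get (bundle-out b)))
     (respond b question))))

;; One frontend answers, the other cancels.
(define approve-bundle (make-bundle 'ask-approve))
(start-frontend approve-bundle
                (lambda (b q) (channel-put (bundle-in b) "yes")))
(define answered (ask approve-bundle "Approve proposal?"))

(define abort-bundle (make-bundle 'ask-abort))
(start-frontend abort-bundle
                (lambda (b q) (channel-put (bundle-cancel b) 'cancelled)))
(define cancelled (ask abort-bundle "Approve proposal?"))
```

Both calls to ask return a hash of the same shape, which is the property that lets downstream code read 'response without checking whether the human answered or aborted.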

6.5.4 Channels flow from builder to frontend as data

The architectural choice is that channels are collected by the pipeline builder and handed to the frontend as plain data. Constructing an ask-human-channel with make-ask-human-channel inside the dynamic extent of call-with-collected-ask-human-channels auto-registers it with the currently active collector. The helper wraps the builder, runs it with a collector active, and returns both the built pipeline and the list of channels that were built during construction:

(define-values (pipeline channels)
  (call-with-collected-ask-human-channels
    (lambda ()
      ; every make-ask-human-channel call inside here
      ; auto-registers with this call's collector
      (~> propose verify ask-approve))))

The frontend receives channels and syncs on all of their out channels:

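; build one event per bundle; each handler knows which bundle it answers
(define question-evts
  (for/list ([ch channels])
    (handle-evt (ask-human-channel-out ch)
      (lambda (question)
        (display-question-and-get-answer
          question
          (ask-human-channel-in ch)
          (ask-human-channel-cancel ch))))))
; block until any ashlar asks, serve it, then wait again
(let loop ()
  (apply sync question-evts)
  (loop))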

The frontend doesn’t know what ashlars exist, what they’re asking, or where they live in the topology. It listens to everything the builder handed over and routes answers back on the matching channel.

Auto-bubble for nested builders. The collector table is shared across every call-with-collected-ask-human-channels in the dynamic extent: only the outermost call creates a fresh table, and nested calls reuse it. A channel built inside any depth of nested composition still lands in the outermost collector’s returned list, so a top-level pipeline builder gets every channel regardless of how deeply a sub-builder nested its own collector.

Collision at construction. Every channel has a name field, and that name is the registry key. Two channels with the same name inside the same collector raise at construction time rather than silently overwriting — a duplicate is almost always a bug (two ashlars fighting over the same rendezvous channels) and catching it loudly is cheaper than chasing a confused frontend later.
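The collector behaviour (one table per outermost call, reuse by nested calls, loud failure on a duplicate name) can be modelled with a Racket parameter. This is a sketch under stated assumptions: current-collector, make-bundle, and call-with-collected are hypothetical stand-ins for Stone's own helpers, and the error message is invented for illustration.

```racket
#lang racket/base
;; Hypothetical model of the collector; names and error text are assumptions.
(define current-collector (make-parameter #f))

(struct bundle (name out in cancel))

;; Registers with the active collector (if any); a duplicate name raises.
(define (make-bundle name)
  (define b (bundle name (make-channel) (make-channel) (make-channel)))
  (define table (current-collector))
  (when table
    (when (hash-has-key? table name)
      (error 'make-bundle "duplicate channel name: ~a" name))
    (hash-set! table name b))
  b)

;; Only the outermost call creates a table; nested calls reuse it, so
;; channels built at any depth land in the outermost returned list.
(define (call-with-collected builder)
  (define outer (current-collector))
  (define table (or outer (make-hasheq)))
  (define built
    (parameterize ([current-collector table])
      (builder)))
  (values built (hash-values table)))

;; A builder that nests a sub-builder with its own collector call.
(define-values (pipeline channels)
  (call-with-collected
   (lambda ()
     (make-bundle 'ask-approve)
     (define-values (sub _sub-channels)
       (call-with-collected (lambda () (make-bundle 'ask-missing))))
     (list 'pipeline sub))))
```

The outer call returns both bundles even though one was built inside a nested collector, and building two bundles named the same inside one collector raises instead of overwriting.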

Why this shape rather than a module-global registry? Pipelines in Stone are composed bottom up — ashlars are created, combined into sequences and loops, then handed to run-pipeline. A global registry was a rendezvous point between builder and frontend, but it hid the fact that the builder produces an extra output (channels) and coupled every process to one live topology. The collector helper makes that coupling explicit as a return value: two topologies can be built in the same process without cross-talk, tests get per-construction isolation by default, and the data flow is readable in the source — no hidden global state.

6.5.5 The frontend is not part of the topology

An ask-human ashlar is an ashlar. A frontend is not. The code that displays questions and reads answers doesn’t participate in the DAG, doesn’t emit ashlar lifecycle events, and doesn’t declare #:produces or #:queries. It’s a long-lived thread alongside the pipeline whose entire job is to move strings between the channels and a human.

The consequence is that the same topology runs under different frontends with no topology changes. Unit tests plug in a test frontend that answers every question with a canned response. A demo runs it through a stdin reader. Production ships it with a TUI. The ashlars don’t change because the thing that answers them does.
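A test frontend of the kind mentioned above might look like the following sketch. It assumes a bundle struct with out, in, and cancel channels as a stand-in for ask-human-channel; start-canned-frontend and ask are hypothetical names, and the real ashlar side is reduced to a put followed by a get.

```racket
#lang racket/base
;; Hypothetical test frontend; the bundle struct stands in for Stone's.
(struct bundle (name out in cancel))
(define (make-bundle name)
  (bundle name (make-channel) (make-channel) (make-channel)))

;; Answers every question on every bundle with the same canned string.
(define (start-canned-frontend bundles canned)
  (define evts
    (for/list ([b (in-list bundles)])
      (handle-evt (bundle-out b)
                  (lambda (question)
                    (channel-put (bundle-in b) canned)))))
  (thread (lambda ()
            (let loop ()
              (apply sync evts)
              (loop)))))

;; Stand-in for the ashlar side of the exchange.
(define (ask b question)
  (channel-put (bundle-out b) question)
  (channel-get (bundle-in b)))

(define bundles (list (make-bundle 'ask-approve) (make-bundle 'ask-missing)))
(define frontend (start-canned-frontend bundles "yes"))
(define answers (for/list ([b (in-list bundles)]) (ask b "Proceed?")))
(kill-thread frontend)
```

Swapping this thread for a stdin reader or a TUI changes nothing on the asking side, which is the separation the section describes.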

6.5.6 Composing human interaction

Human ashlars compose with every other primitive, and the common patterns fall out naturally.

An approval gate is just a sequence:

(~> propose-change ask-approve apply-change)

A retry-until-approved loop is just ashlar-loop with the human ashlar inside the body:

(ashlar-loop (~> propose verify ask-approve)
  #:until approved?
  #:max 5)

A learning loop has the same shape. The human ashlar lives in the body, and an ashlar-match decides whether to ask the human based on what the first half of the body produced:

(ashlar-loop
  (~> discover-config
      (ashlar-match (lens 'complete?)
        [#t noop]
        [#f ask-for-missing]))
  #:until has-required-fields?
  #:max 5)

On a clean first pass noop runs and the predicate exits the loop. On a pass where fields are missing, ask-for-missing fires, lands the answer on the DAG, and the next iteration’s discover-config sees strictly more information. The human is asked only when they’re needed.

Branching on a response is ashlar-match against the response node:

(ashlar-match (lens 'response)
  ["yes" apply-change]
  ["no" cancel-change])

None of these compositions require special knowledge that a human is involved. The point is that they don’t have to.

6.5.7 Why not middleware for human input?

Readers coming from frameworks where every interaction with an agent goes through a middleware chain will look for a hook where a middleware could ask a human. Stone doesn’t offer that shape. Middleware wraps an LLM call, running once per conversation turn, but pipelines rarely want to ask a human at every turn — they want to ask at specific points in the topology: after a proposal is drafted, before a plan is applied, when a learning loop is short on information. Those points are steps in a composition tree, not per-turn events.

Middleware is also invisible to the topology. A frontend that wants to know "which ashlars can ask the human?" inspects the channels list returned by the pipeline builder; a middleware-hidden interaction would be opaque to the same query. Treating ask-human as an ashlar keeps the interaction visible, discoverable, and composable with every primitive.

6.6 Validation

A Stone pipeline can be checked before any ashlar runs.

The check is purely structural. It reads metadata that every ashlar carries on its ashlar-meta struct, walks the composition tree from the root, and verifies that every node type an ashlar tries to read has been produced by some ashlar upstream. It runs no ashlar bodies. It calls no LLMs. It builds no DAG. The whole pass is a traversal over metadata.

This doc explains why Stone has a compose-time validator, what shape the validator’s questions take, what it catches, and — just as importantly — what it can’t. It assumes you have read Ashlars and Edge Primitives.

6.6.1 Validation as a compose-time check

A real Stone pipeline is expensive to run. The hot parts are LLM calls, agent loops, tool dispatch, and human prompts, and a pipeline of any real size strings several of them together. That makes a whole class of bugs uniquely painful: a structural mistake a downstream ashlar will hit, but only after every upstream ashlar has done its real work. A typo in a #:queries list. A refactor that renamed a produced type in one place and not another. A fresh ashlar-match branch that forgot to query the config loaded before the match. Each of those is visible in the topology itself, but without a validator it surfaces twenty minutes in, after the bill is paid.

The validator exists to turn that twenty-minute failure into a one-second check. validate-pipeline takes a pipeline, walks the tree once, and returns a validation-result with whatever errors and warnings it found. The call is pure and fast enough to run on every commit in CI, every startup before a deployed pipeline’s first real run, or every save in a live dev loop.

6.6.2 What metadata the validator reads

Every ashlar-meta carries the pieces a structural check needs:

produces-all — the node types this ashlar adds to the DAG.
queries — the node types this ashlar reads.
children — the sub-ashlars, if this ashlar is a composite.
validate-walk — a function (ashlar available -> (values errors new-available)) that encodes how this ashlar participates in static validation. Each composition primitive supplies its own walk rule (leaf, sequence, loop, match, map, parallel, reduce); custom composites built on make-scoped-ashlar can supply their own.
lens — a lens extractor, when the ashlar uses one.
schema — the JSON schema derived from #:response-format, when the ashlar is an LLM ashlar that returns structured output.

All of it is set at construction time. make-ashlar populates the leaf fields directly. The composition primitives each build the metadata of the composite they return by unioning or filtering their children’s. By the time run-pipeline is called, the tree is fully decorated, and the validator never has to run anything to learn what it needs.

6.6.3 Walk-rule dispatch

The heart of validate-pipeline is a thin driver: it recurses into each ashlar’s children, invoking the ashlar’s own validate-walk rule to ask what types the ashlar adds to the available set and what errors to raise. The walker doesn’t switch on any tag — each ashlar carries its own rule, and the rules below describe what the built-in compositions do.

Each walk mirrors the way its primitive actually runs at runtime — but statically. The rules below describe "available inside" and "available after" for each one.

Leaf. Check the ashlar’s queries against the available set. Any query not in the set becomes a 'missing-producer error. Return available plus the ashlar’s own produces-all.

Sequence. Walk the children left to right, threading the available set through. An ashlar at position three in a sequence sees everything that was available before the sequence plus everything the first two children produced. This is the rule that makes ~> feel like normal threading even though nothing is literally passed. The walker also reads the most recent child’s schema field for lens validation. The branching composites — ashlar-match, ashlar-map, and ashlar-parallel — deliberately leave their schema field #f, because their output semantics are ambiguous, so lens validation across those boundaries is skipped rather than guessed.

Loop. A loop body sees what was available before the loop plus everything the body itself produces. The self-reference is not a trick: loops run multiple times, each iteration writes to the same DAG the next iteration reads from, so a query on a type the body produces is genuinely satisfied by the previous iteration’s work.

Match. Each branch is walked against the same pre-match available set. After the match, the available set grows by the intersection of every branch’s new types — types every branch is guaranteed to produce — and the union minus intersection becomes a set of 'maybe-unavailable warnings. Those are the types produced by some branches but not all; the validator can’t know at compose time which branch will run, so it flags them rather than admitting them or rejecting them.

Parallel and map. Lanes are isolated. Each lane is walked with the same pre-fan-out available set, because at runtime each lane gets a snapshot and can’t see its siblings. After the fan-out, the outer scope’s available set grows by the union of every lane’s produces-all — every lane actually runs, so that part isn’t conditional.

Reduce. The body is walked against the available set unchanged. ashlar-reduce is thin at runtime. At construction time, the wrapper copies the inner ashlar’s name and schema into its own fields, so a validator error attributed to the reduce wrapper identifies the inner ashlar by name.

The thing worth noticing is that these rules are just the runtime semantics translated into set algebra over types. The validator does what the interpreter does, except that it tracks only node types rather than nodes, and the whole walk finishes in milliseconds.
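To make the set algebra concrete, here is a toy walker over a simplified tree. It is a sketch under stated assumptions, not Stone's validate-walk: the structs leaf, seq, loop-node, match-node, and fan-out are hypothetical, the walk returns warnings as a separate value, and lens checks and the reduce rule are left out.

```racket
#lang racket
;; Toy model of the walk rules; struct and function names are hypothetical.
(struct leaf (name queries produces))
(struct seq (children))
(struct loop-node (body))
(struct match-node (branches))
(struct fan-out (lanes))

;; walk : node set -> (values errors warnings available-after)
(define (walk node available)
  (cond
    [(leaf? node)
     (values (for/list ([q (leaf-queries node)]
                        #:unless (set-member? available q))
               (list 'missing-producer (leaf-name node) q))
             '()
             (set-union available (list->set (leaf-produces node))))]
    [(seq? node)
     ;; Thread the available set left to right.
     (for/fold ([errs '()] [warns '()] [avail available])
               ([child (seq-children node)])
       (define-values (e w a) (walk child avail))
       (values (append errs e) (append warns w) a))]
    [(loop-node? node)
     ;; The body may query what a previous iteration produced, so walk
     ;; once to learn what it adds, then check it against that larger set.
     (define-values (_e _w after-one) (walk (loop-node-body node) available))
     (walk (loop-node-body node) after-one)]
    [(match-node? node)
     ;; Same starting set per branch; only the intersection is guaranteed.
     (define results
       (for/list ([b (match-node-branches node)])
         (define-values (e w a) (walk b available))
         (list e w (set-subtract a available))))
     (define news (map third results))
     (define all (apply set-union (set) news))
     (define sure (if (null? news) (set) (apply set-intersect news)))
     (values (append-map first results)
             (append (append-map second results)
                     (for/list ([t (set-subtract all sure)])
                       (list 'maybe-unavailable t)))
             (set-union available sure))]
    [(fan-out? node)
     ;; Lanes are isolated; every lane runs, so take the union.
     (define results
       (for/list ([lane (fan-out-lanes node)])
         (define-values (e w a) (walk lane available))
         (list e w a)))
     (values (append-map first results)
             (append-map second results)
             (apply set-union available (map third results)))]))

;; 'notes comes from one branch only; 'proposal has no producer at all.
(define pipeline
  (seq (list (leaf 'load-config '() '(config))
             (match-node (list (leaf 'fast '(config) '(plan))
                               (leaf 'slow '(config) '(plan notes))))
             (leaf 'apply '(plan proposal) '(result)))))
(define-values (errors warnings available) (walk pipeline (set)))
```

In this example the walk reports one missing producer for 'proposal and one 'maybe-unavailable warning for 'notes, without running anything that resembles an ashlar body.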

6.6.4 What the validator catches

Concretely, four kinds of problem surface as validation-error values, each tagged with a symbol in the type field:

'missing-producer. An ashlar queries 'foo but no upstream ashlar produces it. The error names the offending ashlar, the offending query type, and a human-readable message. This is the most common finding, and the one the validator was built for.

'maybe-unavailable. A query is satisfied by some branches of a match but not all. Because control-flow uncertainty lives outside the validator’s model, this surfaces as a warning rather than a hard error — the designer may know something the validator doesn’t. When it shows up, the usual fix is to produce the type in every branch (even as a defaulted stub), or to refactor the downstream query into one of the branches where it’s definitely available.

'invalid-lens. An ashlar-match uses a lens extractor to pull a value from the most recent node, and the previous ashlar in the sequence has a schema whose properties don’t contain the lens’s first path segment. This catches typos in lens expressions and schema drift, such as a field renamed in one place that the lens wasn’t updated for. It’s a hard error, because the mismatch would make the extractor throw at runtime.

'fanout-not-reduced. An ashlar-map or ashlar-parallel whose output flows to a downstream consumer without an ashlar-reduce between them. Fan-outs produce multiple lane outputs, and the runtime returns whichever was last in source order, a semantically meaningless choice. Requiring an immediately-following reducer forces the designer to collapse the lanes explicitly via dag-query-all, so downstream ashlars see a principled aggregate instead of a source-order accident.

(define result (validate-pipeline my-pipeline))
(if (validation-ok? result)
    (displayln "pipeline is valid")
    (for ([err (validation-errors result)])
      (displayln (validation-error-message err))))

validation-ok? returns #t when the errors list contains only warnings (currently just 'maybe-unavailable) or is empty.

6.6.5 What the validator doesn’t catch

The validator’s job is narrow on purpose. It doesn’t run ashlars, and so it doesn’t catch anything that only surfaces at runtime:

LLM responses that don’t match the schema, network failures, flaky tools, timeouts. None of those can be known before the call is made.
Semantic mismatches between a query and its producer. If two ashlars both call a type 'requirement but mean different things by it, the validator says they match. Type names are a coordination vocabulary, not a type system in the compiler-theory sense.
Loop termination. An ashlar-loop will exhaust its #:max or satisfy its #:until predicate, and which one happens depends on runtime state the validator doesn’t touch.
External state. A tool that reads a file the validator can’t see, a service the pipeline talks to that happens to be down — the validator is oblivious to both.

All of those are jobs for failure nodes, observability, and the test suite. The validator’s contribution is taking one specific failure mode — structural incoherence visible in the topology itself — and moving its detection from minute-nineteen of a run to millisecond-one of raco stone validate.

6.6.6 Enumeration

Two helpers read the same metadata and answer adjacent questions.

enumerate-ashlars returns a flat list of every ashlar name reachable from the pipeline root. It’s useful for inventorying a pipeline, driving simple visualizations, or generating a checklist for a review.

One adjacent check worth naming: the orphan-ashlar guard. An ashlar can be defined, exported from a module, and never actually wired into a pipeline. The validator doesn’t catch that, because an unused ashlar isn’t a structural error, but enumerate-ashlars gives you the list of ashlars actually reachable from your pipeline root, and diffing that against your module’s declared ashlars catches leaks:

(require racket/set) ; list->set and set-member?

(define declared '(load-config classify implement summarize cleanup))
(define wired    (list->set (enumerate-ashlars pipeline)))
(define orphans  (filter (lambda (s) (not (set-member? wired s)))
                         declared))
(unless (null? orphans)
  (error 'my-pipeline "declared but not wired: ~a" orphans))

Run that alongside validate-pipeline at load time, in CI, or in a test, and "I added an ashlar and forgot to use it" becomes a caught bug instead of a silent dead-code entry.

enumerate-paths returns the distinct execution paths through the pipeline as lists of ashlar names. Sequences concatenate along a single path. Matches and parallels fork into several. Loops collapse to a single path representing one iteration of the body.
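The three path rules can be illustrated on a toy tree. This is a sketch under stated assumptions rather than enumerate-paths itself: the quoted-list encoding and the paths function are hypothetical, and only sequences, matches, and loops are modelled.

```racket
#lang racket
;; Toy path enumeration; the tree encoding here is hypothetical.
;; A node is a symbol (leaf name) or a tagged list:
;;   (seq child ...)  (match branch ...)  (loop body)
(define (paths node)
  (cond
    [(symbol? node) (list (list node))]
    [else
     (define children (rest node))
     (case (first node)
       ;; Sequences concatenate: every path through one child is
       ;; followed by every path through the next.
       [(seq)
        (for/fold ([acc '(())])
                  ([child children])
          (for*/list ([prefix acc] [suffix (paths child)])
            (append prefix suffix)))]
       ;; Matches fork: each branch contributes its own paths.
       [(match) (append-map paths children)]
       ;; Loops collapse to one iteration of the body.
       [(loop) (paths (first children))]
       [else (error 'paths "unknown node: ~a" node)])]))

(define example
  '(seq load-config
        (match classify-fast classify-slow)
        (loop (seq implement verify))
        summarize))
(define example-paths (paths example))
```

The example yields two paths, one per match branch, and each contains the loop body exactly once.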

6.6.7 Running validation

The most direct entry point is the function:

(require stone/validate)

(define result (validate-pipeline my-pipeline))
(unless (validation-ok? result)
  (for ([err (validation-errors result)])
    (printf "~a: ~a~n"
            (validation-error-ashlar-name err)
            (validation-error-message err))))

For scripting and CI, the same check is wrapped by a CLI: raco stone validate <file> loads a pipeline module, validates it, and reports results with a non-zero exit on hard errors. The whole pass takes milliseconds because nothing runs — which is the point. Wire it into a pre-commit hook and the class of "I refactored a produces symbol and broke three downstream queries" bugs becomes the kind a machine catches, not the kind that costs a pipeline run.

6.7 Observability

A running Stone pipeline is a tree of ashlars calling ashlars, an LLM being prompted, tools being dispatched, humans being asked, failures being produced. Watching that as it happens — and reconstructing it afterward — is the job of observability.

This doc explains how Stone’s observability is shaped, why it’s shaped that way, and how a consumer plugs in. It assumes you have read Ashlars and Edge Primitives.

6.7.1 Observability as a module-level concern

In Stone, observability isn’t something you pass around. It’s not an argument to an ashlar, not a field in a context struct, not a hook you install on a pipeline object. It’s a single logger — stone-logger — bound once at module load, which every ashlar in the framework emits to. Consumers subscribe. Producers neither know nor care who is listening.

The alternative is to thread an observer through every call: an ashlar takes an observer as an argument, passes it to the ashlars it composes with, and the tree is plumbed end to end. That shape works until the day ashlar-loop wraps a body and needs to know what kind of observer to hand it; until the day a middleware inside an agent inside an ashlar-map lane wants to log something and has to reach four levels up the call stack to find the thing. Composition primitives would all need to grow an awareness of observability that has nothing to do with their job.

The module-level logger makes that whole class of plumbing disappear. A deeply nested ashlar emits to stone-logger. If no receiver is subscribed, emission is nearly free. If a receiver is subscribed — anywhere in the process — every event flows to it without any ashlar in the composition tree ever being asked to cooperate.

6.7.2 Using Racket’s logging

Stone’s observability is built on Racket’s built-in logging facility: loggers and log receivers from racket/base, plus the helpers in racket/logging. There is no custom protocol. The logger is defined once, at the top of stone/logging.rkt:

(define-logger stone #:parent #f)

#:parent #f is the quiet but important half of that line. It isolates the logger: Stone events only reach receivers that explicitly subscribe to stone-logger, and they do not bubble up to Racket’s root logger. That matters because Stone is the kind of library embedded in a larger Racket application, and a flood of 'ashlar-start/'ashlar-end events propagating up into the application’s default log would be a nuisance. Isolation means Stone’s (deliberately noisy) internal events stay where they belong until a consumer asks for them.
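The isolation is easy to observe with a throwaway logger. The sketch below uses a hypothetical logger named demo rather than stone-logger, and relies only on define-logger, make-log-receiver, log-message, and sync/timeout from racket/base.

```racket
#lang racket/base
;; Stand-alone demo with a throwaway logger named demo; not Stone's logger.
(define-logger demo #:parent #f)

;; One receiver on the isolated logger, one on the current (root) logger.
;; Both filter on the 'demo topic so unrelated events cannot interfere.
(define direct (make-log-receiver demo-logger 'info 'demo))
(define root   (make-log-receiver (current-logger) 'info 'demo))

(log-message demo-logger 'info 'demo "ashlar-start"
             (hasheq 'event 'ashlar-start))

;; sync/timeout 0 polls without blocking.
(define seen-direct (sync/timeout 0 direct))
(define seen-root   (sync/timeout 0 root))
```

Here seen-direct holds the log vector while seen-root is #f: with no parent, the event has nowhere to propagate.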

Adopting the built-in logger has quiet benefits beyond "not writing a protocol." Racket’s logger is thread-safe. It’s effectively free when nothing is subscribed. Receivers are composable — two consumers can subscribe to the same logger at different levels without interfering. And with-intercepted-logging, a testing utility that knows how to capture events from any logger, comes with Racket, so Stone gets a test harness for free.

6.7.3 Correlation via parameters

A Stone pipeline is a tree of calls, and the observability layer needs a way to reconstruct that tree from the stream of events. Every event Stone emits therefore carries three correlation fields: a trace-id, a span-id, and a parent-span-id. Together they form a tree that mirrors the call tree exactly.

Those three fields come from three Racket parameters:

(define current-trace-id        (make-parameter #f))
(define current-span-id         (make-parameter #f))
(define current-parent-span-id  (make-parameter #f))

A caller sets them at the top of a pipeline run with parameterize:

(parameterize ([current-trace-id (generate-id "run-")]
               [current-span-id (generate-id "span-")])
  (run-pipeline pipeline (make-dag)))

Racket parameters have dynamic extent. Inside a parameterize body — and inside every function that body calls, to any depth — the parameter reads return the bound value. Control leaving the parameterize scope restores the previous binding. No ashlar has to receive the trace-id as an argument or thread it through its return value; it just reads (current-trace-id) when it needs it.

The span tree is where this gets interesting. make-ashlar’s wrapper, every time it runs an ashlar, re-parameterizes the span-id with a fresh value and captures the previous one as the parent:

(parameterize ([current-parent-span-id (current-span-id)]
               [current-span-id (generate-id "span-")])
  ; run the ashlar body here
  ...)

That handful of lines is what makes span correlation automatic. An ashlar calling another ashlar becomes the parent span. An ashlar calling an ashlar calling an ashlar becomes a grandparent span. The span tree matches the call tree with no manual plumbing anywhere in the pipeline, because the call tree is the dynamic parameterization scope.
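A small self-contained model shows the parent links falling out of nested parameterize forms. It is a sketch under stated assumptions: with-span and record! are hypothetical helpers, and generate-id is a simple counter rather than whatever Stone uses.

```racket
#lang racket/base
;; Hypothetical model of span correlation; generate-id here is a simple
;; counter, not Stone's implementation.
(define current-span-id        (make-parameter #f))
(define current-parent-span-id (make-parameter #f))

(define counter 0)
(define (generate-id prefix)
  (set! counter (add1 counter))
  (string-append prefix (number->string counter)))

;; Each recorded event is (list name span-id parent-span-id).
(define events '())
(define (record! name)
  (set! events
        (cons (list name (current-span-id) (current-parent-span-id))
              events)))

;; Run body in a fresh span whose parent is the caller's span.
;; parameterize evaluates both right-hand sides before rebinding, so
;; (current-span-id) below still reads the caller's value.
(define (with-span name body)
  (parameterize ([current-parent-span-id (current-span-id)]
                 [current-span-id (generate-id "span-")])
    (record! name)
    (body)))

(with-span 'outer
  (lambda ()
    (with-span 'middle
      (lambda ()
        (with-span 'inner void)))))

(define trace (reverse events))
```

Each entry in trace names its parent's span id, and none of the three bodies received an id as an argument.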

6.7.4 Events carry typed data

Emission happens through a single helper, stone-event, which is the only API ashlars use to produce events. Its job is to enrich the data hash with the event symbol, the current correlation parameters, and a timestamp, then dispatch to the right log level:

(stone-event 'info 'api-call
  (hasheq 'ashlar-name effective-name
          'model effective-model
          'messages messages
          ...))

Every event that reaches a receiver has the same shape of data hash: 'event (a symbol naming the kind), 'trace-id, 'span-id, 'parent-span-id, 'timestamp, and whatever additional keys the caller passed in. The event symbol is the discriminator; the rest of the hash is event-specific detail.
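The enrichment step can be approximated in a few lines. This is a sketch under stated assumptions, not the source of stone-event: emit-event and demo-logger are hypothetical, and the timestamp format (milliseconds since the epoch) is chosen here only for illustration.

```racket
#lang racket/base
;; Hypothetical re-creation of the enrichment step; emit-event and
;; demo-logger are stand-ins, not Stone's stone-event or stone-logger.
(define-logger demo #:parent #f)
(define current-trace-id       (make-parameter #f))
(define current-span-id        (make-parameter #f))
(define current-parent-span-id (make-parameter #f))

;; Add the discriminator, the correlation fields, and a timestamp to the
;; caller's hash, then log it at the requested level.
(define (emit-event level event data)
  (define enriched
    (hash-set* data
               'event event
               'trace-id (current-trace-id)
               'span-id (current-span-id)
               'parent-span-id (current-parent-span-id)
               'timestamp (current-inexact-milliseconds)))
  (log-message demo-logger level 'demo (symbol->string event) enriched))

(define receiver (make-log-receiver demo-logger 'info))
(parameterize ([current-trace-id "run-1"] [current-span-id "span-1"])
  (emit-event 'info 'api-call (hasheq 'model "example-model")))

;; The structured hash sits at index 2 of the log vector.
(define data (vector-ref (sync receiver) 2))
```

The caller supplied only 'model; the remaining keys were read from the dynamic context at emission time.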

The framework emits a known vocabulary of events at deliberate levels:

Debug — per-ashlar bookkeeping that is noisy by design. 'ashlar-start and 'ashlar-end (emitted by make-ashlar’s wrapper around every body call), and 'middleware-run (emitted as each middleware’s guard is evaluated).
Info — the default observable layer. 'api-call and 'api-response (around every LLM call inside make-agent-ashlar’s internal loop), 'agent-ashlar-start and 'agent-ashlar-end (around make-agent-ashlar’s outer wrapper), and 'tool-dispatch (one per tool call). This is the set a TUI or a trace-writer subscribes to by default.
Error — failure events, auto-emitted when any ashlar creates a failure node. More on that below.

User-defined ashlars aren’t limited to the framework vocabulary. An ashlar author who wants to emit a domain-specific event at any level, with any data hash, calls stone-event with that symbol. The correlation fields and timestamp get baked in automatically.

6.7.5 Receivers and the consumer pattern

A consumer subscribes to stone-logger via make-log-receiver (provided by racket/base) and runs a background thread that pulls events off the receiver:

(require racket/logging)

(define receiver (make-log-receiver stone-logger 'info))

(thread
  (lambda ()
    (let loop ()
      (define evt (sync receiver))
      ; evt is a vector: #(level message data topic)
      (handle-event evt)
      (loop))))

Two details are worth pausing on. First, the receiver is subscribed to stone-logger specifically, not to (current-logger). This is the consequence of #:parent #f: Stone events never reach the root logger, so a receiver listening to the root logger sees nothing from Stone. Second, the event shape is a Racket log vector — #(level message data topic) — and the structured data hash that stone-event built is at index 2. The consumer’s job is to pull that hash out and do whatever it wants with it: append a line to a JSONL file, update a TUI pane, forward to a metrics sink, increment a counter. Stone doesn’t care.

The level passed to make-log-receiver filters what gets delivered. A receiver subscribed at 'info sees info, warning, error, and fatal; it doesn’t see debug. A receiver subscribed at 'debug sees everything. The filtering happens before the event is enqueued, so a debug-level emission in a tight loop is cheap when nobody is listening at debug level.
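Level filtering can be checked directly with two receivers on one logger. The sketch uses a hypothetical demo logger in place of stone-logger; drain is an illustrative helper that polls a receiver and keeps each event's 'event key from the data hash at index 2.

```racket
#lang racket/base
;; Throwaway logger for the demo; not Stone's logger.
(define-logger demo #:parent #f)
(define info-rx  (make-log-receiver demo-logger 'info))
(define debug-rx (make-log-receiver demo-logger 'debug))

(log-message demo-logger 'debug 'demo "ashlar-start"
             (hasheq 'event 'ashlar-start))
(log-message demo-logger 'error 'demo "failure"
             (hasheq 'event 'failure))

;; Drain a receiver without blocking and keep only each event's kind.
(define (drain rx)
  (let loop ([acc '()])
    (define evt (sync/timeout 0 rx))
    (if evt
        (loop (cons (hash-ref (vector-ref evt 2) 'event) acc))
        (reverse acc))))

(define info-seen  (drain info-rx))
(define debug-seen (drain debug-rx))
```

The info receiver sees only the error-level event, while the debug receiver sees both.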

6.7.6 Testing with with-intercepted-logging

Tests that want to assert on what a pipeline emitted use with-intercepted-logging, also from racket/logging. It runs a thunk, captures events that match a filter through an interceptor function, and lets the test inspect them afterward:

(define events '())
(with-intercepted-logging
  (lambda (evt) (set! events (cons evt events)))
  (lambda () (run-pipeline my-pipeline (make-dag)))
  #:logger stone-logger
  'info 'stone)

The critical wrinkle is #:logger stone-logger. Without it, with-intercepted-logging attaches to the default logger, and the test sees nothing, because, again, #:parent #f means Stone events never reach the default logger. A test that forgets this keyword captures no events at all, so an assertion that some event is absent passes vacuously, and the outcome has nothing to do with what the pipeline actually did.
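The effect of the keyword can be reproduced without Stone. In this sketch demo-logger stands in for stone-logger, and capture and run-work are hypothetical helpers; with-intercepted-logging itself comes from racket/logging.

```racket
#lang racket/base
(require racket/logging)

;; Throwaway isolated logger; stands in for stone-logger.
(define-logger demo #:parent #f)

(define (run-work)
  (log-message demo-logger 'info 'demo "api-call"
               (hasheq 'event 'api-call)))

;; Returns the events captured while running thunk. When no logger is
;; given, interception attaches to (current-logger), as it does by default.
(define (capture thunk #:logger [logger (current-logger)])
  (define events '())
  (with-intercepted-logging
    (lambda (evt) (set! events (cons evt events)))
    thunk
    #:logger logger
    'info 'demo)
  (reverse events))

(define with-keyword    (capture run-work #:logger demo-logger))
(define without-keyword (capture run-work))
```

The same work produces one captured event when the isolated logger is named and none when it is not.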

6.7.7 Failures are auto-logged

When any ashlar creates a failure node via make-failure-node, an error-level event is emitted automatically. The mechanism is a hook: dag.rkt exposes a parameter failure-log-handler that make-failure-node calls if it’s set, and stone/logging.rkt installs a handler at module load time that emits through stone-event at 'error level. The effect is that every failure node, anywhere in a pipeline, generates a log record without the ashlar needing to remember to emit one.

This is the right default. Failures are by definition the moments a pipeline wants to know about, and requiring ashlars to opt in to logging them would mean the moments that matter most are the ones most likely to be missed. The hook-at-load-time shape also keeps the dependency one-way: dag.rkt doesn’t import stone/logging.rkt, so the DAG module stays standalone and logging is a layer installed on top of it by whoever requires both.
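The one-way hook can be sketched in a single file, with comments marking which half would live in which module. The parameter name failure-log-handler follows the text; everything else is an assumption, including the single-argument make-failure-node and the shape of the node it returns.

```racket
#lang racket/base
;; Hypothetical model of the hook; the real make-failure-node takes
;; arguments this sketch does not try to reproduce.

;; "DAG side": knows nothing about logging, only about an optional handler.
(define failure-log-handler (make-parameter #f))

(define (make-failure-node reason)
  (define node (hasheq 'type 'failure 'reason reason))
  (define handler (failure-log-handler))
  (when handler (handler node))
  node)

;; "Logging side": installs a handler once, when the module is loaded.
;; A real handler would emit an error-level event; this one just records.
(define logged '())
(failure-log-handler
 (lambda (node) (set! logged (cons (hash-ref node 'reason) logged))))

;; An ashlar creating a failure does nothing special to get it logged.
(define node (make-failure-node "tool timed out"))
```

If the installing expression never runs, make-failure-node still works and simply logs nothing, which is the silent case the next paragraph describes.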

One trade-off is worth naming. If a user-defined ashlar creates failure nodes in a module that never (even transitively) requires stone/logging, the hook is never installed and those failures are silent. In practice, any ashlar that composes with the framework’s primitives has already transitively required the logging module through edge.rkt, so this is rarely a problem in real pipelines.

6.7.8 Streaming

The OpenAI-compatible caller emits per-token events through an outbox when one is provided to make-agent-ashlar; the Anthropic caller accepts an outbox argument and silently ignores it. See Provider constraints for the full story and what to do about it.

6.8 The test harness

The stone/test harness exists because one specific failure taught us that some bugs live below the layer ordinary mocks can reach. This doc explains the shape of the harness — why it intercepts where it does, why stubbing works by name, why two modes share one API — by walking back from that failure to the choices it forced.

It’s a companion to the Test ashlars with tool calls how-to. If you haven’t used the harness at all, read the how-to first; this doc assumes you have a rough sense of with-live-harness, with-mock-harness, ashlar-with-tool-stub, and check-tool-called?.

6.8.1 The problem: mocks can lie about LLM behavior

A live run of a TDD orchestrator against an empty project surfaced a class of bug that existing test suites couldn’t catch. The discover agent ashlar was wired with an ask_user tool and a prompt that told the model to ask the human when information was missing. Against an empty directory, the model did the worst possible thing: it wrote the question in prose ("Could you please tell me: 1. What language is this project written in? 2. What test framework...") as ordinary assistant text, instead of emitting a tool-use block. No tool dispatch ran. The channel never activated. The frontend was never prompted. The pipeline took the prose turn as the agent’s final output, produced an empty config node, and the next ashlar in the sequence happily wrote garbage.

A mock-based test couldn’t have caught this. The reason is specific: a mock of the ask_user tool substitutes in at a layer above the prompt boundary. The real failure is a prompt-behavior binding ("do the system prompt, the tool schema, and the DAG state conspire such that this model, under these conditions, reliably emits a tool call?"), and that binding only exists when a real LLM sees a real prompt with a real tool schema attached. The only test that catches it is one that runs the real call and then asks: did a tool dispatch happen, or not?

The harness exists so that question costs one line to ask.

6.8.2 Why intercept at tool dispatch

The harness could have intercepted in several places. Above, at the LLM response: capture the raw assistant turns, assert on their structure. Below, at the filesystem or network: capture whatever the tool actually did, assert on side effects. We chose the layer in between — the point where Stone’s make-tool middleware fires its handler.

Intercepting at the LLM response layer would tie tests to provider-specific message shapes. Stone is provider-agnostic; a test that reads response.content[0].type == "tool_use" is an Anthropic test, not a Stone test. Worse, the shape check is brittle — future model versions rearrange response content in minor ways, and every test that inspects raw responses breaks.

Intercepting at the side-effect layer (filesystem, stdin, network) is slow and fragile. Tools can be chained, batched, or stubbed at a lower level already; asserting on what write_file wrote is an integration test, not a unit test of the ashlar.

Tool dispatch is the right semantic boundary. It’s where an agent’s intentional behavior meets the outside world: "this ashlar decided to call ask_user with this argument." That is the content of the question the motivating failure was about. Assertions at this layer read the way humans talk about agent behavior.

6.8.3 Why stub by name, not by object identity

An earlier sketch of the API took a tool value as its stub target: (ashlar-with-tool-stub s my-tool-value stub-handler). That version was abandoned. The reason matters.

Ashlars compose. An ashlar author who wires a discover ashlar with an ask_user tool does so by constructing the tool once and handing it to make-agent-ashlar as middleware. In the production module, that ashlar binding is the thing everything else imports. An identity-based stubbing API would force a test to either import the tool value and the ashlar, and then ensure the same tool value ended up on the stub call — or, more likely, rebuild the entire ashlar from its constituent parts inside the test, duplicating the production wiring. Every wiring change would require a parallel edit in every test. Tests would drift from production wiring silently.

Name-based stubbing inverts the relationship. A test imports the ashlar. It asks: "on this ashlar, substitute the handler for whichever middleware has name 'ask_user." The question is identical to how a maintainer would describe the operation in prose. The test never needs to know which tool value was wired, only which tool name the ashlar declares. When production wiring changes — a new tool is added, an old one is reconfigured — the test still finds 'ask_user by name and still substitutes correctly.

6.8.4 Why rebuilder closures on ashlar-meta

An ashlar-meta is immutable. That’s not an implementation accident; it’s load-bearing. An ashlar is a value that can be composed and inspected freely because its identity doesn’t drift underneath consumers. Swapping one tool’s handler on a live ashlar would violate that.

The alternative shape — the one rejected — would have been to store middleware in a mutable box and let the test poke a new handler into the box. That works for one test, but it leaks: the next test sees the mutated ashlar, and the framework’s guarantee that an ashlar-meta is a stable value falls over.

The shape we took instead: every ashlar that supports stubbing carries a rebuilder closure on its ashlar-meta. The closure captures the kwargs of its constructor and knows how to rebuild the ashlar with a new middleware list. ashlar-with-tool-stub walks the ashlar tree, finds the ashlar that owns the tool in question, calls that ashlar’s rebuilder with the substituted middleware, then walks back up calling each parent’s rebuilder with the substituted child. At the end, a completely new ashlar value exists, identical to the original except for the one substituted handler. The original ashlar is untouched. Later tests that import the same production ashlar binding see the same original.

This pushes the cost of rebuilder plumbing onto the composition primitives — each of ~>, ashlar-loop, ashlar-match, and so on — but that cost is mechanical and one-time. Each primitive’s rebuilder is a closure over the exact same arguments it took when built, just with substituted children. The ceremony is small; the immutability is preserved.
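
The rebuilder idea can be sketched without any of Stone’s machinery. The snippet below is a toy under stated assumptions, not Stone’s implementation: toy-ashlar, make-toy-ashlar, and with-handler-stub are hypothetical names, the DAG is modeled as a plain list, and only a leaf ashlar is covered (there is no walk back up through composite parents). It shows a constructor closing over its own arguments so that stubbing by name yields a fresh value and leaves the original alone.

```racket
#lang racket/base

;; toy-ashlar is a hypothetical stand-in for ashlar-meta.
;; Field 0 holds the procedure that runs when the struct is applied.
(struct toy-ashlar (run name handlers rebuild)
  #:property prop:procedure 0)

;; handlers is an association list of (tool-name . handler-procedure).
(define (make-toy-ashlar name handlers)
  (toy-ashlar
   ;; run: ask the 'ask_user handler one question and append the
   ;; answer to the DAG (modeled here as a plain list).
   (lambda (dag)
     (define ask (cdr (assq 'ask_user handlers)))
     (append dag (list (ask "Which language?"))))
   name
   handlers
   ;; rebuild: same constructor, same name, substituted handler list.
   (lambda (new-handlers) (make-toy-ashlar name new-handlers))))

;; Name-based stubbing: swap the handler registered under tool-name
;; and return a new ashlar. Nothing is mutated.
(define (with-handler-stub ashlar tool-name stub)
  ((toy-ashlar-rebuild ashlar)
   (for/list ([entry (in-list (toy-ashlar-handlers ashlar))])
     (if (eq? (car entry) tool-name)
         (cons tool-name stub)
         entry))))

(define discover
  (make-toy-ashlar
   'discover
   (list (cons 'ask_user (lambda (q) (error "would block on a human"))))))

(define stubbed
  (with-handler-stub discover 'ask_user (lambda (q) "racket")))

(stubbed '("requirement")) ; => '("requirement" "racket")
```

The original discover value still carries the handler that raises, so a later test importing the same binding sees exactly what production sees.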

6.8.5 Why the recorder reuses stone-logger

Stone already emits a 'tool-dispatch event on every tool call through its module-level logger. Adding a second recording channel for tests — a parallel observer pattern, a hook list on make-tool, a test-only parameter thread — would mean two observation systems to keep in sync. Every time Stone gained a new tool-related event, both systems would need to learn about it.

The harness reuses the existing log surface. with-live-harness and with-mock-harness install a log receiver via make-log-receiver on stone-logger, filter to 'tool-dispatch events, and accumulate them into a per-test list that tool-calls, check-tool-called?, and the other accessors read from. The test harness is another consumer of the same stream the TUI consumes. There is no separate "test channel" — the TUI pattern and the test pattern are the same pattern.

A nice consequence: the recorder sees every tool dispatch, not just the stubbed ones. If a test stubs ask_user but not read_file, and the agent calls both, both calls are recorded. Assertions like check-tool-called? 'read_file work on pass-through tools with no additional machinery.

6.8.6 Why the recorder is synchronous

An early version of the recorder ran a pump thread: a background loop that pulled events off the log receiver and pushed them into the accumulator. Tests called (tool-calls) and got whatever the pump had captured so far.

This was non-deterministic. On a fast run, the agent finished, the test called (tool-calls), and the pump hadn’t yet scheduled. The test saw an empty list. On a slow run, the pump caught up and the test saw what it expected. The scheduling race was subtle enough that the failure mode looked like "this assertion is flaky" rather than "this recorder has a bug."

The current recorder drains the log receiver synchronously. Every call to (tool-calls) — and every assertion built on it — pulls pending events from the receiver in a tight sync/timeout 0 loop before reading the accumulator. There is no background thread, no scheduler dependency, no pump to go wrong. Ordering is guaranteed: events emitted before the assertion are visible to the assertion.

For a test harness, determinism beats throughput. We aren’t recording a million events a second; we’re recording a handful per test. The synchronous drain pays a negligible cost for a guarantee that "the tool calls I see are the tool calls that happened."
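
The synchronous drain is easy to reproduce with Racket’s standard logging primitives. The sketch below is an illustration, not the harness source: demo-logger stands in for stone-logger, recorded-tool-calls is a hypothetical accessor, and the event payloads are bare symbols rather than whatever record shape Stone actually emits.

```racket
#lang racket/base

;; demo-logger is a hypothetical stand-in for stone-logger.
(define demo-logger (make-logger 'demo))
(define receiver (make-log-receiver demo-logger 'info))

;; Accumulated tool-dispatch payloads, newest first.
(define recorded '())

;; Pull every pending event off the receiver without blocking.
;; Each event is a vector: level, message, data, topic.
(define (drain!)
  (let loop ()
    (define evt (sync/timeout 0 receiver))
    (when evt
      ;; Keep only events logged under the 'tool-dispatch topic.
      (when (eq? (vector-ref evt 3) 'tool-dispatch)
        (set! recorded (cons (vector-ref evt 2) recorded)))
      (loop))))

;; Hypothetical accessor: drain first, then read the accumulator.
(define (recorded-tool-calls)
  (drain!)
  (reverse recorded))

(log-message demo-logger 'info 'tool-dispatch "dispatch" 'ask_user)
(log-message demo-logger 'info 'turn "turn finished" #f)
(log-message demo-logger 'info 'tool-dispatch "dispatch" 'read_file)

(recorded-tool-calls) ; => '(ask_user read_file)
```

Because the drain runs on the calling thread, every event logged before the call is in the accumulator by the time the call returns; no background thread has to be scheduled first.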

6.8.7 Why two modes share one API

with-live-harness and with-mock-harness are different beasts internally. One wires a real HTTP caller, optionally bounds execution on a timeout thread, and records behavior that depends on the model’s disposition. The other wires a scripted caller that pops from a list of pre-built llm-response values, runs synchronously, and records behavior that depends on the author’s script. They don’t share implementation.

They share an API. The assertions check-tool-called?, check-tool-not-called?, and check-tool-call-count read identically in both. tool-calls returns the same record shape. ashlar-with-tool-stub produces a stubbed ashlar usable in either harness. stub-answer and stub-fn work in either.

The reason is migration cost. A test written against the mock harness today may, six months from now, want to become a live test — because a new model version is suspected of regressing, or because the wiring has changed and the mock script no longer matches production. If the two modes had different assertion vocabularies, migrating would mean rewriting every assertion in the test. Authors would hesitate, and tests would stay in the wrong mode longer than they should.

Shared API means migration is a one-line change: swap with-mock-harness #:responses ... for with-live-harness #:caller c and delete the script. Everything else in the test body stays the same. Authors migrate freely.

6.8.8 What’s out of scope in v1

Several related capabilities were deliberately deferred:

Transcript replay. The recorder captures live data per test; it doesn’t persist it for replay. A replay mode would need to solve "what counts as the same prompt?" (literal equality? structural match? semantic similarity?) and we didn’t want to commit to an answer. Replay is a future layer that can sit on top of the recorder; the recorder itself doesn’t bake in assumptions about it.
Retry / flakiness handling for live tests. Real LLM calls are occasionally flaky. The harness doesn’t retry or mark tests as flaky; it fails loudly and lets the author decide. Auto-retry hides behavior drift, which is the thing live tests exist to catch.
Multi-thread or async ask-human scenarios. The harness drains synchronously on the test thread. Ashlars that spawn background threads to coordinate with a frontend aren’t supported; the test would need to orchestrate those threads itself.
Frontend simulation. The harness intercepts at tool dispatch. It never touches the real channel registry or any frontend that might be listening. A stubbed ask_user doesn’t activate a channel; it returns the stub answer and the agent continues. Tests that need to exercise a real frontend exit the harness and run the full application.

Each deferral protects the same property: the harness does one thing — substitute tool handlers and observe tool dispatches — and does it the same way in every test. Features that would add modes or modes-within-modes were pushed out until the need for them is concrete.

6.9 The ashlar-pair pattern

Sometimes you want a single LLM step that both uses tools and produces structured output. A pipeline ashlar that calls read_file to gather context, then emits a JSON project configuration; a verifier that inspects the filesystem, then returns a structured pass/fail verdict; a research agent that browses the web and hands back a typed summary.

On Anthropic’s API, that works in a single make-agent-ashlar call: pass a tool-bearing middleware list and a #:response-format, and the model figures out when to emit tool calls versus when to emit the final structured answer.

On most OpenAI-compatible servers running open-weight models — vLLM being the dominant example — it doesn’t. Enforcing a JSON schema on the response format prevents the model from emitting tool_use-shaped output, and the step degrades: either tools never fire, or the schema is violated, or the response is empty.

The ashlar-pair pattern is the blessed way around it: split the one conceptual step into two ashlars, each of which does one thing the server is good at.

6.9.1 Why the combination breaks on vLLM

OpenAI-compatible servers running local models typically enforce response_format through a grammar-based decoder (Outlines, LM Format Enforcer, or a vLLM built-in). The decoder masks the model’s output vocabulary at every token step to match the JSON schema. That mask is incompatible with the tool-use content blocks the server would otherwise emit: tool calls have their own structural requirements, and the decoder’s mask forbids the tokens that would begin one.
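
A toy model of that masking step makes the conflict concrete. Everything in the sketch below is an assumption for illustration: the three-token vocabulary, the scores, the "<tool_call>" opener, and the pick-token helper are made up, and real decoders mask logits over the full vocabulary at every step rather than choosing from a small hash.

```racket
#lang racket/base

;; Made-up scores a model might assign to its next token.
;; "<tool_call>" is a hypothetical token that would open a tool call.
(define scores (hash "<tool_call>" 0.9 "{" 0.4 "Sure" 0.2))

;; A schema that demands a JSON object allows only "{" as the first token.
(define (allowed-by-schema? tok) (equal? tok "{"))

;; Return the highest-scoring token among those the mask allows.
(define (pick-token scores allowed?)
  (for/fold ([best #f] [best-score -inf.0] #:result best)
            ([(tok score) (in-hash scores)])
    (if (and (allowed? tok) (> score best-score))
        (values tok score)
        (values best best-score))))

(pick-token scores (lambda (tok) #t)) ; => "<tool_call>"
(pick-token scores allowed-by-schema?) ; => "{"
```

Unmasked, the model would open a tool call. Under the schema mask that token is never eligible, however strongly the model prefers it.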

Concretely, with response-format set and tools attached:

vLLM silently drops the tools and produces schema-valid output without calling anything.
Or the schema decoder and the tool machinery collide and produce malformed output the caller can’t parse — you get a 'llm-parse-failed failure from make-agent-ashlar.
Or the model enters a loop, particularly with reasoning-style models, where it plans a tool call inside a thinking block and never emits it; you burn the turn budget and return 'loop-exhausted.

Anthropic’s API handles the combination because the decoding constraint and the tool machinery are coordinated inside the provider, not bolted on by a grammar decoder at the edge. If you only deploy against Anthropic, you can skip this pattern entirely.

The ashlar-pair pattern isn’t a workaround for a Stone design flaw — it’s how you talk to a server that enforces structure one way at a time.

6.9.2 The pattern: explorer + structurer

Two ashlars, connected by the DAG:

An explorer ashlar: tools enabled, no #:response-format. Its job is to gather whatever context the step needs by calling tools, and to emit free-form output (a string or a loosely-shaped hash). The explorer may run many turns; it ends when the decide function says to stop.
A structurer ashlar: no tools, #:response-format set to the target schema. Its job is to take the explorer’s output as context and produce exactly the structured hash the rest of the pipeline wants. The structurer typically runs one turn.

Composed with ~>, the pair behaves like a single logical step from the outside:

(define explorer
  (make-agent-ashlar caller
    #:produces 'exploration
    #:queries  '(requirement)
    #:middleware (list (read-file) (list-directory))
    #:system (lambda (dag) "Inspect the filesystem and describe ...")
    #:user (lambda (dag)
             (node-text (dag-nearest-ancestor dag 'requirement)))))

(define structurer
  (make-agent-ashlar caller
    #:produces 'project-config
    #:queries  '(exploration)
    #:max-turns 1
    #:middleware '()
    #:response-format project-config-schema
    #:system (lambda (dag)
               "Return the project config as strict JSON matching the schema.")
    #:user (lambda (dag)
             (node-text (dag-nearest-ancestor dag 'exploration)))))

(define propose-config (~> explorer structurer))

The explorer’s conversation — every tool call, every assistant turn — is preserved on the 'exploration node under the 'conversation key, so the structurer sees the full history when it assembles its prompt. Downstream code that just wants the final 'project-config node reads that and doesn’t care which ashlar produced it.

6.9.3 When to split and when not to

Split when:

You’re targeting a vLLM / ollama / llama.cpp server and the step needs tools plus a schema.
The step’s responsibilities are genuinely distinct — "gather" and "decide" — and splitting makes the explorer reusable on its own (e.g. its exploration node feeds several downstream structurers).
You want the explorer to run many turns while the structurer stays tightly constrained to schema emission.

Don’t split when:

You’re only deploying against Anthropic — combining tools and response-format in one make-agent-ashlar call is fine there, and the extra ashlar adds latency for no benefit.
The step doesn’t actually need tools. Pass #:response-format on its own and call it a day.
The step doesn’t need a schema. An explorer on its own, whose output is free-form, is the simpler pattern.

6.9.4 Trade-offs

The pattern has costs that are worth saying out loud.

Two LLM calls instead of one. The explorer gathers, the structurer transforms. On a hosted model charging per token that’s real latency and cost.

Context duplication. The structurer’s prompt has to restate enough of the problem to produce the schema, because it doesn’t share the explorer’s conversation state directly — it sees the explorer’s output as DAG context. In practice, the structurer’s prompt is short ("take this and shape it into the schema") and the explorer’s output is the context.

Two places to tune prompts. A system prompt for the explorer, another for the structurer. They can drift.

The payoff is that each ashlar is doing one thing the server actually supports, and the failure modes are legible: a failed 'exploration node means the explorer couldn’t gather; a 'llm-parse-failed on 'project-config means the structurer emitted output that couldn’t be parsed. Both are easier to debug than a silent tool omission or a decoder collision.

6.9.5 Related provider constraints

The reason the ashlar-pair pattern exists at all is a specific case of a broader phenomenon: different providers expose different subsets of the LLM feature surface, and Stone doesn’t paper over the differences. If you’re deploying against Qwen-family models on vLLM, you’ll also need to disable thinking mode on your caller; see Provider constraints in the explanation section for the full picture.

6.10 Provider constraints

Stone is provider-agnostic — the same ashlars, the same composition primitives, and the same DAG run whether your caller points at Anthropic, at a vLLM server hosting Qwen, or at an ollama instance running a quantized Llama. Provider-agnostic does not mean provider-uniform. Each server has its own quirks, its own defaults, and its own bugs.

Stone doesn’t hide those. When a specific provider has a known trap, Stone exposes the knob (usually #:extra-body on the caller) and documents the constraint here. This page catalogs what’s currently known and what you need to do about it.

6.10.1 Qwen3.5 on vLLM: thinking-by-default

Qwen3.5-family models ship with reasoning mode enabled by default: the server emits a <think>...</think> block before the final answer. For free-text prompts that’s fine. For schema-enforced outputs, it’s usually fatal:

The thinking block eats seconds of wall-clock time — often tens of seconds on complex prompts. Stone’s test harness default timeout is 60 seconds, and schema-enforced ashlars regularly blow past it.
When tools are attached, Qwen3.5 tends to plan the tool call inside the thinking block and never emit the actual tool_use content. The turn returns empty. Loop exhausted.

The /no_think soft-switch that earlier Qwen releases supported was removed in 3.5. The only supported escape hatch is chat_template_kwargs: {enable_thinking: false} in the request body. Stone exposes that through #:extra-body on make-openai-caller:

(define caller
  (make-openai-caller
    #:url "http://localhost:8000"
    #:extra-body (hasheq 'chat_template_kwargs
                         (hasheq 'enable_thinking #f))))

The caller closes over #:extra-body at construction time, so every request this caller sends will carry the flag. No per-call plumbing, no per-ashlar override.

If you’re hosting Qwen and don’t want thinking disabled globally — maybe one pipeline needs it for reasoning-heavy tasks — construct two callers and hand each to the ashlars that want it. The caller is a value like any other; having more than one is fine.

See the Configure a caller how-to for the full caller construction recipe, including how to toggle thinking on a per-pipeline basis.

6.10.2 vLLM: schema and tools can’t co-exist in one call

Enforcing response_format through vLLM’s grammar decoder masks the tokens the model would use to emit tool calls. If you pass a schema and a tool middleware list to the same make-agent-ashlar and your caller points at vLLM, the step will fail in one of three ways: tools never fire, the output is malformed and triggers 'llm-parse-failed, or the turn budget runs out while the model waits to emit a tool call it’s grammatically forbidden from starting.

Anthropic handles the combination. Most OpenAI-compatible servers running open-weight models don’t.

The blessed workaround in Stone is the ashlar-pair pattern: split into an explorer (tools, no schema) and a structurer (schema, no tools). See the ashlar-pair pattern explanation for the full discussion, including the explorer/structurer code sketch and when to split versus when not to.

6.10.3 Anthropic: streaming isn’t implemented yet

The #:outbox argument on make-agent-ashlar forwards per-token events from the caller. The OpenAI-compatible caller implements streaming — it hits the SSE endpoint, parses each token, and emits token-events through the outbox. The Anthropic caller accepts the outbox argument and silently ignores it.

This is a real limitation. If you need per-token events on the Anthropic path — for a streaming TUI, for early-stopping on a marker, for anything time-sensitive — you’ll get the final response as one block when the API responds, not a stream. Turn-level events (turn-event) still fire for both callers.

The reason this is still an open gap rather than a fixed feature is that Anthropic’s streaming format uses content-block deltas whose semantics — particularly around tool-use blocks and thinking blocks — need their own adapter, and that work is a separate design pass. The caller API won’t change when it lands; wiring an outbox to make-anthropic-caller will just start producing events instead of being a no-op.

6.10.4 OpenAI-compatible providers: reasoning controls diverge

"OpenAI-compatible" covers a lot of ground. Stone’s make-openai-caller works against vLLM, ollama, llama.cpp’s server, LM Studio, LiteLLM, Fireworks, Together, Groq, DeepSeek, Qwen’s DashScope, and OpenAI itself. They all speak something close to the same wire format. They diverge on how to control reasoning.

The field is young enough that no two providers agree. Known variations today:

Qwen3.5 on vLLM: chat_template_kwargs: {enable_thinking: false}.
OpenAI o1/o3: reasoning_effort: "low"|"medium"|"high".
DeepSeek: model-only; pick deepseek-chat for no reasoning, deepseek-reasoner for chain-of-thought.
Fireworks / Together on open-weight reasoning models: provider-specific flags that change across releases.
xAI Grok: nested object with budget.

Stone’s answer is #:extra-body on the caller constructors. Whatever the provider’s current incantation is, pass it as a hash and it rides on every request. Reserved keys (model, messages, max_tokens, tools, response_format, stream) raise at call time if you try to override them through #:extra-body — those fields are load-bearing for Stone itself.
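
The reserved-key rule amounts to a guarded hash merge. The sketch below is a guess at the shape, not the caller’s actual code: merge-extra-body is a hypothetical helper, the model name is a placeholder, and the sketch raises while merging, whereas the text above only says that Stone raises at call time.

```racket
#lang racket/base

;; The reserved keys named in the text above.
(define reserved-keys
  '(model messages max_tokens tools response_format stream))

;; Hypothetical helper: fold the extra-body hash into the request body,
;; refusing any key the framework owns.
(define (merge-extra-body base extra)
  (for/fold ([body base]) ([(key value) (in-hash extra)])
    (when (memq key reserved-keys)
      (raise-arguments-error 'merge-extra-body
                             "reserved key in extra body"
                             "key" key))
    (hash-set body key value)))

;; Placeholder request body; "qwen" is not a real model identifier.
(define base (hasheq 'model "qwen" 'stream #f))

(define merged
  (merge-extra-body base
                    (hasheq 'chat_template_kwargs
                            (hasheq 'enable_thinking #f))))

(hash-ref (hash-ref merged 'chat_template_kwargs) 'enable_thinking) ; => #f
```

Passing (hasheq 'model "other") as the extra body would raise instead of silently replacing the model the caller was constructed with.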

If and when the field converges on a single shape, Stone will add first-class sugar (#:thinking-mode or similar) on top of #:extra-body. Until then, the escape hatch is the supported path.

6.10.5 Tool schemas: Anthropic shape in, provider format out

When you write make-tool, you declare the tool’s input in Anthropic’s schema shape. The OpenAI-compatible caller converts that to the function schema OpenAI servers expect on the way out; the Anthropic caller sends it through unchanged.

That conversion is mechanical and rarely surprising, but it’s worth knowing when debugging: if a tool fires on Anthropic but not on OpenAI-compatible, check whether your schema uses features the conversion can’t represent (deeply nested oneOf clauses with discriminators, for example, have shape differences between the two formats). Stone’s conversion is conservative — it covers the common cases — and the unusual cases usually work better on whichever side you started from.
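
For the common case, the conversion is a re-nesting of the same three fields. The sketch below assumes tools are represented as hasheq values and uses a hypothetical anthropic-tool->openai-tool helper. It is not Stone’s converter and handles none of the unusual cases mentioned above.

```racket
#lang racket/base

;; Hypothetical converter from Anthropic's tool shape
;; (name, description, input_schema) to the OpenAI function-tool shape
;; (type "function" wrapping name, description, parameters).
(define (anthropic-tool->openai-tool tool)
  (hasheq 'type "function"
          'function
          (hasheq 'name (hash-ref tool 'name)
                  'description (hash-ref tool 'description "")
                  'parameters (hash-ref tool 'input_schema))))

;; An illustrative tool declaration in Anthropic's shape.
(define ask-user-tool
  (hasheq 'name "ask_user"
          'description "Ask the human a clarifying question."
          'input_schema
          (hasheq 'type "object"
                  'properties (hasheq 'question (hasheq 'type "string"))
                  'required '("question"))))

(define converted (anthropic-tool->openai-tool ask-user-tool))

(hash-ref (hash-ref converted 'function) 'name) ; => "ask_user"
```

The JSON Schema under input_schema passes through untouched as parameters, which is why schema features with different semantics on the two sides are where surprises show up.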

6.10.6 What to do when you hit a provider wall

The shape of a provider constraint is usually the same: a request works in curl against the provider’s API, doesn’t work through Stone. The order of investigation:

Is this a known case on this page? (Thinking mode, schema + tools, streaming.) If so, apply the documented fix.
Can you reproduce the curl-works / Stone-doesn’t shape with a stubbed caller that prints the request body? If the body matches curl and the server still rejects it, the problem is in the header handling or the content-type negotiation — look at the caller source.
Is the server silently coercing a field? (Some OpenAI-compatibles rewrite max_tokens to their own maximum.) The #:extra-body escape hatch is also where the fix goes: override the field at construction time.

The goal of this page isn’t to list every quirk of every provider — that list would be a moving target. It’s to say, honestly, what Stone does and doesn’t paper over, so you can plan around it.

6.11 Harness vs. pipeline

One disambiguation up front. Stone’s test harness (stone/test) is a unit-testing tool — the thing that exercises a single ashlar against a real or mock caller. The 2026 industry term agent harness means something completely different: the runtime infrastructure around an LLM that turns it into an agent — loop, tools, memory, observability, evals, human-in-the-loop. This document is about the latter. When harness appears below without a qualifier, it means agent harness, never test harness.

A Stone pipeline is an agent harness. The word doesn’t appear in most of the docs because Stone is older than the term’s current usage, and because the framework’s own vocabulary (ashlars, DAG, ~>) does the real semantic work. This document is the bridge: if you arrive via Anthropic’s or LangChain’s or smolagents’ writing, here is the translation table.

6.11.1 What "agent harness" means in 2026

Across Anthropic-adjacent writing, MongoDB, LangChain, Parallel Web Systems, Firecrawl, and the smolagents docs, the dominant 2026 definition is roughly:

The software infrastructure surrounding an LLM that turns it into an agent — the loop, the tool layer, context/memory management, persistence, permissions, observability, and evals. Everything except the model’s reasoning.

Within that, eight components are named consistently:

Agent loop / orchestration — the perceive→plan→act→verify cycle.
Tool integration — function calling, MCP servers, capability routing.
Memory and state — working memory + persistent state; checkpoint / resume.
Context engineering — compaction, retrieval, summarization.
Observability and tracing — span/event streams, replay, audit.
Permissions and guardrails — allow-lists, sandboxing, approval gates.
Evals — out-of-band testing, regression detection.
Human-in-the-loop — approval channels, clarification asks.

The harness-vs-framework distinction (per Parallel, MongoDB): a framework offers building blocks (LangChain); a harness is an opinionated, mostly-assembled runtime (Claude Code, Codex, Cursor). An agent SDK (Anthropic Agent SDK, OpenAI Agents SDK) sits between — a library specifically for building harnesses. Stone is closer to the SDK position than the harness position: you compose your own harness from ashlars rather than reaching for a pre-assembled one.

6.11.2 Translation table

Each component of an agent harness has a Stone primitive (or a combination of them) that does the same job:

Harness component		Stone primitive
Agent loop		make-agent-ashlar with #:decide; ashlar-loop for outer-loop control
Tool integration		make-tool as middleware; run-command, read-file, write-file, etc. from stone/tools
Memory and state		the typed, append-only DAG — every ashlar’s output is a queryable node; the conversation list is embedded under 'conversation in the produced node
Context engineering		#:system / #:user builders are functions over the DAG; per-turn message threading is the framework’s job, but compaction / summarization is not shipped
Observability and tracing		stone/logging (events on a dedicated logger) + stone/trace (read JSONL traces) + raco stone trace (CLI)
Permissions and guardrails		#:allowed-paths on tool middleware; the adversary + healer pattern (#:adversary, #:heal-with) for content-level guardrails
Evals		stone/test with with-live-harness / with-mock-harness for live and mock callers
Human-in-the-loop		make-ask-human as a first-class ashlar with channel-based wiring
Declarative composition		~>, ashlar-loop, ashlar-match, parallel — a topology, not a Python class graph
Static validation		validate-pipeline catches missing producers and lens mismatches before any LLM call

6.11.3 Where Stone exceeds typical harnesses

Three things Stone does that the agent-harness conversation does not generally include:

Declarative ~> composition. Pipelines are values: you build them, validate them, recompose them. They are not Python class graphs you instantiate.
Static validation. validate-pipeline walks the topology and reports missing producers, lens mismatches, and unreachable branches before any LLM call. Most harnesses discover these at runtime.
Typed DAG state. Coordination between ashlars is a named, typed node graph — not implicit "memory" or hidden framework state. When a pipeline produces an unexpected result, you open the DAG and look.
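
The missing-producer check in particular is simple to picture. The sketch below is a toy for a linear sequence only: step and missing-producers are hypothetical names, and the real validate-pipeline also handles lenses, branches, and composite ashlars, none of which are modeled here.

```racket
#lang racket/base

;; Hypothetical record of what one step produces and queries.
(struct step (name produces queries))

;; Walk the steps in order and report each (step-name . query) pair whose
;; query has no producer earlier in the sequence or in the initial DAG.
(define (missing-producers steps initial-names)
  (for/fold ([missing '()] [known initial-names] #:result missing)
            ([s (in-list steps)])
    (define unmet
      (for/list ([q (in-list (step-queries s))]
                 #:unless (memq q known))
        (cons (step-name s) q)))
    (values (append missing unmet)
            (cons (step-produces s) known))))

(define pipeline
  (list (step 'explorer 'exploration '(requirement))
        (step 'structurer 'project-config '(exploration test-plan))))

(missing-producers pipeline '(requirement))
;; => '((structurer . test-plan))
```

Nothing in the check runs a step. It reads only the declared metadata, which is why it can finish before the first LLM call.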

6.11.4 Where Stone is not a deployed harness

Three things Stone deliberately does not ship that batteries-included harnesses do:

No context compaction. Stone does not summarize or truncate conversation history automatically; long-running agents that need this either build their own ashlar for it or cap turns.
No checkpoint / resume across processes. The DAG is in-process. You can serialize it (it’s all data), but Stone does not ship a checkpoint protocol the way some agent runtimes do.
No opinionated UI. The TUI launcher (raco stone) is a thin wrapper for local testing; if you want a Codex-style chat surface, you build it.

These are deliberate. The "harness construction kit" framing trades batteries-included convenience for the ability to compose, validate, and instrument the topology yourself. If you want a deployed harness in the shape of Claude Code, reach for Claude Code. If you want to assemble a harness shaped to a specific workflow — with the topology visible and validatable — Stone is that tool.

6.11.5 See also

Why Stone — broader positioning against agentic assistants, pipeline frameworks, and bare SDK calls.
Ashlars — the unit, in depth.
The DAG as Pipeline State — why coordination goes through a typed graph, not in-memory state.
Agents and Tools — the agent loop, tool middleware, adversary, and healer in detail.
