Extraction Before Ontology: Open-Vocabulary Parsing and Late Canonicalization

Kendall Clark · Pentad Labs · 22 June 2026 · PLRN-015


Abstract

An agent’s memory is only as good as what gets written into it, and what gets written is often decided at ingestion, when natural language is turned into facts. The tempting design fixes a small set of relation labels and makes extraction the act of sorting each sentence into one of them. The design fails on the first interesting sentence. “Marie Curie discovered radium” states a discovery, and if the label set does not contain discovery the extractor either forces the sentence into a label that distorts it or records nothing, and the thing the text actually said is gone before any reasoning has had a chance to use it.

The fault lies in putting ontology (in the sense of Quine: the world-stuff over which one’s variables may range) before extraction.

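WunderOS separates the two jobs: extraction records what the text states, in the vocabulary of the text, and canonicalization maps it to the system’s preferred forms afterward, as a separate and reversible step.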

The extraction path is composed rather than monolithic: a deterministic dependency parse recovers asserted clausal structure, a learned relation model recovers the relations the parse misses, and an embedding gate does the mapping, with no LLM on the hot path. Because the two jobs are separate, they are measured separately: what the text asserts and what reasoning later infers never share a denominator.

The vocabulary of extraction is the language. The vocabulary of reasoning is the ontology. Conflating them discards evidence at write time, before any decision has asked for it.

1. The fixed-label trap

We learned this one the hard way and the scars are worth a moment of reflection.

The initial WunderOS extractor carried its ontology into the extraction step. Its vocabulary was a small closed set of structural relations, on the order of eight: type, is-a, part-of, has-a, causes, caused-by, implies, before-and-after. Every sentence had to be sorted into one of these or marked as carrying no relation.

The set is a reasonable inference ontology. It is a ruinous extraction target.

The ruin is the same one the golden record commits in PLRN-013, displaced from reconciliation to ingestion. To produce one of eight labels the extractor must treat everything outside the eight as error. “Discovered,” “composed,” “acquired,” “filed,” “rebutted”: each is a predicate the text states plainly, and each is outside the set, so each is either bent into a neighbor that means something else or dropped. The information was in the sentence, the extractor was the bottleneck, and the loss happened at write time, before any query had a chance to want the predicate the text used. An extractor that can only emit what its ontology already contains can never record anything its ontology did not anticipate, which for open-ended enterprise text is most of what matters.

The lesson is to let the vocabulary of extraction be the vocabulary of the text. What is stated should be recorded as stated, and the mapping to the system’s preferred forms should be a separate act, performed later, where it can be measured, tuned, and reversed without touching what was extracted.

Extraction is physics; all else is projection, history, and, if you live long enough, the trauma of the limit. Deal.

2. Two jobs, two granules

Extraction and canonicalization are different jobs and they operate on different granules. Extraction reads a span of text and recovers the propositions it asserts: who did what to what, under what circumstance, on whose authority. Its fidelity criterion is whether it recovered the predicate the text used. Canonicalization reads a recovered predicate and projects it to a target the system reasons over: whether “discovered” should be stored as “discovered,” or folded into a domain relation the corpus already knows, or related to a structural atom that carries inference. Its fidelity criterion is whether the projection preserved meaning.

This is the extract-then-canonicalize pattern that recent knowledge-graph construction work has converged on, naming the stages and keeping them apart (Zhang and Soh, Extract-Define-Canonicalize, 2024; Lairgi et al., iText2KG, 2024). The substrate adopts the separation and gives it the property the rest of the system insists on: each stage is measured on its own terms, so that a number reported for extraction is a number about what the text asserts, and a number reported for the canonicalized graph is a number about what the system can reason over, and the two are never averaged into a figure that means neither.

3. The Pentad is the extraction target

Extraction emits Pentads. The Pentad is the substrate’s five-slot fact, (S, P, O, C, L): subject, predicate, object, context, lineage, the structure of PLRN-003 that fell out of the substrate’s binding physics and was found (by others) to agree, slot for slot, with Sanskrit grammarians’ analysis of how an event binds its participants. Extraction fills the slots from the parse.

The subject and object slots hold the doer and the thing done to, the kartā and karma roles. They are read from the grammatical subject and object of the clause and normalized so that the active and passive statements of one fact land in the same shape. The predicate is the verbal head, lemmatized but not coerced, the open-vocabulary slot where “discovered” survives as discovered. The context slot takes the modifiers the older extractor was blind to, and takes them typed: temporal, locative, conditional, manner, and causal context are recognized from the grammar and attached as distinct kinds rather than concatenated into a string, so that “in 1898” and “in Paris” enter the fact as a time and a place and not as undifferentiated text. The lineage slot is filled by construction with the provenance of PLRN-006: the source span, the transducer version, and the address that lets the fact be re-derived. Recording the parse this way is the discipline of PLRN-009 carried into ingestion: the structure the parser found is kept, not flattened.
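As a rough illustration of the five slots, the sketch below models a Pentad as an immutable record with a typed context. The class and field names, the lineage address format, and the transducer version string are all assumptions made for the example; the note does not specify WunderOS’s actual types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ContextKind(Enum):
    # The five modifier kinds named in this section.
    TEMPORAL = "temporal"
    LOCATIVE = "locative"
    CONDITIONAL = "conditional"
    MANNER = "manner"
    CAUSAL = "causal"


@dataclass(frozen=True)
class ContextItem:
    kind: ContextKind   # the type is kept, not folded into the string
    text: str


@dataclass(frozen=True)
class Lineage:
    source_span: Tuple[int, int]   # character offsets into the source text
    transducer_version: str        # hypothetical version label
    address: str                   # hypothetical re-derivation address


@dataclass(frozen=True)
class Pentad:
    subject: str
    predicate: str                 # open vocabulary: whatever the text used
    obj: str
    context: Tuple[ContextItem, ...]
    lineage: Lineage


sentence = "Marie Curie discovered radium in 1898 in Paris"
fact = Pentad(
    subject="Marie Curie",
    predicate="discovered",        # lemmatization is omitted in this sketch
    obj="radium",
    context=(
        ContextItem(ContextKind.TEMPORAL, "in 1898"),
        ContextItem(ContextKind.LOCATIVE, "in Paris"),
    ),
    lineage=Lineage((0, len(sentence)), "example-transducer-0", "example://doc/1#s0"),
)

# Because context is typed, a time can be selected without string matching.
times = [c.text for c in fact.context if c.kind is ContextKind.TEMPORAL]
```

Keeping the context as typed items rather than one string is what lets a later query ask for the time or the place of a fact directly.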

4. A composed pipeline, not one model

The extraction path is composed of parts with different strengths, because no single component is best at the whole job and the substrate will not put a model on the hot path where it can possibly avoid one.

The primary path is a deterministic dependency parse (UDPipe, over Universal Dependencies; Straka and Straková, 2017; Nivre et al., Universal Dependencies). It produces a grammatical analysis in a few milliseconds, it is exact and replayable, and it recovers the asserted clausal structure of ordinary declarative text reliably: subject, predicate, object, and the typed modifiers of section 3 are read directly off the parse. For the large fraction of agentic exhaust that states its facts plainly, the deterministic path is both faster and more precise than a learned model, and it leaves a parse that can be audited.
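To make “read directly off the parse” concrete, here is a minimal sketch that reads subject, predicate, and object from hand-written dependency tokens and normalizes active and passive voice to one shape. It does not call UDPipe. The Token type and the read_clause helper are hypothetical, and the relation names (nsubj, obj, nsubj:pass, obl:agent) follow Universal Dependencies conventions, which individual treebanks may annotate differently.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    id: int
    form: str
    head: int      # 0 marks the root of the clause
    deprel: str    # Universal Dependencies relation label


def read_clause(tokens: List[Token]) -> Optional[Tuple[str, str, str]]:
    """Return (doer, predicate, thing done to), or None if unresolved."""
    root = next((t for t in tokens if t.head == 0), None)
    if root is None:
        return None
    # Index the root's direct dependents by relation label.
    deps = {t.deprel: t for t in tokens if t.head == root.id}
    if "nsubj" in deps and "obj" in deps:
        # Active voice: grammatical subject is the doer.
        return deps["nsubj"].form, root.form, deps["obj"].form
    if "nsubj:pass" in deps and "obl:agent" in deps:
        # Passive voice: the agent phrase is the doer.
        return deps["obl:agent"].form, root.form, deps["nsubj:pass"].form
    return None  # left for the fallback path


active = [
    Token(1, "Google", 2, "nsubj"),
    Token(2, "acquired", 0, "root"),
    Token(3, "Fitbit", 2, "obj"),
]
passive = [
    Token(1, "Fitbit", 3, "nsubj:pass"),
    Token(2, "was", 3, "aux:pass"),
    Token(3, "acquired", 0, "root"),
    Token(4, "by", 5, "case"),
    Token(5, "Google", 3, "obl:agent"),
]
```

Both token lists yield the same triple, which is the sense in which active and passive statements of one fact land in the same shape.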

The fallback path is a learned, schema-driven relation extractor of the GLiNER line (Zaratiana et al., GLiNER, 2023). It is slower and it is reserved for the sentences the deterministic parse cannot resolve, where the relation is implicit or spread across clauses in a way grammar alone does not settle. It runs only when the primary path yields nothing, so its cost is paid on the residual rather than on every sentence, and its recall on implicit relations is bought without paying its latency on the common case.
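The dispatch between the two paths can be sketched as a small combinator: run the primary extractor, and call the fallback only when the primary yields nothing. The two extractors below are toy stand-ins, not UDPipe or a GLiNER-line model, and every name here is hypothetical.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
Extractor = Callable[[str], List[Triple]]


def compose(primary: Extractor, fallback: Extractor) -> Callable[[str], Tuple[List[Triple], str]]:
    """Build an extractor that pays the fallback's cost only on the residual."""
    def extract(sentence: str) -> Tuple[List[Triple], str]:
        facts = primary(sentence)
        if facts:
            return facts, "primary"
        return fallback(sentence), "fallback"
    return extract


def toy_parse(sentence: str) -> List[Triple]:
    # Stand-in for the deterministic parse: handles only "X verb Y".
    words = sentence.rstrip(".").split()
    return [(words[0], words[1], words[2])] if len(words) == 3 else []


fallback_calls: List[str] = []


def toy_learned(sentence: str) -> List[Triple]:
    # Stand-in for the learned model: records that it was invoked.
    fallback_calls.append(sentence)
    return []


extract = compose(toy_parse, toy_learned)
plain = extract("Curie discovered radium.")
residual = extract("Radium, which Curie later isolated, glows.")
```

The tag returned alongside the facts is one way to keep the two paths separately auditable; whether the real pipeline records it this way is not stated in the note.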

The canonicalization gate is neither a parser nor an extractor but an embedding lookup, and section 5 describes it. The point of the composition is that each stage does the part it is best at: deterministic grammar for asserted structure, a learned model for the implicit residual, and dense vectors for the mapping to canonical form. None of the three is a general-purpose language model invoked per fact, and the whole path runs without one.

5. Canonicalization as a late, reversible gate

Canonicalization maps a surface predicate to a canonical relation, and it is built so that the open-vocabulary default is preserved whenever the mapping is not confident. The surface predicate is embedded with a small sentence encoder and compared, by cosine distance, against the centroids of a corpus of known relations. If the nearest centroid lies within a distance threshold, the predicate is stored under that centroid’s canonical label; if it does not, the predicate is kept verbatim. “Discovered” maps to a corpus relation when the corpus has a near one and stays “discovered” when it does not, and either way the fact is recorded.
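A minimal sketch of the gate, assuming the predicate has already been embedded: find the nearest centroid by cosine distance and either map or keep. The two-dimensional vectors, the labels, and the threshold of 0.2 are invented for illustration and stand in for a real encoder and corpus.

```python
import math
from typing import Dict, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # treat a zero vector as maximally uninformative
    return 1.0 - dot / (na * nb)


def canonicalize(
    surface: str,
    vector: Sequence[float],
    centroids: Dict[str, Sequence[float]],
    max_distance: float,
) -> str:
    """Map to the nearest centroid's label if close enough, else keep verbatim."""
    best_label, best = None, float("inf")
    for label, centroid in centroids.items():
        d = cosine_distance(vector, centroid)
        if d < best:
            best_label, best = label, d
    if best_label is not None and best <= max_distance:
        return best_label
    return surface  # the open-vocabulary default: nothing is discarded


# Toy corpus and toy embeddings (assumptions, not real encoder output).
centroids = {"acquire": (1.0, 0.0), "author": (0.0, 1.0)}
mapped = canonicalize("purchased", (0.9, 0.1), centroids, max_distance=0.2)
kept = canonicalize("discovered", (0.6, 0.6), centroids, max_distance=0.2)
```

In this toy setup “purchased” sits close to the acquire centroid and is mapped, while “discovered” is equidistant from both centroids, misses the threshold, and is kept as stated.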

The corpus is layered. A generic corpus of common relations applies everywhere, and a tenant may carry its own corpus whose relations override the generic ones for that tenant’s text, so that a relation that means something particular in one enterprise’s domain canonicalizes to that enterprise’s preferred form without imposing it on anyone else. The lookup is a scan over a small set of centroids, cheap enough to run inline, and it carries no model in the runtime sense: the encoder is a fixed embedding, the centroids are precomputed, and the decision is a distance and a threshold. Because the gate either maps a predicate or keeps it, and never discards it, canonicalization can be re-run with a better corpus without re-extracting, and a mapping can be revised without the original ever having been lost.
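The layering and the re-run property can be sketched together. Below, a gate is any function from a surface predicate to a canonical label or None; dictionary lookups stand in for the embedding lookup. Consulting the tenant corpus first and the generic corpus second is an assumption about how the override is resolved, and all names are hypothetical.

```python
from typing import Callable, List, NamedTuple, Optional

Gate = Callable[[str], Optional[str]]   # canonical label, or None if no match


class StoredFact(NamedTuple):
    surface: str     # predicate as extracted; never overwritten
    canonical: str   # current mapping; equals surface when unmapped


def layered(tenant: Gate, generic: Gate) -> Gate:
    """Tenant relations take precedence over generic ones."""
    def gate(surface: str) -> Optional[str]:
        hit = tenant(surface)
        return hit if hit is not None else generic(surface)
    return gate


def recanonicalize(facts: List[StoredFact], gate: Gate) -> List[StoredFact]:
    """Re-run the gate over stored surface predicates; no re-extraction."""
    out = []
    for f in facts:
        label = gate(f.surface)
        out.append(StoredFact(f.surface, label if label is not None else f.surface))
    return out


# Dict lookups as stand-ins for centroid matching.
generic: Gate = {"purchased": "acquire"}.get
tenant: Gate = {"purchased": "procure"}.get

extracted = [StoredFact("purchased", "purchased"), StoredFact("discovered", "discovered")]
first_pass = recanonicalize(extracted, generic)
second_pass = recanonicalize(first_pass, layered(tenant, generic))
```

Because the surface predicate rides along with every stored fact, the second pass can revise the mapping produced by the first without consulting the source text again.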

6. Asserted and inferred, measured apart

The eight labels of section 1 are not wrong per se; they were simply misplaced. They belong to inference, where a small closed set of relations with known inference rules earns its keep: is-a gives subsumption, part-of gives rollup, before gives temporal order. Returned to inference and kept out of extraction, the closed set does the work it is good at over a graph that was populated by open-vocabulary extraction rather than narrowed to fit it.

This is why the asserted layer and the inferred layer are measured on separate denominators. The asserted layer is what extraction emits: the predicate the text stated, recovered or not. The inferred layer is what reasoning derives: the closed atoms and their consequences. A system that reported one accuracy figure over both would be unable to say whether a wrong answer came from mis-reading the text or from mis-deriving a consequence, and those are different faults with different fixes. Keeping the denominators apart is the same honesty the rest of the substrate keeps about what it has measured.
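A small sketch of what separate denominators means in practice: each layer keeps its own count of attempts and successes, and no pooled figure is computed. The outcomes below are made-up illustrations, not measurements from this note, and the class name is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LayerScore:
    correct: int = 0
    total: int = 0

    def record(self, ok: bool) -> None:
        self.total += 1
        if ok:
            self.correct += 1

    @property
    def accuracy(self) -> Optional[float]:
        # None rather than 0.0 when nothing has been measured.
        return self.correct / self.total if self.total else None


def score(outcomes: Iterable[bool]) -> LayerScore:
    layer = LayerScore()
    for ok in outcomes:
        layer.record(ok)
    return layer


# Illustrative outcomes only.
asserted = score([True, True, False, True])   # was the stated predicate recovered?
inferred = score([True, False])               # was the derived consequence right?
```

Pooling these six outcomes would give a single figure of about 0.67 that describes neither layer; reported apart, a drop in one layer points at the stage that caused it.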

7. What it buys

The matured spine recovers the predicate the text states, over an open-vocabulary verb gold, at 0.87. The context slot that the prior extractor filled at zero by construction, being blind to modifiers, is now filled wherever the text carries a modifier, and the type assigned to the modifier, temporal against locative against the rest, is correct on held-out text at 0.90. These are extraction numbers, reported against what the text asserts; they are not the inference numbers, which are reported elsewhere against a different denominator.

The remaining error is catalogued rather than smoothed over. Ambiguous prepositions defeat the deterministic mapping where English itself is ambiguous: “by” and “over” can mark an agent or an instrument, “with” can mark a companion or a means, and the parse cannot always tell which. Proper nouns occasionally present as temporal expressions and are mistyped. A wrong root in the parse propagates into a wrong subject. These are named ceilings on the deterministic path, the places where the learned fallback and the later context work earn their keep, and they are tracked as known limits rather than reported as solved.

8. What this does not give yet

The stage canonicalizes predicates, the edges of the graph, and not yet entities, its nodes. Two mentions of one entity under different surface names are not yet folded together by this work; node-side canonicalization is named and unbuilt. The learned fallback runs today as offline tooling rather than in the runtime, and bringing it in-process is the named follow-up that makes the composed path of section 4 a single deployed pipeline rather than a primary path with an external assist.

The admission of a new domain relation into a tenant’s corpus, the moment a recurring surface predicate becomes a canonical target, is governed by a frequency criterion still being settled, and is called out here as open. And the verification gates that would check an extracted fact against entailment bounds and ontology constraints sit downstream of this spine and are not yet shipped; they presuppose the clean separation this note describes, which is part of why the separation came first.

The extract-then-canonicalize separation is the recent knowledge-graph construction line: Zhang and Soh’s Extract-Define-Canonicalize (2024), which names the stages, and Lairgi et al.’s iText2KG (2024), which builds the graph incrementally over a preserved log; the canonicalization-by-embedding step has its lineage in open knowledge-base canonicalization by representation clustering (Shen et al., 2022). The deterministic parser is UDPipe (Straka and Straková, 2017) over Universal Dependencies (Nivre et al.); the learned fallback is of the GLiNER line (Zaratiana et al., 2023). The five-slot target and its agreement with the Sanskrit kāraka analysis is PLRN-003.

Within the PLRN line, the note rests on the recorded-parse discipline of PLRN-009, the provenance-by-construction of PLRN-006 that fills the lineage slot, and the separation-of-write-time-from-decision-time argument of PLRN-013, of which the fixed-label trap is the ingestion-time instance. What this note adds is the claim that extraction must precede ontology: that the vocabulary of what gets recorded should be the vocabulary of the text, and the mapping to the vocabulary of reasoning should be a separate, measured, reversible stage that runs afterward.

A note on method

Written in conversation with Claude Opus 4.8 (Anthropic) as structured interlocutor and prose editor. The research backstop was assembled in part with Paper Lantern. The motives, ideas, claims, framing, and architectural commitments are mine.

Kendall Clark · k@pentad.ai
Great Falls, Virginia
June 2026