Architecture

This page is the conceptual orientation. For full developer reference (adding a handler, the data flow, gotchas) see the contributing guide and the Russian-language deep dive in CONTRIBUTING.ru.md.

The three-layer separation

Handlers ──── how to corrupt
Schemas ───── what to call the error
Configs ───── how often each error fires

This separation is the most important thing to internalize. The same handler (e.g. NounCaseErrorHandler) can be tagged differently under different schemas (RLC's Gov, ERRANT's NOUN:CASE:R, Rozental's mo_noun_case_other), and weighted differently under different configs (rulec, gera, balanced).

You add new error logic in handlers. You change the taxonomy in schemas. You change the distribution in configs. The three rarely need to be edited together.
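
As a rough illustration of the separation, the sketch below keeps the tag mapping and the firing weights in two independent tables keyed by a handler subtype. The subtype name `noun_case`, the dictionary layout, and all weight values are assumptions made for this example; only the tag strings and preset names come from the text above.

```python
# Hypothetical layout: schemas map a handler subtype to a tag,
# configs map the same subtype to a sampling weight.
SCHEMAS = {
    "rlc": {"noun_case": "Gov"},
    "errant": {"noun_case": "NOUN:CASE:R"},
    "rozental": {"noun_case": "mo_noun_case_other"},
}

# Illustrative weights only; these are not the real preset values.
CONFIGS = {
    "rulec": {"noun_case": 0.30},
    "gera": {"noun_case": 0.10},
    "balanced": {"noun_case": 0.20},
}


def label(subtype: str, schema: str) -> str:
    """Look up what a given schema calls this kind of error."""
    return SCHEMAS[schema][subtype]


def weight(subtype: str, config: str) -> float:
    """Look up how often a given config fires this kind of error."""
    return CONFIGS[config][subtype]


# The same corruption logic, named and weighted independently.
print(label("noun_case", "errant"), weight("noun_case", "rulec"))
```

Changing a tag touches only `SCHEMAS`, and changing a frequency touches only `CONFIGS`; the handler code is not involved in either edit.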

The pipeline

clean text
analyzer  ─── tokenization, morphological tags, optional dep parse
sample handler  ── weighted by config preset
handler.can_apply()  ── per-token check (POS, dep_rel, features)
handler.apply()  ── return ErrorResult with corrupted form + label
formatter  ── GECToR tags / TSV / JSONL / chat / sft
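
The stages above can be sketched as a single function. This is a toy version under stated assumptions, not synterr's implementation: `ErrorResult` is a stand-in dataclass, `UppercaseHandler` is a made-up handler, whitespace splitting replaces the real analyzer, an empty set is passed for `modified` (whose real type is not described here), and the "formatter" is just a returned tuple.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ErrorResult:  # stand-in for the library's result type
    corrupted: str
    label: str


class UppercaseHandler:  # hypothetical toy handler
    name = "uppercase"

    def can_apply(self, tokens, idx) -> bool:
        return tokens[idx].islower()

    def apply(self, tokens, sentence, idx, modified, rng=None):
        return ErrorResult(corrupted=tokens[idx].upper(), label="uppercase")


def corrupt(sentence: str, handlers, weights, rng) -> Tuple[str, Optional[str]]:
    tokens = sentence.split()  # stand-in for the analyzer
    # Sample one handler, weighted by the config preset.
    handler = rng.choices(
        handlers, weights=[weights[h.name] for h in handlers], k=1
    )[0]
    # Per-token check: keep only positions the handler accepts.
    candidates = [i for i in range(len(tokens)) if handler.can_apply(tokens, i)]
    if not candidates:
        return sentence, None
    idx = rng.choice(candidates)
    result = handler.apply(tokens, sentence, idx, modified=set(), rng=rng)
    if result is None:
        return sentence, None
    tokens[idx] = result.corrupted
    return " ".join(tokens), result.label  # stand-in for the formatter


rng = random.Random(0)
corrupted, tag = corrupt("the cat sat", [UppercaseHandler()], {"uppercase": 1.0}, rng)
print(corrupted, tag)
```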

Handler protocol

Every error handler implements this minimal contract:

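from __future__ import annotations  # keeps the ErrorResult annotation unevaluated

class MyHandler:
    name = "my_handler"
    subtypes = ["my_subtype"]   # for schema mapping
    category = "OTHER"           # SPELL / MORPH / PUNCT / OTHER
    changes_length = False       # True if adds/deletes tokens

    def can_apply(self, tokens, idx) -> bool:
        """Return True if the token at idx is a valid target (POS, dep_rel, features)."""
        ...

    def apply(self, tokens, sentence, idx, modified, rng=None) -> ErrorResult | None:
        """Return an ErrorResult with the corrupted form and label, or None."""
        ...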

changes_length=True handlers (insertions, deletions) are applied last so they don't shift token indices for other handlers.
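
One simple way to get this ordering is a stable sort on the flag. The sketch below is an assumption about how it could be done, not a description of synterr's scheduler, and the three handler classes are hypothetical stubs that carry only the attributes needed here.

```python
class CaseSwapHandler:  # hypothetical: rewrites a token in place
    name = "case_swap"
    changes_length = False


class WordDeletionHandler:  # hypothetical: removes a token
    name = "word_deletion"
    changes_length = True


class CommaInsertionHandler:  # hypothetical: adds a token
    name = "comma_insertion"
    changes_length = True


def application_order(handlers):
    """Put length-preserving handlers first, length-changing ones last.

    sorted() is stable and False sorts before True, so handlers keep
    their relative order within each group.
    """
    return sorted(handlers, key=lambda h: h.changes_length)


ordered = application_order(
    [WordDeletionHandler(), CaseSwapHandler(), CommaInsertionHandler()]
)
names = [h.name for h in ordered]
print(names)
```

With this order, every index picked by an in-place handler still points at the same token when that handler runs, because no insertion or deletion has happened yet.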

Dep-tree-driven generation

A distinguishing design choice: handlers that generate agreement, government, and punctuation errors use dependency-tree heuristics rather than position heuristics.

  • Agreement handlers (adj_*) traverse the amod arc to the head noun and use the head's features as the reference for confusion-matrix lookup.
  • VerbPersonNumberHandler finds the nsubj dependent and uses the subject's number as reference.
  • NounCaseErrorHandler only fires on governed positions (obl, nmod, iobj, obj) — true government errors.
  • The punctuation classifier inspects the head's dep_rel to distinguish subordinate / compound / parenthetical / isolation / homogeneous comma contexts.

These handlers need the dependency parse, so they require use_depparse=True, which is slower but linguistically grounded.
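
To make the first and third bullets concrete, here is a minimal sketch of the two lookups. The `Token` dataclass, its field names, and both helper functions are assumptions for illustration; the relation names (`amod`, `obl`, `nmod`, `iobj`, `obj`) are the ones listed above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:  # hypothetical token shape, not the analyzer's real type
    text: str
    pos: str
    dep_rel: str
    head: int  # index of the head token, -1 for the root
    feats: Dict[str, str]


def agreement_reference(tokens: List[Token], idx: int) -> Optional[Dict[str, str]]:
    """Follow the amod arc from an adjective to its head noun's features."""
    tok = tokens[idx]
    if tok.dep_rel != "amod" or tok.head < 0:
        return None
    head = tokens[tok.head]
    return head.feats if head.pos == "NOUN" else None


GOVERNED = {"obl", "nmod", "iobj", "obj"}


def is_governed(tokens: List[Token], idx: int) -> bool:
    """True only for nouns sitting in a governed position."""
    return tokens[idx].pos == "NOUN" and tokens[idx].dep_rel in GOVERNED


# "Я читаю красивую книгу" ("I am reading a beautiful book")
noun_feats = {"Case": "Acc", "Gender": "Fem", "Number": "Sing"}
sent = [
    Token("Я", "PRON", "nsubj", 1, {"Case": "Nom", "Number": "Sing"}),
    Token("читаю", "VERB", "root", -1, {"Number": "Sing", "Person": "1"}),
    Token("красивую", "ADJ", "amod", 3, dict(noun_feats)),
    Token("книгу", "NOUN", "obj", 1, dict(noun_feats)),
]
print(agreement_reference(sent, 2), is_governed(sent, 3))
```

Because the reference features come from the head rather than from a neighbouring position, the lookup still works when other words intervene between the adjective and its noun.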

Confusion matrices

Morphological handlers use empirical confusion matrices derived from RLC (Russian Learner Corpus) for weighted grammeme substitution rather than uniform random selection. See docs/research/CASE_CONFUSION_PATTERNS.md for the analysis.
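
Weighted substitution of this kind can be sketched with `random.choices`. The matrix below is a placeholder: the weights are invented for the example and are not the RLC-derived values, and the function name and dictionary layout are assumptions.

```python
import random

# Placeholder weights for illustration only; not taken from RLC.
# Row: the correct case. Columns: the wrong case to substitute.
CASE_CONFUSION = {
    "Gen": {"Nom": 0.5, "Acc": 0.3, "Loc": 0.2},
    "Acc": {"Nom": 0.7, "Gen": 0.3},
}


def sample_wrong_case(correct: str, rng: random.Random) -> str:
    """Draw a replacement case in proportion to its confusion weight."""
    row = CASE_CONFUSION[correct]
    targets = list(row)
    return rng.choices(targets, weights=[row[t] for t in targets], k=1)[0]


rng = random.Random(0)
draws = [sample_wrong_case("Gen", rng) for _ in range(1000)]
print({case: draws.count(case) for case in CASE_CONFUSION["Gen"]})
```

Uniform selection would give every wrong case the same share of draws; the weighted draw instead reproduces whatever skew the matrix encodes.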

Multiple schemas

Synterr ships four schemas:

| Schema | Granularity | Use case |
| --- | --- | --- |
| synterr (default) | Native handler subtype tags | Direct rule tracing in your own pipeline |
| rlc | 35 tags | Russian Learner Corpus annotation alignment |
| rozental | 8 / 29 / 100 tags (L0 / L1 / L2 hierarchy) | Rule-grounded error tracing |
| errant | ERRANT-style POS:operation tags | Cross-lingual GEC eval alignment |

When generating data, the same corruption gets the right tag for whichever schema you ask for. This is what makes synterr's output useful as both training data and evaluation reference.