Architecture
This page gives a conceptual orientation. For the full developer reference
(adding a handler, the data flow, gotchas), see the contributing guide
and the Russian-language deep dive in
CONTRIBUTING.ru.md.
The three-layer separation
Handlers ──── how to corrupt
│
▼
Schemas ───── what to call the error
│
▼
Configs ───── how often each error fires
This separation is the most important thing to internalize. The same
handler (e.g. NounCaseErrorHandler) can be tagged differently
under different schemas (RLC's Gov, ERRANT's NOUN:CASE:R,
Rozental's mo_noun_case_other), and weighted differently under
different configs (rulec, gera, balanced).
You add new error logic in handlers. You change the taxonomy in schemas. You change the distribution in configs. The three rarely need to be edited together.
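The separation can be pictured as two independent lookup tables sitting beside the handler code. The sketch below is illustrative only: the tag strings and preset names come from the text above, but the dictionary layout, the subtype key `noun_case`, the `OtherHandler` entry, and every weight are assumptions made for the example, not synterr's actual data structures.

```python
import random

# Schema layer (hypothetical layout): handler subtype -> tag, per schema.
SCHEMAS = {
    "rlc": {"noun_case": "Gov"},
    "errant": {"noun_case": "NOUN:CASE:R"},
    "rozental": {"noun_case": "mo_noun_case_other"},
}

# Config layer (hypothetical layout, invented weights): handler name -> weight.
CONFIGS = {
    "rulec": {"NounCaseErrorHandler": 3.0, "OtherHandler": 1.0},
    "balanced": {"NounCaseErrorHandler": 1.0, "OtherHandler": 1.0},
}


def tag_for(schema: str, subtype: str) -> str:
    """Name the error: the same subtype gets a different tag per schema."""
    return SCHEMAS[schema][subtype]


def pick_handler(config: str, rng: random.Random) -> str:
    """Choose which handler fires, weighted by the config preset."""
    names = list(CONFIGS[config])
    weights = [CONFIGS[config][name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(tag_for("errant", "noun_case"))
    print(pick_handler("rulec", rng))
```

Changing a tag name touches only `SCHEMAS`, and changing how often a handler fires touches only `CONFIGS`; the handler code is not involved in either.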
The pipeline
clean text
│
▼
analyzer ─── tokenization, morphological tags, optional dep parse
│
▼
sample handler ── weighted by config preset
│
▼
handler.can_apply() ── per-token check (POS, dep_rel, features)
│
▼
handler.apply() ── return ErrorResult with corrupted form + label
│
▼
formatter ── GECToR tags / TSV / JSONL / chat / sft
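The stages above can be sketched as a single function. This is a toy under stated assumptions, not synterr's implementation: whitespace splitting stands in for the analyzer, the `ErrorResult` fields, the `DoubleLetterHandler` class, and the `corrupt` helper are hypothetical, and the `modified` argument is passed as an empty set because its real type is not described here.

```python
import random
from dataclasses import dataclass


@dataclass
class ErrorResult:
    # Hypothetical fields; the real ErrorResult may carry more information.
    corrupted: str
    label: str


class DoubleLetterHandler:
    """Toy handler that doubles the first letter of a word."""

    name = "double_letter"

    def can_apply(self, tokens, idx) -> bool:
        return tokens[idx].isalpha() and len(tokens[idx]) > 1

    def apply(self, tokens, sentence, idx, modified, rng=None):
        word = tokens[idx]
        return ErrorResult(corrupted=word[0] + word, label=self.name)


def corrupt(sentence, handlers, weights, rng):
    tokens = sentence.split()  # stand-in for the analyzer stage
    # Sample one handler, weighted as a config preset would weight it.
    handler = rng.choices(handlers, weights=weights, k=1)[0]
    # Per-token applicability check.
    candidates = [i for i in range(len(tokens)) if handler.can_apply(tokens, i)]
    if not candidates:
        return sentence, None
    idx = rng.choice(candidates)
    result = handler.apply(tokens, sentence, idx, modified=set(), rng=rng)
    if result is None:
        return sentence, None
    corrupted = tokens[:idx] + [result.corrupted] + tokens[idx + 1:]
    # A formatter would serialize this pair; here it is returned as-is.
    return " ".join(corrupted), result.label


if __name__ == "__main__":
    print(corrupt("the cat sat", [DoubleLetterHandler()], [1.0], random.Random(0)))
```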
Handler protocol
Every error handler implements this minimal contract:
class MyHandler:
    name = "my_handler"
    subtypes = ["my_subtype"]   # for schema mapping
    category = "OTHER"          # SPELL / MORPH / PUNCT / OTHER
    changes_length = False      # True if adds/deletes tokens

    def can_apply(self, tokens, idx) -> bool:
        ...

    def apply(self, tokens, sentence, idx, modified, rng=None) -> ErrorResult | None:
        ...
changes_length=True handlers (insertions, deletions) are applied
last so they don't shift token indices for other handlers.
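One simple way to get this ordering is a stable sort on the `changes_length` flag. The sketch below shows the idea with hypothetical handler classes and a hypothetical `order_handlers` helper; how synterr actually schedules handlers is not specified here.

```python
class CaseSwap:
    name = "case_swap"
    changes_length = False


class CommaInsert:
    name = "comma_insert"
    changes_length = True


class WordDelete:
    name = "word_delete"
    changes_length = True


def order_handlers(handlers):
    """Put length-preserving handlers first and length-changing ones last.

    sorted() is stable and False sorts before True, so the relative order
    within each group is preserved.
    """
    return sorted(handlers, key=lambda h: h.changes_length)


if __name__ == "__main__":
    ordered = order_handlers([CommaInsert, CaseSwap, WordDelete])
    print([h.name for h in ordered])
    # → ['case_swap', 'comma_insert', 'word_delete']
```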
Dep-tree-driven generation
A distinguishing design choice: handlers that generate agreement, government, and punctuation errors use dependency-tree heuristics rather than position heuristics.
- Agreement handlers (adj_*) traverse the amod arc to the head noun and use the head's features as the reference for confusion-matrix lookup.
- VerbPersonNumberHandler finds the nsubj dependent and uses the subject's number as reference.
- NounCaseErrorHandler only fires on governed positions (obl, nmod, iobj, obj), that is, true government errors.
- The punctuation classifier inspects the head's dep_rel to distinguish subordinate / compound / parenthetical / isolation / homogeneous comma contexts.
This requires use_depparse=True (slower but linguistically grounded).
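To make the contrast with position heuristics concrete, here is a minimal sketch of two of the checks described above. The `Token` dataclass, its field names, and both helper functions are assumptions for illustration; the analyzer's real token objects may look different.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    # Hypothetical token shape: surface form, relation to head, head index.
    text: str
    dep_rel: str
    head: int  # index of the head token, -1 for the root
    feats: dict = field(default_factory=dict)


# Governed positions listed in the text above.
GOVERNED = {"obl", "nmod", "iobj", "obj"}


def agreement_reference(tokens, idx):
    """Follow the amod arc and return the head noun's features, if any."""
    tok = tokens[idx]
    if tok.dep_rel != "amod" or tok.head < 0:
        return None
    return tokens[tok.head].feats


def is_governed(tokens, idx) -> bool:
    """True when the token sits in a position where a case error is a government error."""
    return tokens[idx].dep_rel in GOVERNED


# "красивую книгу читаю": the adjective agrees with its head noun,
# which is the object of the verb.
sentence = [
    Token("красивую", "amod", 1, {"Case": "Acc", "Gender": "Fem", "Number": "Sing"}),
    Token("книгу", "obj", 2, {"Case": "Acc", "Gender": "Fem", "Number": "Sing"}),
    Token("читаю", "root", -1, {"Person": "1", "Number": "Sing"}),
]

if __name__ == "__main__":
    print(agreement_reference(sentence, 0))
    print(is_governed(sentence, 1))
```

Because the lookup follows the arc rather than assuming the noun is the next token, it would still work if other words stood between the adjective and its head.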
Confusion matrices
Morphological handlers use empirical confusion matrices derived from
RLC (Russian Learner Corpus) for weighted grammeme substitution rather
than uniform random selection. See
docs/research/CASE_CONFUSION_PATTERNS.md
for the analysis.
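Weighted substitution of this kind can be sketched with `random.choices`. The counts below are invented placeholders, not RLC statistics, and the `CASE_CONFUSION` table and `sample_wrong_case` helper are hypothetical names used only for this example.

```python
import random

# Hypothetical confusion counts: correct case -> {wrong case: count}.
# Real values would come from corpus analysis.
CASE_CONFUSION = {
    "Gen": {"Nom": 50, "Acc": 30, "Loc": 20},
    "Acc": {"Nom": 70, "Gen": 30},
}


def sample_wrong_case(correct: str, rng: random.Random) -> str:
    """Draw a replacement case in proportion to its confusion count."""
    row = CASE_CONFUSION[correct]
    cases = list(row)
    weights = [row[case] for case in cases]
    return rng.choices(cases, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    print([sample_wrong_case("Gen", rng) for _ in range(5)])
```

With uniform selection every wrong case would be equally likely; with the weighted draw, frequent learner confusions appear more often in the generated data than rare ones.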
Multiple schemas
Synterr ships four schemas:
| Schema | Granularity | Use case |
|---|---|---|
| synterr (default) | Native handler subtype tags | Direct rule tracing in your own pipeline |
| rlc | 35 tags | Russian Learner Corpus annotation alignment |
| rozental | 8 / 29 / 100 tags (L0 / L1 / L2 hierarchy) | Rule-grounded error tracing |
| errant | ERRANT-style POS:operation tags | Cross-lingual GEC eval alignment |
When generating data, the same corruption is labeled with the corresponding tag from whichever schema you request. This is what makes synterr's output useful as both training data and an evaluation reference.