Extending synterr
Synterr has six extension points. Each has a clean protocol contract and an existing example you can copy. Pick the one that matches what you're trying to do.
Decision tree
| You want to… | Extension point | Effort |
|---|---|---|
| Generate a new error type within Russian | New handler | Hours |
| Cover a different error taxonomy / corpus annotation | New schema | A day |
| Calibrate to a different learner population | New preset | An hour |
| Use a different morphological/parsing toolkit | New backend | A week |
| Emit a new file format | New output format | An hour |
| Support a new language entirely | New language module | Weeks |
1. Add an error handler
The most common extension. You implement the ErrorHandler protocol —
two methods (can_apply, apply) — register it, and add a weight.
class MyHandler:
    name = "my_handler"
    subtypes = ["my_subtype"]
    category = "MORPH"        # SPELL / MORPH / PUNCT / OTHER
    changes_length = False    # True if you add/delete tokens

    def can_apply(self, tokens, idx) -> bool: ...
    def apply(self, tokens, sentence, idx, modified, rng=None) -> ErrorResult | None: ...
Where to copy from: src/synterr/languages/russian/errors/spelling.py is
the simplest non-trivial example. morphological.py shows confusion-matrix
+ dep-tree integration. comma_insert.py shows length-changing handlers.
Where the contract lives: CLAUDE.md.
Step-by-step (with worked example, in Russian):
docs/CONTRIBUTING.ru.md §"Как добавить новый тип ошибки" ("How to add a new error type").
Don't forget to:
- Register the handler in src/synterr/languages/russian/errors/__init__.py
- Add a default weight in src/synterr/configs/russian/rulec.yaml
- Add a schema mapping in src/synterr/schemas/data/rlc.yaml (and rozental.yaml if applicable)
- Add tests under tests/test_languages/test_russian/
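A minimal sketch of a handler that follows the two-method protocol above. The `Token` and `ErrorResult` classes here are stand-ins with assumed fields, not synterr's real types, and the handler name and subtype are hypothetical. The sketch leaves `modified` untouched, since its exact semantics are defined in the contract rather than on this page.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Token:
    """Stand-in for an analyzed token (assumed shape, not synterr's class)."""
    text: str
    pos: str = "X"


@dataclass
class ErrorResult:
    """Stand-in for synterr's ErrorResult (field names are assumptions)."""
    original: str
    corrupted: str
    subtype: str
    idx: int


class DoubleConsonantHandler:
    """Toy spelling handler: collapse a doubled 'нн' into a single 'н'."""

    name = "double_consonant"      # hypothetical handler name
    subtypes = ["nn_reduction"]    # hypothetical subtype
    category = "SPELL"
    changes_length = False         # the token count stays the same

    def can_apply(self, tokens, idx) -> bool:
        return "нн" in tokens[idx].text

    def apply(self, tokens, sentence, idx, modified, rng=None) -> ErrorResult | None:
        if not self.can_apply(tokens, idx):
            return None
        rng = rng or random.Random()
        word = tokens[idx].text
        # Collect every position where the doubled consonant starts, pick one.
        positions = [i for i in range(len(word) - 1) if word[i:i + 2] == "нн"]
        pos = rng.choice(positions)
        corrupted = word[:pos] + word[pos + 1:]
        return ErrorResult(word, corrupted, self.subtypes[0], idx)


tokens = [Token("деревянный", "ADJ"), Token("стол", "NOUN")]
handler = DoubleConsonantHandler()
result = handler.apply(tokens, "деревянный стол", 0, modified=None, rng=random.Random(0))
print(result.corrupted)  # деревяный
```

Passing a seeded `rng` keeps the output reproducible, which is convenient for the tests the checklist asks for.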
2. Add a schema
Schemas are YAML files in src/synterr/schemas/data/. The structure:
name: my_schema
description: "What this schema represents"
primary_tags: [Tag1, Tag2, ...]   # L0 categories
modifiers: [...]                  # optional, for compound tags

# Hierarchical structure (L0 → L1 → L2)
tags:
  Tag1:
    description: "..."
    detection_category: SPELL
    paras: "§reference"
    parent: null

# Map handler subtypes → schema tags
subtype_mappings:
  my_subtype:
    primary: Tag1
    l2_tag: Tag1_specific         # optional fine-grained
Where to copy from: src/synterr/schemas/data/rlc.yaml is the simplest;
rozental.yaml is the deepest (3-level hierarchy with §-paragraph traces);
errant.yaml shows POS+operation tags for cross-lingual GEC.
Loading is automatic — drop the YAML in the right directory and it's
discoverable via synterr list-schemas.
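Before dropping a new schema into the directory, it can help to sanity-check its internal references. The sketch below works on a plain dict that mirrors the YAML structure shown above (load it with whatever YAML library you use). The two checks are assumptions about what a consistent schema looks like, not a reproduction of synterr's own loader or validation.

```python
from __future__ import annotations


def check_schema(schema: dict) -> list[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    tags = schema.get("tags", {})
    primary = schema.get("primary_tags", [])

    # Assumed rule 1: every non-null parent must itself be a defined tag.
    for name, tag in tags.items():
        parent = tag.get("parent")
        if parent is not None and parent not in tags:
            problems.append(f"tag {name!r} has unknown parent {parent!r}")

    # Assumed rule 2: every subtype mapping must point at a primary tag.
    for subtype, mapping in schema.get("subtype_mappings", {}).items():
        if mapping.get("primary") not in primary:
            problems.append(
                f"subtype {subtype!r} maps to non-primary tag {mapping.get('primary')!r}"
            )
    return problems


good = {
    "name": "my_schema",
    "primary_tags": ["Tag1"],
    "tags": {
        "Tag1": {"detection_category": "SPELL", "parent": None},
        "Tag1_specific": {"detection_category": "SPELL", "parent": "Tag1"},
    },
    "subtype_mappings": {
        "my_subtype": {"primary": "Tag1", "l2_tag": "Tag1_specific"},
    },
}

# Same schema, but the mapping points at a tag that is not declared as primary.
bad = {**good, "subtype_mappings": {"my_subtype": {"primary": "Tag9"}}}

print(check_schema(good))
print(check_schema(bad))
```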
3. Add a preset
Presets are configs in src/synterr/configs/<language>/. Format:
name: my_preset
description: "Calibrated to ..."
weights:
  spelling: 0.4
  noun_case: 0.2
  # ...
subtype_weights:
  spelling:
    vowel_reduction: 30
    keyboard: 5
confusion_matrices:
  case: {...}                     # optional, language-dependent
use_depparse: true
error_probability: 0.7
max_errors_per_sentence: 3
Where to copy from: src/synterr/configs/russian/rulec.yaml has the
full structure with documentation.
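To build intuition for how the preset fields could interact, here is one plausible reading expressed as code. This is a guess for illustration, not synterr's actual sampling logic: it assumes `error_probability` gates whether a sentence is corrupted at all, `max_errors_per_sentence` caps the number of errors, and both weight tables are relative weights. The `plan_errors` function is hypothetical.

```python
import random

# A dict mirroring the preset YAML above (only the fields used here).
preset = {
    "weights": {"spelling": 0.4, "noun_case": 0.2},
    "subtype_weights": {"spelling": {"vowel_reduction": 30, "keyboard": 5}},
    "error_probability": 0.7,
    "max_errors_per_sentence": 3,
}


def plan_errors(preset, rng):
    """Return a list of (handler, subtype_or_None) pairs for one sentence."""
    # Assumed gate: with probability 1 - error_probability, leave the sentence clean.
    if rng.random() >= preset["error_probability"]:
        return []

    n_errors = rng.randint(1, preset["max_errors_per_sentence"])
    handlers = list(preset["weights"])
    handler_weights = list(preset["weights"].values())

    plan = []
    # random.choices takes relative weights, so they need not sum to 1.
    for handler in rng.choices(handlers, weights=handler_weights, k=n_errors):
        subtypes = preset["subtype_weights"].get(handler)
        subtype = None
        if subtypes:
            subtype = rng.choices(list(subtypes), weights=list(subtypes.values()), k=1)[0]
        plan.append((handler, subtype))
    return plan


rng = random.Random(42)
for _ in range(3):
    print(plan_errors(preset, rng))
```

Under this reading, raising `vowel_reduction` from 30 to 60 doubles its odds relative to `keyboard` without affecting how often `spelling` is chosen overall.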
Calibration tip: if you have a labeled error corpus, run
scripts/extract_confusion_matrices.py (or write the equivalent for your
data) to derive matrices empirically. See
docs/research/CASE_CONFUSION_PATTERNS.md
for how the rulec preset's matrices were derived.
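If you end up writing the equivalent for your own data, the core of an empirical confusion matrix is a count-and-normalize step. The sketch below assumes your corpus can be reduced to `(intended, produced)` label pairs. It does not reproduce scripts/extract_confusion_matrices.py, and the nested-dict output shape is an assumption, since the preset format above elides the matrix contents.

```python
from collections import Counter, defaultdict


def confusion_matrix(pairs):
    """Turn (intended, produced) label pairs into row-normalized error rates.

    Pairs where the learner produced the intended label are skipped, so each
    row describes only how an intended label gets confused when an error occurs.
    """
    counts = defaultdict(Counter)
    for intended, produced in pairs:
        if intended != produced:
            counts[intended][produced] += 1

    matrix = {}
    for intended, row in counts.items():
        total = sum(row.values())
        matrix[intended] = {produced: n / total for produced, n in row.items()}
    return matrix


# Toy data: intended case vs. the case the learner actually wrote.
pairs = [
    ("Gen", "Nom"),
    ("Gen", "Nom"),
    ("Gen", "Acc"),
    ("Acc", "Nom"),
    ("Nom", "Nom"),  # correct usage, ignored
]
matrix = confusion_matrix(pairs)
print(matrix["Gen"])  # Nom gets 2/3 of the mass, Acc gets 1/3
```

With small corpora, rows built from only a handful of observations will be noisy, so it may be worth smoothing or falling back to a default matrix for rare labels.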
4. Add an NLP backend
A backend implements the Analyzer protocol — basically analyze(sentence)
→ list[AnalyzedToken]. Synterr ships three for Russian: stanza (default),
natasha, spacy.
class MyBackend:
    def __init__(self, use_depparse: bool = False): ...
    def analyze(self, sentence: str) -> list[AnalyzedToken]: ...
    def analyze_batch(self, sentences: list[str]) -> list[list[AnalyzedToken]]: ...
Where to copy from: src/synterr/languages/russian/backends/stanza_backend.py
is the canonical example. natasha_backend.py is faster but lacks dep parse.
The translation work is: for every token, populate text, lemma, pos
(UD), features dict (UD-style), and optionally dep_rel + head_idx.
Whatever your toolkit calls things, normalize to UD tags here.
Register your backend: add it to src/synterr/languages/russian/backends/__init__.py
and to the dispatch in LanguageModule.get_analyzer().
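The normalization step described above might look like the following sketch. `AnalyzedToken` here is a stand-in dataclass carrying the fields listed in the text, not synterr's real class, and `ToyBackend` with its tiny lexicon is hypothetical. The mapping tables convert OpenCorpora-style tags (the tag set pymorphy uses) to UD and cover only a handful of values.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AnalyzedToken:
    """Stand-in with the fields the text lists; the real class may differ."""
    text: str
    lemma: str
    pos: str
    features: dict = field(default_factory=dict)
    dep_rel: str | None = None
    head_idx: int | None = None


# OpenCorpora-style tags → UD (partial tables, enough for the example).
POS_TO_UD = {"NOUN": "NOUN", "ADJF": "ADJ", "VERB": "VERB", "INFN": "VERB",
             "ADVB": "ADV", "NPRO": "PRON", "PREP": "ADP"}
CASE_TO_UD = {"nomn": "Nom", "gent": "Gen", "datv": "Dat",
              "accs": "Acc", "ablt": "Ins", "loct": "Loc"}
NUMBER_TO_UD = {"sing": "Sing", "plur": "Plur"}

# Hypothetical toolkit output: word → (lemma, POS tag, grammemes).
TOY_LEXICON = {
    "кошка": ("кошка", "NOUN", {"nomn", "sing"}),
    "спит": ("спать", "VERB", {"sing"}),
}


class ToyBackend:
    def __init__(self, use_depparse: bool = False):
        self.use_depparse = use_depparse  # this toy never fills dep_rel/head_idx

    def analyze(self, sentence: str) -> list[AnalyzedToken]:
        tokens = []
        for word in sentence.split():
            lemma, pos, grammemes = TOY_LEXICON.get(word.lower(), (word, "UNKN", set()))
            features = {}
            for g in grammemes:
                if g in CASE_TO_UD:
                    features["Case"] = CASE_TO_UD[g]
                elif g in NUMBER_TO_UD:
                    features["Number"] = NUMBER_TO_UD[g]
            # Unknown toolkit tags fall back to UD's "X".
            tokens.append(AnalyzedToken(word, lemma, POS_TO_UD.get(pos, "X"), features))
        return tokens

    def analyze_batch(self, sentences: list[str]) -> list[list[AnalyzedToken]]:
        return [self.analyze(s) for s in sentences]


for tok in ToyBackend().analyze("кошка спит"):
    print(tok.text, tok.lemma, tok.pos, tok.features)
```

Keeping the tag tables as module-level dicts makes the toolkit-to-UD mapping easy to audit and test separately from the analysis call itself.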
5. Add an output format
Output formats are implemented as methods on the ErrorPipeline result objects.
The current set: gector, tsv, jsonl, chat, sft, diff. Adding one means adding
a to_<format>() method on the result class and a matching --output-format
choice in the CLI.
Where to copy from: src/synterr/core/pipeline.py (search for
to_jsonl, to_tsv, to_chat).
This is small enough that it usually lives in a single PR.
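As an illustration of the two pieces involved, here is a sketch that adds a hypothetical `csv` format. `ToyResult` and its fields are stand-ins rather than the real result class, and wiring the CLI through argparse is an assumption about how the `--output-format` option is implemented.

```python
import argparse
import csv
import io
from dataclasses import dataclass, field


@dataclass
class ToyResult:
    """Stand-in for a pipeline result; the field names are assumptions."""
    source: str
    corrupted: str
    tags: list = field(default_factory=list)

    def to_csv(self) -> str:
        """Render one CSV row: source, corrupted, semicolon-joined tags."""
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow([self.source, self.corrupted, ";".join(self.tags)])
        return buf.getvalue()


# Existing formats from the docs, plus the new hypothetical one.
FORMATS = ["gector", "tsv", "jsonl", "chat", "sft", "diff", "csv"]

parser = argparse.ArgumentParser()
parser.add_argument("--output-format", choices=FORMATS)

args = parser.parse_args(["--output-format", "csv"])
result = ToyResult("Привет, мир", "Привет мир", ["PUNCT"])

# Dispatch by naming convention: --output-format csv → result.to_csv()
rendered = getattr(result, f"to_{args.output_format}")()
print(rendered, end="")  # "Привет, мир",Привет мир,PUNCT
```

Using the `csv` module rather than string joins means fields containing commas or quotes are escaped correctly, as the quoted first column shows.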
6. Add a new language
The biggest extension. Synterr's architecture is language-agnostic; the core plus schemas don't know about Russian. To add a language you implement:
- Language module (registers the language code, dispatches to backend / handlers)
- Analyzer (tokenization + morphology, optionally dep parse)
- Inflector (turns paradigm features into surface forms)
- Error handlers (per-language; the protocol is universal)
- Optionally: language-specific schema mappings, presets, backends
Where the full procedure lives: CONTRIBUTING.md §"Adding a New
Language", which walks through the steps using a hypothetical German
module as the example.
Honest take: this is research-grade work. Russian morphological
infrastructure took months. Don't underestimate the inflection coverage
problem (pymorphy3 is exceptional; equivalent quality for other
languages requires effort). Start with a small handler set
(spelling, paronym, comma_delete) and expand.
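To make the inflection coverage problem concrete, the sketch below defines an inflector interface and measures how many inflection requests a given implementation can serve. The `Inflector` protocol, its `inflect` method name, and the table-backed German toy are all hypothetical. synterr's real interface is described in CONTRIBUTING.md and may look different.

```python
from __future__ import annotations

from typing import Protocol


class Inflector(Protocol):
    """Assumed interface: lemma + UD-style features → surface form, or None."""

    def inflect(self, lemma: str, features: dict) -> str | None: ...


class TableInflector:
    """Toy inflector backed by a hand-written paradigm table."""

    def __init__(self, table: dict):
        self._table = table

    def inflect(self, lemma: str, features: dict) -> str | None:
        # frozenset makes the lookup independent of feature order.
        return self._table.get((lemma, frozenset(features.items())))


def coverage(inflector: Inflector, requests: list) -> float:
    """Fraction of (lemma, features) requests the inflector can serve."""
    if not requests:
        return 0.0
    hits = sum(1 for lemma, feats in requests
               if inflector.inflect(lemma, feats) is not None)
    return hits / len(requests)


german = TableInflector({
    ("Hund", frozenset({"Case": "Nom", "Number": "Plur"}.items())): "Hunde",
    ("Hund", frozenset({"Case": "Dat", "Number": "Plur"}.items())): "Hunden",
})

requests = [
    ("Hund", {"Case": "Nom", "Number": "Plur"}),
    ("Hund", {"Number": "Plur", "Case": "Dat"}),
    ("Haus", {"Case": "Dat", "Number": "Plur"}),  # not in the table
]
print(german.inflect("Hund", {"Case": "Dat", "Number": "Plur"}))  # Hunden
print(coverage(german, requests))
```

Running a check like this over the lemmas and feature bundles your handlers actually request gives an early estimate of how often a morphological handler would have to give up.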
What you do not need to extend
- The pipeline core, ErrorPipeline: orchestration is generic
- The CLI, mostly: new presets/schemas/handlers are picked up automatically
- The output format dispatch: adding a new error type does not touch it; only adding a new format does
If you're tempted to fork the pipeline, ask first (Issues). Usually the right answer is a more expressive handler protocol, which we'd rather upstream than fragment.