Getting started
Install
Installing synterr also pulls in stanza and pymorphy3 (the Russian backend dependencies). For development setup, see Contributing.
Three ways to use synterr
1. Corrupt a single sentence (testing / inspection)
For errors that depend on the dependency tree (noun_case, adj_case, verb_person_number), pass --depparse:
You can target a specific subtype:
2. Generate a corpus of errors
Output formats are switched with -f:
| -f value | Format |
|---|---|
| gector (default) | GECToR token-level tags |
| tsv | parallel src\ttgt |
| jsonl | rich JSON, includes rule labels and metadata |
| chat | instruction-tuning chat format |
| sft | {src, tgt} JSONL |
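As a rough illustration of consuming the tsv format downstream, the sketch below splits each line into a source/target pair. It assumes one pair per line with the source first and exactly one separating tab, which is an interpretation of "parallel src\ttgt" rather than a documented guarantee. The helper name `read_pairs` is hypothetical and the sample line is hand-written, not real synterr output.

```python
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (src, tgt) pairs from tab-separated lines, skipping blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first tab only, so the target may itself contain tabs.
        src, sep, tgt = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated pair, got: {line!r}")
        yield src, tgt


# Hand-written illustration (assumed layout: source first, then target).
sample = ["Мама мыла рама\tМама мыла раму\n", "\n"]
pairs = list(read_pairs(sample))
```

To read a real file, pass an open handle instead of the list, for example `read_pairs(open("out.tsv", encoding="utf-8"))`, where `out.tsv` is a placeholder path.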
Example with rule-labeled JSONL:
For rule-targeted SFT generation (force-apply each LoRuGEC rule, direction-balanced, paper-style output):
```bash
uv run synterr generate-targeted -i corpus.txt -o train.jsonl \
  -n 50000 --seed 42 --balance-directions
```
This produces {"src": corrupted, "tgt": clean, "rule": rule_name}
JSONL plus a .dist.json sidecar with per-rule counts. This is the
command used to build the v4 dataset for our BEA 2026 paper.
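One way to sanity-check the generated file is to tally the `rule` field yourself and compare the totals against the sidecar by eye. The sketch below relies only on the record shape described above (`src`, `tgt`, `rule`). The helper name `count_rules` and the rule names in the sample are placeholders, and the sidecar is deliberately not parsed because its schema is not described here.

```python
import json
from collections import Counter
from typing import Iterable


def count_rules(lines: Iterable[str]) -> Counter:
    """Tally the "rule" field across JSONL records, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record["rule"]] += 1
    return counts


# Illustrative records only: the rule names are placeholders, not real synterr labels.
sample = [
    json.dumps({"src": "corrupted one", "tgt": "clean one", "rule": "rule_a"}),
    json.dumps({"src": "corrupted two", "tgt": "clean two", "rule": "rule_b"}),
    json.dumps({"src": "corrupted three", "tgt": "clean three", "rule": "rule_a"}),
]
rule_counts = count_rules(sample)
```

For the real output, pass an open file handle, for example `count_rules(open("train.jsonl", encoding="utf-8"))`. With `--balance-directions` and a large `-n`, strongly uneven counts would be worth investigating.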
3. Use the Python API
```python
from synterr.core.pipeline import ErrorPipeline, GenerationConfig
from synterr.core.registry import get_language

# Fixed seed, dependency parsing enabled, "rozental" schema.
config = GenerationConfig(seed=42, use_depparse=True, schema="rozental")
pipeline = ErrorPipeline(get_language("ru"), config)

result = pipeline.generate("Мама мыла раму")
print(result.formatted)   # GECToR tags
print(result.to_jsonl())  # rich JSON with rule labels
```
Choosing a preset
| Preset | Use when |
|---|---|
| rulec | Calibrated to RULEC-GEC L2 / heritage learner distribution |
| gera | Calibrated to GERA German-Russian learner distribution |
| balanced | Equal weights across error types |
| lorugec | Coverage-mode, designed for the LoRuGEC benchmark |
| profile_punct, profile_spelling, profile_morph | Single-category isolation, useful for ablations |
What's next
- Read the full contract (text in, tagged errors out), including corpus surveying and pool mining for rare error classes: Pipeline
- Learn the design: Architecture
- See every error type with examples: Error types
- Reproduce paper data exactly: Reproducibility