Getting started

Install

pip install "synterr[russian] @ git+https://github.com/synterr-nlp/synterr"

This installs synterr along with stanza and pymorphy3, the dependencies of the Russian backend. For a development setup, see Contributing.

Three ways to use synterr

1. Corrupt a single sentence (testing / inspection)

uv run synterr corrupt -l ru -e spelling "Молоко стоит на столе."

Error types that depend on the dependency parse (noun_case, adj_case, verb_person_number) require the --depparse flag:

uv run synterr corrupt -l ru -e noun_case --depparse \
    "Книга лежит на столе."

You can target a specific subtype:

uv run synterr corrupt -l ru -e spelling:vowel_reduction "Молоко стоит на столе."

2. Generate a corpus of errors

uv run synterr generate -l ru --preset rulec \
    -i clean.txt -o train.edits

Select the output format with -f:

`-f` value	Format
`gector` (default)	GECToR token-level tags
`tsv`	parallel `src\ttgt`
`jsonl`	rich JSON, includes rule labels and metadata
`chat`	instruction-tuning chat format
`sft`	`{src, tgt}` JSONL
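
As a rough illustration of consuming the `tsv` format, the sketch below reads `src<TAB>tgt` pairs into Python tuples. It assumes one pair per line, UTF-8 encoding, and no quoting or escaping of tabs inside sentences; none of that is confirmed here. The helper name `read_pairs`, the demo file, and the demo sentence pair are hypothetical, and the demo treats `src` as the corrupted side by analogy with the generate-targeted output.

```python
import tempfile
from pathlib import Path


def read_pairs(path):
    """Yield (src, tgt) tuples from a tab-separated file, one pair per line."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            parts = line.split("\t")
            if len(parts) != 2:
                # Assumption: exactly two columns; anything else is reported, not guessed at.
                raise ValueError(
                    f"line {line_number}: expected 2 columns, got {len(parts)}"
                )
            yield parts[0], parts[1]


# Demo on a throwaway file; the content is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "train.tsv"
    demo.write_text(
        "Малоко стоит на столе.\tМолоко стоит на столе.\n", encoding="utf-8"
    )
    pairs = list(read_pairs(demo))

print(pairs)
```

Failing loudly on a malformed line is a deliberate choice in this sketch: a silently skipped pair would shrink the training set without any trace.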

Example with rule-labeled JSONL:

uv run synterr generate -l ru --preset rulec --depparse \
    -i clean.txt -o train.jsonl -f jsonl

For rule-targeted SFT generation, use generate-targeted, which force-applies each LoRuGEC rule, balances directions, and writes paper-style output:

uv run synterr generate-targeted -i corpus.txt -o train.jsonl \
    -n 50000 --seed 42 --balance-directions

This produces {"src": corrupted, "tgt": clean, "rule": rule_name} JSONL plus a .dist.json sidecar with per-rule counts. This is the command used to build the v4 dataset for our BEA 2026 paper.
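
To sanity-check a generated file, the per-rule distribution can be recomputed from the JSONL itself. The sketch below assumes only the record shape stated above (`src`, `tgt`, `rule`, one JSON object per line); the helper name `count_rules` and the rule names in the demo are placeholders, and the layout of the .dist.json sidecar is not assumed.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_rules(path):
    """Count records per rule label in a {"src", "tgt", "rule"} JSONL file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # ignore blank lines
                counts[json.loads(line)["rule"]] += 1
    return counts


# Demo records; "rule_x" and "rule_y" are placeholders, not real synterr rule names.
records = [
    {"src": "corrupted 1", "tgt": "clean 1", "rule": "rule_x"},
    {"src": "corrupted 2", "tgt": "clean 2", "rule": "rule_y"},
    {"src": "corrupted 3", "tgt": "clean 3", "rule": "rule_x"},
]

with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "train.jsonl"
    demo.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n",
        encoding="utf-8",
    )
    counts = count_rules(demo)

for rule, n in counts.most_common():
    print(rule, n)
```

The resulting counts can then be compared by eye against the sidecar's per-rule counts.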

3. Use the Python API

from synterr.core.pipeline import ErrorPipeline, GenerationConfig
from synterr.core.registry import get_language

config = GenerationConfig(seed=42, use_depparse=True, schema="rozental")
pipeline = ErrorPipeline(get_language("ru"), config)

result = pipeline.generate("Мама мыла раму")
print(result.formatted)   # GECToR tags
print(result.to_jsonl())  # rich JSON with rule labels
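
For more than one sentence, the same two calls can be wrapped in a small loop. This is a minimal sketch, not part of synterr: `corrupt_sentences` is a hypothetical helper, it assumes `to_jsonl()` returns a string holding one JSON line, and the `_EchoPipeline` stand-in exists only so the snippet runs without synterr installed.

```python
import json


def corrupt_sentences(pipeline, sentences):
    """Yield one JSON line per non-empty sentence via generate() and to_jsonl()."""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty or whitespace-only input
        yield pipeline.generate(sentence).to_jsonl()


class _EchoResult:
    """Stand-in result object; NOT synterr's result type."""

    def __init__(self, text):
        self.text = text

    def to_jsonl(self):
        return json.dumps({"src": self.text}, ensure_ascii=False)


class _EchoPipeline:
    """Stand-in pipeline that echoes its input; replace with a real ErrorPipeline."""

    def generate(self, text):
        return _EchoResult(text)


lines = list(
    corrupt_sentences(
        _EchoPipeline(), ["Мама мыла раму", "   ", "Книга лежит на столе."]
    )
)
print("\n".join(lines))
```

With synterr installed, pass the `pipeline` object built in the snippet above instead of the stand-in.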

Choosing a preset

Preset	Description
`rulec`	Calibrated to RULEC-GEC L2 / heritage learner distribution
`gera`	Calibrated to GERA German-Russian learner distribution
`balanced`	Equal weights across error types
`lorugec`	Coverage-mode, designed for the LoRuGEC benchmark
`profile_punct`, `profile_spelling`, `profile_morph`	Single-category isolation, useful for ablations

To list the presets available for a language:

uv run synterr list-presets -l ru

What's next

Understand the full contract (text in, tagged errors out), including corpus surveying and pool mining for rare error classes: Pipeline
Learn the design: Architecture
See every error type with examples: Error types
Reproduce paper data exactly: Reproducibility