CLI reference
Every synterr command with its full help text. Run `uv run synterr <command> --help` to see the same text locally.
synterr
Usage: synterr [OPTIONS] COMMAND [ARGS]...
Synterr - Reproducible error generation for GEC.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
analyze Analyze a sentence (debug mode).
analyze-distribution Analyze M2 files to extract error distribution.
audit-jsonl Quality audit (no-ops, non-words) for a GEC SFT...
classify-jsonl Distribution-by-rule report for a GEC SFT JSONL.
corrupt Apply a specific error to a sentence.
coverage Show schema coverage by available handlers.
generate Generate synthetic errors from corpus.
generate-bea-paper Force-apply errors per LoRuGEC rule for SFT...
generate-targeted Force-apply errors per LoRuGEC rule for SFT...
list-backends List available NLP backends.
list-errors List error types for a language.
list-languages List available languages.
list-presets List available presets for a language.
list-schemas List available linguistic schemas.
mine-pools Mine per-error-class sentence pools from large...
survey Survey per-subtype fire rates of all handlers...
synterr analyze
Usage: synterr analyze [OPTIONS] TEXT
Analyze a sentence (debug mode).
Options:
-l, --lang TEXT Language code [required]
-b, --backend TEXT NLP backend (stanza, natasha, spacy)
--depparse / --no-depparse Enable dependency parsing
--help Show this message and exit.
synterr analyze-distribution
Usage: synterr analyze-distribution [OPTIONS] M2_FILES...
Analyze M2 files to extract error distribution.
Accepts one or more M2 format files (e.g., RULEC-GEC.dev.m2, GERA.train.m2).
Outputs error type frequencies and suggested synterr weights.
Options:
-o, --output PATH Output JSON file for weights
--help Show this message and exit.
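As a rough illustration of what extracting an error distribution from M2 data involves, the sketch below counts the error type of each annotation line and normalizes the counts to relative frequencies. It is not synterr's implementation: the function name `m2_error_distribution` is hypothetical, the tags in the sample are illustrative, and plain normalization is only an assumption about how suggested weights might be derived.

```python
from collections import Counter


def m2_error_distribution(lines):
    """Count error types in M2 annotation lines and normalize them.

    Hypothetical helper: assumes the standard M2 layout
    'A start end|||type|||correction|||...' for annotation lines.
    """
    counts = Counter()
    for line in lines:
        if not line.startswith("A "):
            continue  # skip source sentences ("S ...") and blank separators
        fields = line[2:].split("|||")
        if len(fields) < 2:
            continue  # malformed annotation line
        error_type = fields[1]
        if error_type == "noop":
            continue  # "no correction needed" marker, not an error
        counts[error_type] += 1
    total = sum(counts.values())
    # Assumption: weights are plain relative frequencies.
    weights = {t: c / total for t, c in counts.items()} if total else {}
    return counts, weights


sample = [
    "S Молоко стаит на столе .",
    "A 1 2|||Ortho|||стоит|||REQUIRED|||-NONE-|||0",
    "",
    "S Мама мыла рама .",
    "A 2 3|||Gov|||раму|||REQUIRED|||-NONE-|||0",
    "",
    "S Привет мир .",
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0",
]
counts, weights = m2_error_distribution(sample)
print(dict(counts), weights)
```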
synterr audit-jsonl
Usage: synterr audit-jsonl [OPTIONS] PATH
Quality audit (no-ops, non-words) for a GEC SFT JSONL.
Flags:
no_op — src == tgt (handler didn't actually corrupt)
non_word — corrupted token isn't in pymorphy3's dictionary
Run on synterr output before training to catch handler bugs at scale; run on
third-party datasets to spot quality issues.
Options:
--samples INTEGER Sample size per issue type
--no-morphology Skip the non-word check (faster, no pymorphy3 lookup)
--help Show this message and exit.
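The `no_op` flag is simple enough to sketch by hand. The snippet below is a minimal stand-in, assuming each JSONL record has `src` and `tgt` string fields; the helper name `find_no_ops` is hypothetical and the non-word check (which needs a pymorphy3 dictionary lookup) is left out.

```python
import json


def find_no_ops(jsonl_lines):
    """Return 1-based line numbers of records where src equals tgt.

    Hypothetical helper; assumes every record has 'src' and 'tgt' fields.
    """
    no_ops = []
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["src"] == record["tgt"]:
            no_ops.append(line_number)
    return no_ops


records = [
    json.dumps({"src": "Молоко стаит на столе.", "tgt": "Молоко стоит на столе."}),
    json.dumps({"src": "Привет мир.", "tgt": "Привет мир."}),
]
print(find_no_ops(records))
```

A record flagged this way means the handler reported a corruption but produced an unchanged sentence.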
synterr classify-jsonl
Usage: synterr classify-jsonl [OPTIONS] PATH
Distribution-by-rule report for a GEC SFT JSONL.
If records contain a `rule` field, counts by rule.
Otherwise, buckets by edit type (replace / insert / delete / multi).
Useful for auditing third-party GEC datasets — see what's actually in there
before training on it.
Options:
--top INTEGER Show top N entries
--help Show this message and exit.
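For records without a `rule` field, bucketing by edit type could look roughly like the sketch below. It aligns whitespace tokens with the standard library's `difflib` and is only an approximation under stated assumptions: the function name `edit_bucket` is hypothetical, the tool's actual alignment is not documented here, and the `"none"` bucket for identical pairs is an addition of this example.

```python
from difflib import SequenceMatcher


def edit_bucket(src, tgt):
    """Classify a (src, tgt) pair as replace / insert / delete / multi.

    Hypothetical helper: token-level alignment with difflib, describing
    the edits needed to turn src into tgt.
    """
    src_tokens, tgt_tokens = src.split(), tgt.split()
    matcher = SequenceMatcher(None, src_tokens, tgt_tokens, autojunk=False)
    # Keep only the opcodes that change something.
    ops = [tag for tag, *_ in matcher.get_opcodes() if tag != "equal"]
    if not ops:
        return "none"  # assumption: identical pairs get their own bucket
    if len(ops) > 1:
        return "multi"
    return ops[0]


print(edit_bucket("Молоко стаит на столе .", "Молоко стоит на столе ."))
print(edit_bucket("Мама мыла раму", "Мама мыла раму ."))
```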
synterr corrupt
Usage: synterr corrupt [OPTIONS] TEXT
Apply a specific error to a sentence.
Tagged corruption: apply exactly one error of the specified type.
Error specifier formats:
spelling - any spelling error (all subtypes)
spelling:vowel_reduction - only vowel_reduction subtype
Ortho --schema rlc - all subtypes mapped to Ortho tag
Examples:
# Any spelling error
synterr corrupt -l ru -e spelling "Молоко стоит на столе."
# Only vowel reduction (phonetic)
synterr corrupt -l ru -e spelling:vowel_reduction "Молоко стоит на столе."
# Only typos (keyboard errors)
synterr corrupt -l ru -e spelling:keyboard "Привет мир."
# All Ortho-mapped subtypes (phonetic errors, no typos)
synterr corrupt -l ru -e Ortho --schema rlc "Молоко стоит на столе."
# Schema tag for case errors
synterr corrupt -l ru -e Gov --schema rlc "Мама мыла раму."
Options:
-l, --lang TEXT Language code [required]
-e, --error TEXT Error specifier: handler, handler:subtype, or schema
tag [required]
-p, --position INTEGER Token position (0-indexed, random if omitted)
-b, --backend TEXT NLP backend (stanza, natasha, spacy)
-s, --schema TEXT Schema for tag lookup (e.g., rlc)
--depparse Enable dependency parsing (required for noun_case,
adj_case, verb_person_number — slower)
--seed INTEGER Random seed
--help Show this message and exit.
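The three error specifier forms can be told apart with very little logic. The sketch below is one possible reading, not synterr's parser: `ErrorSpec` and `parse_error_spec` are hypothetical names, and the rule that a specifier is treated as a schema tag whenever a schema is given is an assumption based on the examples above.

```python
from typing import NamedTuple, Optional


class ErrorSpec(NamedTuple):
    kind: str               # "handler" or "schema_tag"
    name: str               # handler name or schema tag
    subtype: Optional[str]  # only set for "handler:subtype"


def parse_error_spec(spec, schema=None):
    """Split an error specifier into its parts (hypothetical helper)."""
    if schema is not None:
        # Assumption: with a schema, the specifier is looked up as a tag.
        return ErrorSpec("schema_tag", spec, None)
    handler, sep, subtype = spec.partition(":")
    return ErrorSpec("handler", handler, subtype if sep else None)


print(parse_error_spec("spelling"))
print(parse_error_spec("spelling:vowel_reduction"))
print(parse_error_spec("Ortho", schema="rlc"))
```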
synterr coverage
Usage: synterr coverage [OPTIONS]
Show schema coverage by available handlers.
Reports which schema tags are covered by the language's error handlers.
Examples:
synterr coverage --lang ru --schema rlc
Options:
-l, --lang TEXT Language code (e.g., ru) [required]
-s, --schema TEXT Schema name (e.g., synterr, rlc) [required]
--help Show this message and exit.
synterr generate
Usage: synterr generate [OPTIONS]
Generate synthetic errors from corpus.
Configuration priority:
--config > --preset > --weights > language default
Examples:
synterr generate -l ru --preset rulec -i corpus.txt -o out.edits
synterr generate -l ru --preset balanced --depparse -i in.txt -o out.jsonl -f jsonl
synterr generate -l ru -e spelling -w '{"spelling": 0.7}' -i in.txt -o out.edits
Options:
-l, --lang TEXT Language code [required]
-i, --input PATH Input corpus (one sentence per line)
[required]
-o, --output PATH Output file [required]
-b, --backend TEXT NLP backend (stanza, natasha, spacy)
-p, --preset TEXT Use preset config (e.g., rulec, gera,
balanced)
-c, --config PATH Custom YAML config
--schema TEXT Linguistic schema (synterr, rlc, or path to
YAML)
-e, --errors TEXT Comma-separated error types (default: all)
-w, --weights TEXT JSON weights dict, e.g., '{"spelling": 0.5}'
-s, --seed INTEGER Random seed
-n, --max-sentences INTEGER Maximum sentences to process
--label-format [original|binary|multiclass]
Output label format
--error-prob FLOAT Probability of introducing errors (0-1)
--depparse / --no-depparse Enable dependency parsing
--batch-size INTEGER Batch size for processing
-f, --output-format [gector|tsv|jsonl|chat|sft]
Output format: gector (token tags), tsv
(src\ttgt), jsonl (rich JSON), chat
(instruction-tuning), sft ({src,tgt} JSONL)
--system-prompt TEXT System prompt for chat format (default:
built-in GEC prompt)
--help Show this message and exit.
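The configuration priority stated above (`--config` > `--preset` > `--weights` > language default) amounts to a first-match lookup. The sketch below shows that ordering only; `resolve_weights` and its parameters are hypothetical names, and how synterr actually loads configs and presets is not covered.

```python
import json


def resolve_weights(config=None, preset=None, weights_json=None,
                    language_default=None):
    """Pick the winning configuration source by priority.

    Hypothetical helper: returns (source_name, value) for the first
    option that was provided, in the documented priority order.
    """
    if config is not None:
        return ("config", config)
    if preset is not None:
        return ("preset", preset)
    if weights_json is not None:
        # --weights is passed as a JSON dict, e.g. '{"spelling": 0.5}'
        return ("weights", json.loads(weights_json))
    return ("default", language_default or {})


print(resolve_weights(preset="rulec", weights_json='{"spelling": 0.7}'))
print(resolve_weights(weights_json='{"spelling": 0.7}'))
```

In the first call the preset wins even though weights were also supplied.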
synterr generate-bea-paper
Usage: synterr generate-bea-paper [OPTIONS]
Force-apply errors per LoRuGEC rule for SFT training.
Generates {"src": corrupted, "tgt": clean, "rule": rule_name} JSONL.
Targets 48 LoRuGEC evaluation rules with bidirectional split/merge.
Saves a .dist.json sidecar with per-rule counts.
Options:
-l, --lang TEXT Language code
-i, --input PATH Input sentences [required]
-o, --output PATH Output JSONL [required]
-n, --total INTEGER Target total examples
--seed INTEGER Random seed
--depparse / --no-depparse Enable dep parsing
--max-input INTEGER Max input sentences to read
--batch-size INTEGER Stanza analysis batch size
--balance-directions / --no-balance-directions
Cap split/merge pairs to min(split, merge)
--help Show this message and exit.
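Capping split/merge pairs to `min(split, merge)` can be pictured as downsampling the larger side. The sketch below is an assumption-heavy illustration: `balance_directions` is a hypothetical name, the two directions are modeled as plain lists, and whether the tool balances per rule pair or globally is not stated in the help text.

```python
import random


def balance_directions(split_examples, merge_examples, seed=0):
    """Downsample both lists to the size of the smaller one.

    Hypothetical helper; uses a seeded RNG so the result is reproducible.
    """
    rng = random.Random(seed)
    cap = min(len(split_examples), len(merge_examples))
    return rng.sample(split_examples, cap), rng.sample(merge_examples, cap)


split = [f"split-{i}" for i in range(10)]
merge = [f"merge-{i}" for i in range(4)]
balanced_split, balanced_merge = balance_directions(split, merge, seed=42)
print(len(balanced_split), len(balanced_merge))
```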
synterr generate-targeted
Usage: synterr generate-targeted [OPTIONS]
Force-apply errors per LoRuGEC rule for SFT training.
Generates {"src": corrupted, "tgt": clean, "rule": rule_name} JSONL.
Targets 48 LoRuGEC evaluation rules with bidirectional split/merge.
Saves a .dist.json sidecar with per-rule counts.
Options:
-l, --lang TEXT Language code
-i, --input PATH Input sentences [required]
-o, --output PATH Output JSONL [required]
-n, --total INTEGER Target total examples
--seed INTEGER Random seed
--depparse / --no-depparse Enable dep parsing
--max-input INTEGER Max input sentences to read
--batch-size INTEGER Stanza analysis batch size
--balance-directions / --no-balance-directions
Cap split/merge pairs to min(split, merge)
--help Show this message and exit.
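Per-rule counts like those saved in the sidecar can be recomputed from the output JSONL itself. The sketch below assumes the documented record shape `{"src": ..., "tgt": ..., "rule": ...}`; the helper name `rule_counts` and the rule names in the sample are hypothetical, and the sidecar's exact layout is not specified here.

```python
import json
from collections import Counter


def rule_counts(jsonl_lines):
    """Count records per 'rule' field (hypothetical helper)."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["rule"]] += 1
    return dict(counts)


records = [
    json.dumps({"src": "x", "tgt": "y", "rule": "rule_a"}),
    json.dumps({"src": "p", "tgt": "q", "rule": "rule_b"}),
    json.dumps({"src": "m", "tgt": "n", "rule": "rule_a"}),
]
print(rule_counts(records))
```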
synterr list-backends
Usage: synterr list-backends [OPTIONS]
List available NLP backends.
Options:
-l, --lang TEXT Language code (default: ru)
--help Show this message and exit.
synterr list-errors
Usage: synterr list-errors [OPTIONS]
List error types for a language.
Options:
-l, --lang TEXT Language code (e.g., ru) [required]
-p, --preset TEXT Show weights from preset (default: language default)
--help Show this message and exit.
synterr list-languages
Usage: synterr list-languages [OPTIONS]
List available languages.
Options:
--help Show this message and exit.
synterr list-presets
Usage: synterr list-presets [OPTIONS]
List available presets for a language.
Options:
-l, --lang TEXT Language code (e.g., ru) [required]
--help Show this message and exit.
synterr list-schemas
Usage: synterr list-schemas [OPTIONS]
List available linguistic schemas.
Options:
--help Show this message and exit.
synterr mine-pools
Usage: synterr mine-pools [OPTIONS]
Mine per-error-class sentence pools from large text sources.
Sweeps the sources with surface patterns (derived from the live
handler lexicons where possible) and reservoir-samples up to CAP
candidate sentences per class into OUTDIR/<class>.txt. Candidates
are recall-oriented: the handler's can_apply does the precise
filtering at generation time.
Options:
-s, --source FILE Text source (one sentence per line); repeatable
[required]
-o, --outdir DIRECTORY Pool output directory
--cap INTEGER Max sentences per class
--seed INTEGER
--help Show this message and exit.
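Reservoir sampling is what lets a pool be capped without knowing the source size in advance. The sketch below is the textbook Algorithm R, shown for orientation only; `reservoir_sample` is a hypothetical name and synterr's own sampler may differ in detail.

```python
import random


def reservoir_sample(items, cap, seed=0):
    """Keep a uniform random sample of at most `cap` items from a stream.

    Textbook Algorithm R (hypothetical helper): single pass, O(cap) memory.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(items):
        if index < cap:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Replace a slot with probability cap / (index + 1).
            j = rng.randint(0, index)  # both ends inclusive
            if j < cap:
                reservoir[j] = item
    return reservoir


sentences = (f"sentence {i}" for i in range(10_000))
pool = reservoir_sample(sentences, cap=5, seed=42)
print(len(pool))
```

Because the input is consumed as a stream, the same function works on a generator over a very large file.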
synterr survey
Usage: synterr survey [OPTIONS]
Survey per-subtype fire rates of all handlers over a corpus.
Reports emissions per 1k sentences for every error subtype, plus
two actionable lists: STARVING (below threshold) and NEVER FIRED.
Feed those to `synterr mine-pools` to build targeted source pools.
Options:
-l, --lang TEXT Language code
-i, --input FILE Text file, one sentence per line [required]
-n, --limit INTEGER Max sentences
--tries INTEGER apply() attempts per applicable token
--starving-below FLOAT Flag subtypes below this many emissions per 1k
sentences
-o, --output PATH JSON report path
--seed INTEGER
--help Show this message and exit.
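The report's two lists follow from a per-1k rate and a threshold. The sketch below shows that arithmetic under assumptions: `survey_report` is a hypothetical name, the subtype names in the sample are illustrative, and treating STARVING as excluding subtypes that never fired is this example's choice rather than documented behavior.

```python
def survey_report(emissions, sentence_count, starving_below):
    """Compute emissions per 1k sentences plus starving / never-fired lists.

    Hypothetical helper: `emissions` maps subtype name to emission count.
    """
    if sentence_count <= 0:
        raise ValueError("sentence_count must be positive")
    rates = {s: 1000.0 * c / sentence_count for s, c in emissions.items()}
    never_fired = sorted(s for s, c in emissions.items() if c == 0)
    # Assumption: subtypes with zero emissions are reported separately.
    starving = sorted(s for s, r in rates.items() if 0 < r < starving_below)
    return rates, starving, never_fired


emissions = {"spelling:keyboard": 50, "spelling:vowel_reduction": 2, "noun_case": 0}
rates, starving, never_fired = survey_report(emissions, 2000, starving_below=5.0)
print(rates)
print("STARVING:", starving)
print("NEVER FIRED:", never_fired)
```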