CLI reference

Every synterr command with its full help text. Run uv run synterr <command> --help to see the same text locally.

synterr

Usage: synterr [OPTIONS] COMMAND [ARGS]...

  Synterr - Reproducible error generation for GEC.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  analyze               Analyze a sentence (debug mode).
  analyze-distribution  Analyze M2 files to extract error distribution.
  audit-jsonl           Quality audit (no-ops, non-words) for a GEC SFT...
  classify-jsonl        Distribution-by-rule report for a GEC SFT JSONL.
  corrupt               Apply a specific error to a sentence.
  coverage              Show schema coverage by available handlers.
  generate              Generate synthetic errors from corpus.
  generate-bea-paper    Force-apply errors per LoRuGEC rule for SFT...
  generate-targeted     Force-apply errors per LoRuGEC rule for SFT...
  list-backends         List available NLP backends.
  list-errors           List error types for a language.
  list-languages        List available languages.
  list-presets          List available presets for a language.
  list-schemas          List available linguistic schemas.
  mine-pools            Mine per-error-class sentence pools from large...
  survey                Survey per-subtype fire rates of all handlers...

synterr analyze

Usage: synterr analyze [OPTIONS] TEXT

  Analyze a sentence (debug mode).

Options:
  -l, --lang TEXT             Language code  [required]
  -b, --backend TEXT          NLP backend (stanza, natasha, spacy)
  --depparse / --no-depparse  Enable dependency parsing
  --help                      Show this message and exit.

synterr analyze-distribution

Usage: synterr analyze-distribution [OPTIONS] M2_FILES...

  Analyze M2 files to extract error distribution.

  Accepts one or more M2 format files (e.g., RULEC-GEC.dev.m2, GERA.train.m2).
  Outputs error type frequencies and suggested synterr weights.

Options:
  -o, --output PATH  Output JSON file for weights
  --help             Show this message and exit.
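
The counting step can be pictured with a minimal Python sketch. It is not synterr's implementation: it assumes the standard M2 layout (annotation lines start with "A " and carry the error type as the second |||-separated field), and it assumes that "suggested weights" means plain relative frequencies. The function names and the sample annotations are hypothetical.

```python
from collections import Counter


def m2_error_counts(lines):
    """Count error types on M2 annotation lines, skipping 'noop' entries."""
    counts = Counter()
    for line in lines:
        if not line.startswith("A "):
            continue  # "S ..." sentence lines and blank separators
        fields = line.rstrip("\n").split("|||")
        if len(fields) < 2:
            continue  # malformed annotation line
        error_type = fields[1]
        if error_type.lower() == "noop":
            continue
        counts[error_type] += 1
    return counts


def to_weights(counts):
    """Assumption: weights are relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


# Hypothetical sample in M2 layout; the tags are illustrative only.
sample = [
    "S Малоко стоит на столи .",
    "A 0 1|||Ortho|||Молоко|||REQUIRED|||-NONE-|||0",
    "A 3 4|||Ortho|||столе|||REQUIRED|||-NONE-|||0",
    "",
    "S Мама мыла рама .",
    "A 2 3|||Gov|||раму|||REQUIRED|||-NONE-|||0",
]

counts = m2_error_counts(sample)
weights = to_weights(counts)
```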

synterr audit-jsonl

Usage: synterr audit-jsonl [OPTIONS] PATH

  Quality audit (no-ops, non-words) for a GEC SFT JSONL.

  Flags:
    no_op    — src == tgt (handler didn't actually corrupt)
    non_word — corrupted token isn't in pymorphy3's dictionary

  Run on synterr output before training to catch handler bugs at scale; run on
  third-party datasets to spot quality issues.

Options:
  --samples INTEGER  Sample size per issue type
  --no-morphology    Skip the non-word check (faster, no pymorphy3 lookup)
  --help             Show this message and exit.
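
As a rough illustration of the no_op flag, the sketch below scans JSONL lines and collects records whose src equals tgt. It assumes records use the src and tgt fields of the sft format; the function name find_no_ops is hypothetical, and the non_word check (which needs pymorphy3) is deliberately left out.

```python
import json


def find_no_ops(jsonl_lines):
    """Return records where the corrupted side equals the clean side."""
    no_ops = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("src") == record.get("tgt"):
            no_ops.append(record)
    return no_ops


# Hypothetical records for illustration.
lines = [
    json.dumps({"src": "Малоко стоит на столе.", "tgt": "Молоко стоит на столе."}),
    json.dumps({"src": "Привет мир.", "tgt": "Привет мир."}),
]
flagged = find_no_ops(lines)
```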

synterr classify-jsonl

Usage: synterr classify-jsonl [OPTIONS] PATH

  Distribution-by-rule report for a GEC SFT JSONL.

  If records contain a `rule` field, counts by rule.
  Otherwise, buckets by edit type (replace / insert / delete / multi).

  Useful for auditing third-party GEC datasets — see what's actually in there
  before training on it.

Options:
  --top INTEGER  Show top N entries
  --help         Show this message and exit.
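
One way to picture the two reporting modes is the Python sketch below. It is an approximation under stated assumptions, not the tool's code: edits are found with difflib over whitespace tokens, the rule-or-edit-type choice is made per record, and the "none" bucket for identical pairs is an addition of this sketch. The rule name is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_type(src, tgt):
    """Bucket a pair as replace / insert / delete / multi (or 'none')."""
    a, b = src.split(), tgt.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    ops = [tag for tag, *_ in matcher.get_opcodes() if tag != "equal"]
    if not ops:
        return "none"  # identical pair; bucket name is this sketch's choice
    return ops[0] if len(ops) == 1 else "multi"


def classify(records):
    """Count by `rule` when present, otherwise by edit type."""
    counts = Counter()
    for record in records:
        if "rule" in record:
            counts[record["rule"]] += 1
        else:
            counts[edit_type(record["src"], record["tgt"])] += 1
    return counts


records = [
    {"src": "Малоко стоит на столе .", "tgt": "Молоко стоит на столе ."},
    {"src": "стоит на столе .", "tgt": "Молоко стоит на столе ."},
    {"src": "Малоко стоит на столи", "tgt": "Молоко стоит на столе"},
    {"src": "a", "tgt": "b", "rule": "hypothetical_rule"},
]
report = classify(records)
```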

synterr corrupt

Usage: synterr corrupt [OPTIONS] TEXT

  Apply a specific error to a sentence.

  Tagged corruption: apply exactly one error of the specified type.

  Error specifier formats:

      spelling              - any spelling error (all subtypes)
      spelling:vowel_reduction - only vowel_reduction subtype
      Ortho --schema rlc    - all subtypes mapped to Ortho tag

  Examples:

      # Any spelling error
      synterr corrupt -l ru -e spelling "Молоко стоит на столе."

      # Only vowel reduction (phonetic)
      synterr corrupt -l ru -e spelling:vowel_reduction "Молоко стоит на столе."

      # Only typos (keyboard errors)
      synterr corrupt -l ru -e spelling:keyboard "Привет мир."

      # All Ortho-mapped subtypes (phonetic errors, no typos)
      synterr corrupt -l ru -e Ortho --schema rlc "Молоко стоит на столе."

      # Schema tag for case errors
      synterr corrupt -l ru -e Gov --schema rlc "Мама мыла раму."

Options:
  -l, --lang TEXT         Language code  [required]
  -e, --error TEXT        Error specifier: handler, handler:subtype, or schema
                          tag  [required]
  -p, --position INTEGER  Token position (0-indexed, random if omitted)
  -b, --backend TEXT      NLP backend (stanza, natasha, spacy)
  -s, --schema TEXT       Schema for tag lookup (e.g., rlc)
  --depparse              Enable dependency parsing (required for noun_case,
                          adj_case, verb_person_number — slower)
  --seed INTEGER          Random seed
  --help                  Show this message and exit.

synterr coverage

Usage: synterr coverage [OPTIONS]

  Show schema coverage by available handlers.

  Reports which schema tags are covered by the language's error handlers.

  Examples:

    synterr coverage --lang ru --schema rlc

Options:
  -l, --lang TEXT    Language code (e.g., ru)  [required]
  -s, --schema TEXT  Schema name (e.g., synterr, rlc)  [required]
  --help             Show this message and exit.

synterr generate

Usage: synterr generate [OPTIONS]

  Generate synthetic errors from corpus.

  Configuration priority:
    --config > --preset > --weights > language default

  Examples:
    synterr generate -l ru --preset rulec -i corpus.txt -o out.edits
    synterr generate -l ru --preset balanced --depparse -i in.txt -o out.jsonl -f jsonl
    synterr generate -l ru -e spelling -w '{"spelling": 0.7}' -i in.txt -o out.edits

Options:
  -l, --lang TEXT                 Language code  [required]
  -i, --input PATH                Input corpus (one sentence per line)
                                  [required]
  -o, --output PATH               Output file  [required]
  -b, --backend TEXT              NLP backend (stanza, natasha, spacy)
  -p, --preset TEXT               Use preset config (e.g., rulec, gera,
                                  balanced)
  -c, --config PATH               Custom YAML config
  --schema TEXT                   Linguistic schema (synterr, rlc, or path to
                                  YAML)
  -e, --errors TEXT               Comma-separated error types (default: all)
  -w, --weights TEXT              JSON weights dict, e.g., '{"spelling": 0.5}'
  -s, --seed INTEGER              Random seed
  -n, --max-sentences INTEGER     Maximum sentences to process
  --label-format [original|binary|multiclass]
                                  Output label format
  --error-prob FLOAT              Probability of introducing errors (0-1)
  --depparse / --no-depparse      Enable dependency parsing
  --batch-size INTEGER            Batch size for processing
  -f, --output-format [gector|tsv|jsonl|chat|sft]
                                  Output format: gector (token tags), tsv
                                  (src\ttgt), jsonl (rich JSON), chat
                                  (instruction-tuning), sft ({src,tgt} JSONL)
  --system-prompt TEXT            System prompt for chat format (default:
                                  built-in GEC prompt)
  --help                          Show this message and exit.
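
The configuration priority listed above can be read as a first-match lookup. The sketch below only shows which source would win; whether synterr merges lower-priority values into the winner is not stated here, so that part is left out. The function name is hypothetical.

```python
def pick_config_source(config=None, preset=None, weights=None):
    """Return the winning source: --config > --preset > --weights > default."""
    if config is not None:
        return "config"
    if preset is not None:
        return "preset"
    if weights is not None:
        return "weights"
    return "language default"


# A preset wins over inline weights when both are given.
winner = pick_config_source(preset="rulec", weights='{"spelling": 0.7}')
```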

synterr generate-bea-paper

Usage: synterr generate-bea-paper [OPTIONS]

  Force-apply errors per LoRuGEC rule for SFT training.

  Generates {"src": corrupted, "tgt": clean, "rule": rule_name} JSONL.
  Targets 48 LoRuGEC evaluation rules with bidirectional split/merge.
  Saves a .dist.json sidecar with per-rule counts.

Options:
  -l, --lang TEXT                 Language code
  -i, --input PATH                Input sentences  [required]
  -o, --output PATH               Output JSONL  [required]
  -n, --total INTEGER             Target total examples
  --seed INTEGER                  Random seed
  --depparse / --no-depparse      Enable dep parsing
  --max-input INTEGER             Max input sentences to read
  --batch-size INTEGER            Stanza analysis batch size
  --balance-directions / --no-balance-directions
                                  Cap split/merge pairs to min(split, merge)
  --help                          Show this message and exit.

synterr generate-targeted

Usage: synterr generate-targeted [OPTIONS]

  Force-apply errors per LoRuGEC rule for SFT training.

  Generates {"src": corrupted, "tgt": clean, "rule": rule_name} JSONL.
  Targets 48 LoRuGEC evaluation rules with bidirectional split/merge.
  Saves a .dist.json sidecar with per-rule counts.

Options:
  -l, --lang TEXT                 Language code
  -i, --input PATH                Input sentences  [required]
  -o, --output PATH               Output JSONL  [required]
  -n, --total INTEGER             Target total examples
  --seed INTEGER                  Random seed
  --depparse / --no-depparse      Enable dep parsing
  --max-input INTEGER             Max input sentences to read
  --batch-size INTEGER            Stanza analysis batch size
  --balance-directions / --no-balance-directions
                                  Cap split/merge pairs to min(split, merge)
  --help                          Show this message and exit.
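
The effect of --balance-directions can be sketched as follows, assuming that "cap to min(split, merge)" means both directions are reduced to the size of the smaller one. How synterr picks which examples to keep is not documented here; the seeded random sample and the function name are assumptions of this sketch.

```python
import random


def balance_directions(split_examples, merge_examples, seed=0):
    """Downsample both directions to min(len(split), len(merge))."""
    rng = random.Random(seed)
    cap = min(len(split_examples), len(merge_examples))
    return rng.sample(split_examples, cap), rng.sample(merge_examples, cap)


# Hypothetical example identifiers.
splits = ["s1", "s2", "s3", "s4", "s5"]
merges = ["m1", "m2"]
kept_splits, kept_merges = balance_directions(splits, merges, seed=42)
```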

synterr list-backends

Usage: synterr list-backends [OPTIONS]

  List available NLP backends.

Options:
  -l, --lang TEXT  Language code (default: ru)
  --help           Show this message and exit.

synterr list-errors

Usage: synterr list-errors [OPTIONS]

  List error types for a language.

Options:
  -l, --lang TEXT    Language code (e.g., ru)  [required]
  -p, --preset TEXT  Show weights from preset (default: language default)
  --help             Show this message and exit.

synterr list-languages

Usage: synterr list-languages [OPTIONS]

  List available languages.

Options:
  --help  Show this message and exit.

synterr list-presets

Usage: synterr list-presets [OPTIONS]

  List available presets for a language.

Options:
  -l, --lang TEXT  Language code (e.g., ru)  [required]
  --help           Show this message and exit.

synterr list-schemas

Usage: synterr list-schemas [OPTIONS]

  List available linguistic schemas.

Options:
  --help  Show this message and exit.

synterr mine-pools

Usage: synterr mine-pools [OPTIONS]

  Mine per-error-class sentence pools from large text sources.

  Sweeps the sources with surface patterns (derived from the live
  handler lexicons where possible) and reservoir-samples up to CAP
  candidate sentences per class into OUTDIR/<class>.txt. Candidates
  are recall-oriented: the handler's can_apply does the precise
  filtering at generation time.

Options:
  -s, --source FILE       Text source (one sentence per line); repeatable
                          [required]
  -o, --outdir DIRECTORY  Pool output directory
  --cap INTEGER           Max sentences per class
  --seed INTEGER
  --help                  Show this message and exit.
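
Reservoir sampling keeps a uniform random sample of fixed size from a stream of unknown length, which is why it suits a single sweep over large sources. The sketch below is the textbook Algorithm R, shown for orientation only; it makes no claim about how mine-pools implements it, and the function name is hypothetical.

```python
import random


def reservoir_sample(items, cap, seed=0):
    """Keep up to `cap` items from an iterable, each with equal probability."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < cap:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < cap:
                reservoir[j] = item  # replace with probability cap / (i + 1)
    return reservoir


sentences = [f"sentence {n}" for n in range(1000)]
pool = reservoir_sample(sentences, cap=10, seed=1)
```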

synterr survey

Usage: synterr survey [OPTIONS]

  Survey per-subtype fire rates of all handlers over a corpus.

  Reports emissions per 1k sentences for every error subtype, plus
  two actionable lists: STARVING (below threshold) and NEVER FIRED.
  Feed those to `synterr mine-pools` to build targeted source pools.

Options:
  -l, --lang TEXT         Language code
  -i, --input FILE        Text file, one sentence per line  [required]
  -n, --limit INTEGER     Max sentences
  --tries INTEGER         apply() attempts per applicable token
  --starving-below FLOAT  Flag subtypes below this many emissions per 1k
                          sentences
  -o, --output PATH       JSON report path
  --seed INTEGER
  --help                  Show this message and exit.
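
The reported rate is a simple normalization: emissions divided by sentences, times 1000. The sketch below shows that arithmetic and one plausible way to derive the two lists. It assumes that subtypes with zero emissions appear only under NEVER FIRED and not also under STARVING; the function name and the counts are hypothetical.

```python
def fire_rates(emissions, n_sentences, starving_below):
    """Return (rates per 1k sentences, starving subtypes, never-fired subtypes)."""
    rates = {s: 1000.0 * c / n_sentences for s, c in emissions.items()}
    never_fired = sorted(s for s, c in emissions.items() if c == 0)
    starving = sorted(s for s, r in rates.items() if 0 < r < starving_below)
    return rates, starving, never_fired


# Hypothetical counts over a 2000-sentence corpus.
emissions = {
    "spelling:keyboard": 120,
    "spelling:vowel_reduction": 3,
    "noun_case": 0,
}
rates, starving, never_fired = fire_rates(emissions, 2000, starving_below=5.0)
```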