Output formats

Synterr emits corrupted text in several formats. On the CLI, select one with the -f/--output-format flag of synterr generate.

CLI formats

Flag     When to use
gector   GECToR token-level edit tags (default). Compatible with the GECToR training format.
tsv      Parallel src\ttgt lines. Compatible with most seq2seq pipelines.
jsonl    Rich JSON per line: src, tgt, errors[] with type/category/schema_tag/schema_l2_tag.
chat     Instruction-tuning chat format (messages: [...]). For QLoRA / SFT fine-tuning of chat LLMs.
sft      Minimal {src, tgt} JSONL. Compatible with standard SFT trainers.
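
As a rough illustration of consuming the tsv and jsonl formats, the sketch below parses one line of each using only the standard library. The sample sentences and the error field values are hypothetical placeholders, not real Synterr output; only the field names (src, tgt, errors, type, category, schema_tag, schema_l2_tag) come from the table above. The helper names parse_tsv_line and error_tags are made up for this example.

```python
import json


def parse_tsv_line(line):
    """Split one tsv-format line into (src, tgt) at the first tab."""
    src, tgt = line.rstrip("\n").split("\t", 1)
    return src, tgt


def error_tags(record):
    """Return (type, schema_tag) pairs from a jsonl-format record."""
    return [(e["type"], e["schema_tag"]) for e in record.get("errors", [])]


# Hypothetical sample data; real output will differ.
sample_tsv = "She go to school .\tShe goes to school .\n"
sample_jsonl = json.dumps({
    "src": "She go to school .",
    "tgt": "She goes to school .",
    "errors": [{
        "type": "example_type",            # placeholder value
        "category": "example_category",    # placeholder value
        "schema_tag": "EXAMPLE_TAG",       # placeholder value
        "schema_l2_tag": "EXAMPLE_L2_TAG", # placeholder value
    }],
})

src, tgt = parse_tsv_line(sample_tsv)
record = json.loads(sample_jsonl)
print(src, "->", tgt)
print(error_tags(record))
```

Splitting at the first tab only keeps the parser from failing if the target side ever contains a tab; whether that can happen in practice is not stated here.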

Python API

Beyond the CLI, the GeneratedResult object returned by pipeline.generate() exposes:

  • result.formatted — GECToR token-level tags (string).
  • result.to_tsv() — parallel src/tgt.
  • result.to_jsonl() — rich per-record JSON.
  • result.to_chat() — instruction-tuning chat format.
  • result.to_diff() — human-readable inline diff (CLI-unexposed).
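
One way to use these accessors together is to write every format to disk in a single pass. The sketch below is a minimal, unofficial helper: it assumes each to_*() method returns a string (only result.formatted is documented as a string above), and it is demonstrated with a hypothetical stand-in class rather than a real GeneratedResult, since constructing a pipeline is not covered in this section. The names export_formats and _StubResult, and the output file names, are invented for illustration.

```python
import tempfile
from pathlib import Path


def export_formats(result, out_dir):
    """Write each documented output format of `result` to its own file.

    Assumes every accessor yields a str; adjust if the real API differs.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    payloads = {
        "gector.txt": result.formatted,
        "pairs.tsv": result.to_tsv(),
        "records.jsonl": result.to_jsonl(),
        "chat.jsonl": result.to_chat(),
        "diff.txt": result.to_diff(),
    }
    written = {}
    for name, text in payloads.items():
        path = out / name
        path.write_text(text, encoding="utf-8")
        written[name] = path
    return written


class _StubResult:
    """Hypothetical stand-in for GeneratedResult, used only for this demo."""
    formatted = "stub gector output"

    def to_tsv(self):
        return "stub src\tstub tgt\n"

    def to_jsonl(self):
        return '{"src": "stub src", "tgt": "stub tgt", "errors": []}\n'

    def to_chat(self):
        return '{"messages": []}\n'

    def to_diff(self):
        return "stub diff\n"


with tempfile.TemporaryDirectory() as tmp:
    files = export_formats(_StubResult(), tmp)
    names = sorted(files)
    tsv_text = files["pairs.tsv"].read_text(encoding="utf-8")
    print(names)
```

With a real result, the call would presumably be export_formats(pipeline.generate(...), "out/"), with the arguments to generate() left to the Python API documentation.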

Rule-targeted SFT

The separate synterr generate-targeted command writes {src, tgt, rule} JSONL in which each line is the result of force-applying one LoRuGEC rule. A .dist.json sidecar records the per-rule generation count. See synterr.sft.generate_targeted for the Python API.
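
To sanity-check a targeted run, the per-rule counts can be recomputed from the JSONL itself and compared by eye against the sidecar. The sketch below tallies the rule field with the standard library only. The rule identifiers and sentences are hypothetical placeholders, and the sidecar's internal layout is not specified here, so the example deliberately does not parse it.

```python
import json
from collections import Counter


def count_rules(lines):
    """Tally how many records were generated for each rule."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        counts[json.loads(line)["rule"]] += 1
    return counts


# Hypothetical records; real rule names come from the LoRuGEC rule set.
sample_lines = [
    json.dumps({"src": "s1", "tgt": "t1", "rule": "RULE_A"}),
    json.dumps({"src": "s2", "tgt": "t2", "rule": "RULE_B"}),
    json.dumps({"src": "s3", "tgt": "t3", "rule": "RULE_A"}),
]

rule_counts = count_rules(sample_lines)
for rule, n in sorted(rule_counts.items()):
    print(rule, n)
```

For a file on disk, passing an open file handle works as well, since count_rules only iterates over lines.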