Output formats
Synterr emits corrupted text in several output formats. On the CLI, select one with the `-f`/`--output-format` flag of `synterr generate`.
CLI formats
| Flag | When to use |
|---|---|
| `gector` | GECToR token-level edit tags (default). Compatible with the GECToR training format. |
| `tsv` | Parallel `src\ttgt` lines. Compatible with most seq2seq pipelines. |
| `jsonl` | Rich JSON per line: `src`, `tgt`, `errors[]` with `type`/`category`/`schema_tag`/`schema_l2_tag`. |
| `chat` | Instruction-tuning chat format (`messages: [...]`). For QLoRA / SFT fine-tuning of chat LLMs. |
| `sft` | Minimal `{src, tgt}` JSONL. Compatible with standard SFT trainers. |
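As a rough illustration of how the `jsonl` records relate to the simpler formats, the sketch below parses one rich record and derives a `tsv` line and a minimal `sft` record from it. The sample record is invented for illustration: only the field names (`src`, `tgt`, `errors`, `type`, `category`, `schema_tag`, `schema_l2_tag`) come from the table above, while the field values, the nesting inside `errors`, and the helper functions are assumptions rather than Synterr's own code.

```python
import json

# Hypothetical sample line; field names follow the table, values are made up.
sample_line = json.dumps({
    "src": "She go to school.",
    "tgt": "She goes to school.",
    "errors": [
        {
            "type": "verb_agreement",
            "category": "grammar",
            "schema_tag": "EXAMPLE_TAG",
            "schema_l2_tag": "EXAMPLE_L2_TAG",
        }
    ],
})


def to_tsv_line(record: dict) -> str:
    """Join src and tgt with a tab, mirroring the tsv format."""
    return f"{record['src']}\t{record['tgt']}"


def to_sft_record(record: dict) -> dict:
    """Keep only src and tgt, mirroring the minimal sft format."""
    return {"src": record["src"], "tgt": record["tgt"]}


record = json.loads(sample_line)
tsv_line = to_tsv_line(record)
sft_line = json.dumps(to_sft_record(record))
error_types = [error["type"] for error in record["errors"]]
```

Going in this direction loses the `errors` annotations, which is why the rich `jsonl` format is the safer choice when the downstream consumer is not yet fixed.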
Python API
Beyond the CLI, the `GeneratedResult` object returned by `pipeline.generate()` exposes:
- `result.formatted`: GECToR token-level tags (string).
- `result.to_tsv()`: parallel src/tgt.
- `result.to_jsonl()`: rich per-record JSON.
- `result.to_chat()`: instruction-tuning chat format.
- `result.to_diff()`: human-readable inline diff (not exposed via the CLI).
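A minimal sketch of how these accessors might be used to write one file per representation. The `StubResult` class is a hypothetical stand-in so the example runs without Synterr installed, and `export_all` and the output file names are invented for illustration. Only the attribute and method names come from the list above; it is an assumption that each `to_*()` method returns a string.

```python
import tempfile
from pathlib import Path


class StubResult:
    """Hypothetical stand-in for GeneratedResult; real return types may differ."""

    formatted = "<gector-tagged text>"

    def to_tsv(self) -> str:
        return "She go home.\tShe goes home."

    def to_jsonl(self) -> str:
        return '{"src": "She go home.", "tgt": "She goes home.", "errors": []}'

    def to_chat(self) -> str:
        return '{"messages": []}'

    def to_diff(self) -> str:
        return "<inline diff>"


def export_all(result, out_dir: Path) -> dict:
    """Write each representation to its own file and return the paths by name."""
    outputs = {
        "out.gector": result.formatted,  # an attribute, not a method
        "out.tsv": result.to_tsv(),
        "out.jsonl": result.to_jsonl(),
        "out.chat.jsonl": result.to_chat(),
        "out.diff": result.to_diff(),
    }
    paths = {}
    for name, text in outputs.items():
        path = out_dir / name
        path.write_text(text + "\n", encoding="utf-8")
        paths[name] = path
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = export_all(StubResult(), Path(tmp))
    tsv_text = written["out.tsv"].read_text(encoding="utf-8")
```

With Synterr installed, the object returned by `pipeline.generate()` would presumably be passed in place of `StubResult()`.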
Rule-targeted SFT
The separate `synterr generate-targeted` command writes `{src, tgt, rule}` JSONL, one line for each force-applied LoRuGEC rule. A `.dist.json` sidecar records the per-rule generation count. See `synterr.sft.generate_targeted` for the Python API.
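To make the relationship between the JSONL output and the sidecar concrete, the sketch below tallies records per rule and compares the tally with a sidecar mapping. The sample records and rule names are invented, and the sidecar layout (a flat JSON object mapping rule name to count) is an assumption; the actual `.dist.json` schema may differ.

```python
import json
from collections import Counter

# Hypothetical generate-targeted output: one {src, tgt, rule} record per line.
targeted_jsonl = "\n".join([
    json.dumps({"src": "He go home.", "tgt": "He goes home.", "rule": "rule_a"}),
    json.dumps({"src": "I has a cat.", "tgt": "I have a cat.", "rule": "rule_a"}),
    json.dumps({"src": "A apple fell.", "tgt": "An apple fell.", "rule": "rule_b"}),
])

# Assumed sidecar layout: {rule name: generation count}.
sidecar = json.loads('{"rule_a": 2, "rule_b": 1}')


def count_rules(jsonl_text: str) -> Counter:
    """Count records per rule, skipping blank lines."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[json.loads(line)["rule"]] += 1
    return counts


rule_counts = count_rules(targeted_jsonl)

# Rules whose observed count disagrees with the sidecar: {rule: (observed, expected)}.
mismatches = {
    rule: (rule_counts.get(rule, 0), expected)
    for rule, expected in sidecar.items()
    if rule_counts.get(rule, 0) != expected
}
```

A check like this can serve as a quick sanity test that a targeted dataset was not truncated or filtered after generation.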