synterr

Reproducible synthetic error generation for Grammatical Error Correction.

Synterr corrupts clean text with linguistically motivated errors and labels each one with the rule it violates. The output is training data for GEC models, with two properties that most synthetic-corruption tools lack:

  • Every error has a defensible label. Each corruption maps to a Rozental § paragraph (or RLC tag, or ERRANT tag). You can filter, re-weight, and audit by rule.
  • Every error has a syntactic justification. Government, agreement, and punctuation handlers use dependency-tree heuristics to fire on the right syntactic positions, not arbitrary tokens.

Where to start

Quick taste

Tagged corruptions: second locative, double comparative, asyndetic comma

Every corruption carries its handler-level error type. If you select a schema at generation time (--schema rozental), each error is additionally labeled with §-grounded taxonomy tags:

A JSONL record produced with --schema rozental: handler-owned fields plus schema_tag, schema_l2_tag and schema_l2_applicability
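
As a rough illustration of filtering and auditing by rule, the sketch below reads JSONL records and groups them by schema_tag. Only the field names schema_tag, schema_l2_tag and schema_l2_applicability come from the description above; the sample values, the error_type key, and the helper names are hypothetical and may not match what synterr actually emits.

```python
import json
from collections import Counter

# Hypothetical sample data. Only the three schema_* field names are taken
# from the documentation; every value and the "error_type" key are invented.
SAMPLE_JSONL = """\
{"error_type": "type_a", "schema_tag": "tag-1", "schema_l2_tag": "l2-x", "schema_l2_applicability": "placeholder"}
{"error_type": "type_b", "schema_tag": "tag-2", "schema_l2_tag": "l2-y", "schema_l2_applicability": "placeholder"}
{"error_type": "type_a", "schema_tag": "tag-1", "schema_l2_tag": "l2-x", "schema_l2_applicability": "placeholder"}
"""


def read_records(lines):
    """Parse one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_tag(records, key="schema_tag"):
    """Audit: count how many errors carry each tag."""
    return Counter(record.get(key) for record in records)


def filter_by_tag(records, tag, key="schema_tag"):
    """Filter: keep only the errors labeled with the given tag."""
    return [record for record in records if record.get(key) == tag]


records = list(read_records(SAMPLE_JSONL.splitlines()))
counts = count_by_tag(records)
only_tag_2 = filter_by_tag(records, "tag-2")
```

To run the same helpers on a real output file, pass an open file handle to read_records in place of SAMPLE_JSONL.splitlines(). The per-tag counts could also serve as a starting point for re-weighting, for example by sampling rare tags more often.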

Citation

If you use synterr in research, please see the Citation block in the README.