Grammatical error correction,
evaluated honestly.
Most GEC benchmarks collapse heterogeneous behavior into a single F-score and miss the places where models silently regress. We build rule-grounded synthetic data and per-rule diagnostics that surface what aggregate metrics hide. The current focus is Russian.
Мы гуляли в лесе → лесу весь день. ("We walked in the forest all day.")
Accepted at ACL BEA 2026
What Aggregate Scores Hide
Across eight open models from 0.8B to 12B parameters, fine-tuning on synthetic data raises overall F0.5 on LoRuGEC — while driving subordinate-clause comma accuracy from 14% to 1%. Aggregate scores can't detect this; per-rule diagnostics can. Once detected, it's fixable.
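To make the contrast concrete, here is a minimal sketch of the two views of one evaluation: a single F0.5 computed from edit counts, and accuracy broken out by rule. The function names, the rule labels, and the toy records are assumptions made for illustration; they are not the paper's evaluation code or its results.

```python
from collections import defaultdict


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from true-positive, false-positive and false-negative edit counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def per_rule_accuracy(records):
    """records: iterable of (rule_id, is_correct) pairs, one per test item."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rule, is_correct in records:
        totals[rule] += 1
        hits[rule] += int(is_correct)
    return {rule: hits[rule] / totals[rule] for rule in totals}


# Toy records, invented for illustration only (hypothetical rule labels).
records = (
    [("agreement", True)] * 9
    + [("agreement", False)]
    + [("subclause_comma", False)] * 2
)
pooled = sum(ok for _, ok in records) / len(records)
print(f"pooled accuracy: {pooled:.2f}")
for rule, acc in per_rule_accuracy(records).items():
    print(f"{rule}: {acc:.2f}")
```

In the toy data the pooled number looks acceptable because the frequent rule dominates it, while the rare rule fails on every item. The per-rule breakdown is what makes that visible.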
Per-rule diagnostics
A browsable taxonomy and per-rule scores for evaluating Russian GEC systems honestly. The cell-coloured xlsx files are parsed natively in your browser and can be sorted, filtered, and downloaded.
Reproducible pipelines
Every released dataset ships with a pinned generation commit, SHA-256 checksums, and a verifier script. The v4 SFT data is byte-verified against the file the paper trained on.
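For readers who want to see what a checksum verification of this kind involves, the following is a minimal sketch and not the project's verifier script. It assumes a manifest in the common `sha256sum` layout (hex digest, whitespace, file name, one entry per line) sitting next to the data files; the manifest name and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return the names of listed files whose digest differs from the manifest.

    Assumed manifest format (sha256sum style): '<hex digest>  <file name>'.
    A listed file that is missing raises FileNotFoundError.
    """
    manifest_path = Path(manifest_path)
    mismatches = []
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum prefixes binary-mode names with '*'
        if sha256_of(manifest_path.parent / name) != expected.lower():
            mismatches.append(name)
    return mismatches
```

An empty list from `verify_manifest("SHA256SUMS")` (hypothetical file name) would mean every listed file matches its recorded digest byte for byte.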
Errors are placed by the parse, not by chance.
Three rule-grounded error sites on one sentence: agreement (amod), a subordinate-clause boundary (acl:relcl), and case government (obl). Handlers fire only where the dependency tree licenses the error, and they refuse where the "corruption" would be correct Russian, so a non-error never enters the training data.
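The licensing-and-refusal idea can be sketched in a few lines. This is an illustrative toy, not the Synterr handlers: the `Token` type, the `corrupt_obl` function, the `wrong_forms` lookup, and the `also_correct` set are all hypothetical, and the parse below is written by hand rather than produced by a parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    deprel: str  # Universal Dependencies relation to the head, e.g. "obl"
    head: int    # index of the head token, -1 for the root


def corrupt_obl(tokens, index, wrong_forms, also_correct):
    """Propose a case-government error at tokens[index], or return None.

    wrong_forms: toy lookup from a correct form to an erroneous variant.
    also_correct: (head form, variant) pairs that are valid Russian, so
    substituting the variant there would not create an error.
    """
    token = tokens[index]
    if token.deprel != "obl" or token.head < 0:
        return None  # the parse does not license this error site
    variant = wrong_forms.get(token.form)
    if variant is None:
        return None  # no known erroneous variant for this form
    if (tokens[token.head].form, variant) in also_correct:
        return None  # refuse: the "corruption" would still be correct
    forms = [t.form for t in tokens]
    forms[index] = variant
    return forms


# Hand-written toy parse of "Мы гуляли в лесу весь день."
sentence = [
    Token("Мы", "nsubj", 1),
    Token("гуляли", "root", -1),
    Token("в", "case", 3),
    Token("лесу", "obl", 1),
    Token("весь", "det", 5),
    Token("день", "obl", 1),
    Token(".", "punct", 1),
]
corrupted = corrupt_obl(sentence, 3, {"лесу": "лесе"}, also_correct=set())
print(" ".join(corrupted) if corrupted else "no error site")
```

The point of the design is that the handler has three ways to say no (wrong relation, no known variant, variant still grammatical) and only one way to say yes, so a sentence enters the training data as an error only when all checks pass.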
Recent
- Synterr v1.0.1 on Zenodo. Archived under DOI 10.5281/zenodo.20182862.
- Site online. Browsable tables, reasoning-chain viewer, graphical overview.
- BEA 2026 acceptance. "What Aggregate Scores Hide" accepted at the 21st BEA workshop.
- Synterr v1.0. Paper release with reproducibility pack (checksums, verify script, pinned commit).
- v4 dataset frozen. 39,209 SFT examples, deterministic build pipeline.