Grammatical error correction,
evaluated honestly.

Most GEC benchmarks summarize heterogeneous behavior into one F-score and miss the parts where models silently regress. We build the rule-grounded synthetic data and per-rule diagnostics that surface what aggregate metrics hide. Currently focused on Russian.

Мы гуляли в ~~лесе~~ лесу весь день. ("We walked in the forest all day.")

Real output: synterr corrupts clean text into a second-locative error and labels it noun_case_prep_e_u (§152), one of 75 error types placed by the dependency tree, not by chance.

Read the BEA 2026 overview · Code on GitHub

Accepted at ACL BEA 2026

What Aggregate Scores Hide

Across eight open models from 0.8B to 12B parameters, fine-tuning on synthetic data raises overall F_0.5 on LoRuGEC — while driving subordinate-clause comma accuracy from 14% to 1%. Aggregate scores can't detect this; per-rule diagnostics can. Once detected, it's fixable.
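
A minimal sketch of how this can happen, assuming F_0.5 is computed from edit counts pooled over all rules. The rule names and every count below are invented for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # correct edits proposed
    fp: int  # spurious edits proposed
    fn: int  # gold edits missed


def f_beta(c: Counts, beta: float = 0.5) -> float:
    """F-beta from edit counts; beta = 0.5 weights precision over recall."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pooled(per_rule: dict[str, Counts]) -> Counts:
    """Micro-average: sum the counts over all rules before scoring."""
    return Counts(
        tp=sum(c.tp for c in per_rule.values()),
        fp=sum(c.fp for c in per_rule.values()),
        fn=sum(c.fn for c in per_rule.values()),
    )


# Invented counts, for illustration only.
before = {
    "spelling": Counts(50, 30, 50),
    "subordinate_clause_comma": Counts(14, 6, 86),
}
after = {
    "spelling": Counts(80, 10, 20),
    "subordinate_clause_comma": Counts(1, 1, 99),
}

for name, run in (("before", before), ("after", after)):
    per_rule = {rule: round(f_beta(c), 3) for rule, c in run.items()}
    print(name, round(f_beta(pooled(run)), 3), per_rule)
```

With these toy numbers the pooled score rises while the comma rule collapses, because gains on the frequent rule outweigh the loss on the rare one. Only the per-rule breakdown makes the regression visible.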

graphical overview · paper landing · per-rule data · reasoning chains

§ 1

Synterr, the generator

An open-source synthetic-error generator for Russian: 28 handlers, 75 error types, each grounded in the dependency tree and mapped to a 100-category taxonomy from Rozental's reference grammar. Errors carry their rule — filter, re-weight, and audit by it.
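
A minimal sketch of what rule-labelled data allows, assuming each example is a record with a `rule` field. The record schema, the `hypothetical_rule_a` label, and the helper names are assumptions made for illustration, not synterr's actual output format or API.

```python
from collections import Counter

# Hypothetical records; only noun_case_prep_e_u is a rule name taken from this page.
examples = [
    {"rule": "noun_case_prep_e_u",
     "src": "Мы гуляли в лесе весь день.",
     "tgt": "Мы гуляли в лесу весь день."},
    {"rule": "hypothetical_rule_a", "src": "...", "tgt": "..."},
    {"rule": "hypothetical_rule_a", "src": "...", "tgt": "..."},
    {"rule": "hypothetical_rule_a", "src": "...", "tgt": "..."},
]


def audit(records):
    """Audit: count how many examples each rule contributes."""
    return Counter(r["rule"] for r in records)


def keep_rules(records, rules):
    """Filter: keep only examples whose rule is in `rules`."""
    wanted = set(rules)
    return [r for r in records if r["rule"] in wanted]


def balanced_weights(records):
    """Re-weight: give every rule the same total weight, with mean weight 1."""
    counts = audit(records)
    n, k = len(records), len(counts)
    return [n / (k * counts[r["rule"]]) for r in records]


print(audit(examples))
print(len(keep_rules(examples, ["noun_case_prep_e_u"])))
print(balanced_weights(examples))
```

Inverse-frequency weighting is only one possible choice; the same `rule` field would equally support capping or upsampling a rule that a diagnostic flags as regressing.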

docs · github

§ 2

Per-rule diagnostics

A browsable taxonomy and per-rule scores for evaluating Russian GEC systems honestly. The cell-coloured xlsx files are parsed natively in your browser and are sortable, filterable, and downloadable.

browse the taxonomy

§ 3

Reproducible pipelines

Every released dataset ships with a pinned generation commit, SHA-256 checksums, and a verifier script. The v4 SFT data is byte-verified against the file the paper trained on.
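
A minimal sketch of the kind of check a verifier performs, assuming the checksums are stored in the common `sha256sum` manifest layout (one `<hex digest>  <file name>` pair per line). The manifest layout and the function names are assumptions; the released verifier script may work differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest: Path) -> list[str]:
    """Return the names of files that are missing or whose digest differs."""
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty return value means every listed file matches its recorded digest byte for byte; any other result names the files to re-download or regenerate.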

v4 provenance

A Russian sentence with three rule-grounded error sites annotated on its dependency tree.

Errors are placed by the parse, not by chance.

Three rule-grounded error sites on one sentence: agreement (amod), a subordinate-clause boundary (acl:relcl), case government (obl). Handlers fire only where the dependency tree licenses the error — and refuse where the "corruption" would be correct Russian, so a non-error never enters the training data.
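
A toy sketch of the fire-or-refuse idea for the second-locative rule, assuming a pre-parsed sentence in a Universal Dependencies style. The `Token` class, the two-word lexicon, the list of nouns that accept both endings, and the function name are all hypothetical stand-ins written for this example; they are not synterr's handlers or data.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    deprel: str
    head: int  # index of the head token, -1 for the root


# Hypothetical mini-lexicon: lemma -> (correct -у form, erroneous -е form).
SECOND_LOCATIVE = {"лес": ("лесу", "лесе"), "цех": ("цеху", "цехе")}
# Nouns for which both endings are acceptable: swapping them creates no error.
BOTH_ACCEPTABLE = {"цех"}
LOCATIVE_PREPS = {"в", "на"}


def corrupt_second_locative(tokens):
    """Return (corrupted forms, label), or None when the handler refuses."""
    for i, tok in enumerate(tokens):
        if tok.deprel != "obl" or tok.lemma in BOTH_ACCEPTABLE:
            continue
        pair = SECOND_LOCATIVE.get(tok.lemma)
        if pair is None or tok.form.lower() != pair[0]:
            continue
        # The tree must attach a locative preposition to this noun.
        licensed = any(
            t.head == i and t.deprel == "case" and t.lemma in LOCATIVE_PREPS
            for t in tokens
        )
        if not licensed:
            continue
        forms = [t.form for t in tokens]
        forms[i] = pair[1]
        return forms, {"rule": "noun_case_prep_e_u", "index": i}
    return None


sentence = [
    Token("Мы", "мы", "nsubj", 1),
    Token("гуляли", "гулять", "root", -1),
    Token("в", "в", "case", 3),
    Token("лесу", "лес", "obl", 1),
    Token("весь", "весь", "det", 5),
    Token("день", "день", "obl", 1),
    Token(".", ".", "punct", 1),
]
print(corrupt_second_locative(sentence))
```

In this sketch "к лесу" is left alone because no locative preposition is attached to the noun, and "в цеху" is left alone because "в цехе" is also acceptable, so the swap would label a non-error.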

Recent

2026-05-14 Synterr v1.0.1 on Zenodo. Archived at 10.5281/zenodo.20182862.
2026-05-09 Site online. Browsable tables, reasoning-chain viewer, graphical overview.
2026-05-04 BEA 2026 acceptance. "What Aggregate Scores Hide" accepted at the 21st BEA workshop.
2026-05-01 Synterr v1.0. Paper release with reproducibility pack (checksums, verify script, pinned commit).
2026-03-22 v4 dataset frozen. 39,209 SFT examples, deterministic build pipeline.
