What Aggregate Scores Hide

Per-Rule Evaluation of Russian Grammatical Error Correction

Anna Smirnova¹, Artyom Kopan², Vladislav Makeev², George Chernishev²
¹HEC Lausanne, University of Lausanne   ²Saint Petersburg State University
Accepted at ACL BEA 2026

Models can improve on a benchmark while getting worse at specific things, and aggregate metrics can't tell you which. We built a diagnostic that can: a 98-category error taxonomy, grounded in Rozental's reference grammar, that lets us evaluate Russian grammatical error correction rule by rule. The diagnostic surfaced what the aggregate hid.

Aggregate F0.5 rises with parameter count across all eight fine-tuned models, while accuracy on the subordinate-clause comma rule collapses to 1%, a drop the average hides. (Schematic; the per-rule numbers behind it are in /tables/per-rule/.)

For the full argument as a four-panel walkthrough, see the graphical overview.

Across eight models from 0.8B to 12B parameters, fine-tuning on synthetic data raised overall F0.5 but drove subordinate-clause comma accuracy from 14% to 1%. The cause is directional skew in the training data: SyntErr generates comma-insertion errors 3.6× more often than comma-deletion errors, so models learn to preserve commas rather than remove them. Aggregate F0.5 still rose because spelling gains outweighed punctuation losses. Continuation training on 348 real examples restores accuracy on the affected rules from 1% to 69%; the suppression is reversible, but only once it is diagnosed.
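To make the masking effect concrete, here is a minimal sketch of how a pooled F0.5 can rise while one rule collapses. It assumes micro-averaged edit counts (true positives, false positives, false negatives) and uses per-rule recall as a stand-in for per-rule accuracy; the paper's exact scoring may differ. The `Counts` class, the rule names, and every number below are invented for illustration, and none of them come from the paper's tables.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # correct edits proposed by the model
    fp: int  # spurious edits
    fn: int  # gold edits the model missed


def f_beta(c: Counts, beta: float = 0.5) -> float:
    """F-beta from edit counts; beta < 1 emphasises precision over recall."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def recall_of(c: Counts) -> float:
    """Share of gold edits for a rule that the model actually made."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def aggregate(per_rule: dict[str, Counts]) -> Counts:
    """Pool counts over all rules (micro-averaging)."""
    return Counts(
        tp=sum(c.tp for c in per_rule.values()),
        fp=sum(c.fp for c in per_rule.values()),
        fn=sum(c.fn for c in per_rule.values()),
    )


# Toy counts, invented for illustration only (not the paper's data).
# A large category improves a lot; a small one collapses.
before = {
    "spelling": Counts(tp=400, fp=200, fn=400),
    "subordinate_clause_comma": Counts(tp=14, fp=6, fn=86),
}
after = {
    "spelling": Counts(tp=700, fp=100, fn=100),
    "subordinate_clause_comma": Counts(tp=1, fp=1, fn=99),
}

for name, rules in (("before", before), ("after", after)):
    print(name, "aggregate F0.5 =", round(f_beta(aggregate(rules)), 3))
    for rule, c in rules.items():
        print("   ", rule, "recall =", round(recall_of(c), 3))
```

Because the pooled counts are dominated by the larger category, its gains swamp the smaller category's losses in the aggregate; only the per-rule breakdown exposes the collapse.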

Browse the data

Training data & reproducibility

The v4 SFT data is pinned to a specific generation commit and verified byte-for-byte, via SHA-256, against the file the paper's models were trained on. Run scripts/verify_v4.py after cloning to confirm the match.
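For readers who want to see what such a check involves, the following is a minimal sketch of SHA-256 file verification using Python's standard library. It is not the contents of scripts/verify_v4.py; the function names `sha256_of` and `verify` are hypothetical, and the data file path and pinned digest must be taken from the repository itself.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large data files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """True if the file's bytes hash to the pinned digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `verify(Path(data_file), pinned_digest)` returns `True` only when every byte matches, so any re-generation or silent edit of the training file would be detected.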

Download

Cite

Paper:

@inproceedings{smirnova2026aggregate,
  title     = {What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction},
  author    = {Smirnova, Anna and Kopan, Artyom and Makeev, Vladislav and Chernishev, George},
  booktitle = {Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA)},
  year      = {2026},
  url       = {https://synterr-nlp.github.io/papers/bea-2026/},
}

Software release:

@software{synterr_2026,
  title   = {synterr: rule-grounded synthetic error generation for Russian GEC},
  author  = {Smirnova, Anna and Kopan, Artyom and Makeev, Vladislav and Chernishev, George},
  year    = {2026},
  version = {v1.0.1},
  doi     = {10.5281/zenodo.20182862},
  url     = {https://github.com/synterr-nlp/synterr},
}
