What Aggregate Scores Hide
Per-Rule Evaluation of Russian Grammatical Error Correction
Anna Smirnova¹,
Artyom Kopan²,
Vladislav Makeev²,
George Chernishev²
¹HEC Lausanne, University of Lausanne
²Saint Petersburg State University
Accepted at ACL BEA 2026
Models can improve on a benchmark while getting worse at specific error types, and aggregate metrics cannot show which. We built a diagnostic that can: a 98-category error taxonomy grounded in Rozental's reference grammar that lets us evaluate Russian grammatical error correction rule by rule. The diagnostic surfaced what the aggregate hid.
For the full argument as a four-panel walkthrough, see the graphical overview.
Across eight models from 0.8B to 12B parameters, fine-tuning on synthetic data raised overall F0.5 but drove subordinate-clause comma accuracy from 14% to 1%. The cause is directional skew in the training data: SyntErr generates comma insertion errors 3.6× more often than deletion errors, so models learn to preserve commas rather than remove them. Aggregate F0.5 rose because spelling gains outweighed punctuation losses. Continuation training on 348 real examples recovers the affected rules from 1% to 69%; the suppression is reversible, but only once it is diagnosed.
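The gap between aggregate and per-rule scores described above can be illustrated with a minimal sketch. It is not the paper's evaluation code: it uses plain accuracy instead of F0.5, the rule tags `spelling` and `subordinate_comma` are hypothetical labels, and the records are invented toy data chosen only to show the effect, not results from the paper.

```python
from collections import defaultdict


def per_rule_accuracy(records):
    """Return {rule_tag: accuracy} for an iterable of (rule_tag, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rule, is_correct in records:
        totals[rule] += 1
        hits[rule] += int(is_correct)
    return {rule: hits[rule] / totals[rule] for rule in totals}


def aggregate_accuracy(records):
    """Return overall accuracy, ignoring which rule each record belongs to."""
    records = list(records)
    return sum(int(is_correct) for _, is_correct in records) / len(records)


# Toy data (hypothetical): eight spelling items and two comma items per condition.
before = (
    [("spelling", True)] * 2 + [("spelling", False)] * 6
    + [("subordinate_comma", True)] * 1 + [("subordinate_comma", False)] * 1
)
after = (
    [("spelling", True)] * 7 + [("spelling", False)] * 1
    + [("subordinate_comma", False)] * 2
)

for name, records in (("before", before), ("after", after)):
    print(name, "aggregate:", aggregate_accuracy(records))
    print(name, "per rule:", per_rule_accuracy(records))
```

In this toy setup the aggregate rises because the frequent category improves, while the rare category drops to zero. A single overall number cannot show that second change.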
Browse the data
- Taxonomy cross-reference (98 fine tags × corpus tagsets)
- Evidence table (213 §§ × 4 corpora)
- Per-rule LoRuGEC scores (48 × 35 conditions)
- GERA reclassification (5,988 errors → Rozental L2)
- Reasoning chains: Qwen3.5-9B think (612 chains)
- Reasoning chains: Qwen3.5-35B think (612 chains)
Training data & reproducibility
The v4 SFT data is pinned to a specific generation commit, and its SHA-256 checksum is verified byte-for-byte against the file the paper's models were trained on. Run scripts/verify_v4.py after cloning to confirm.
- v4 SFT training data (39,209 examples, 21 MB) qwen_sft_v4.jsonl
- Source-mix corpus (150K sentences, 30 MB) mixed_sources_v4.txt
- Full pipeline documentation V4_DATA_PROVENANCE.md
- SHA-256 checksums (all v4 artifacts) v4_checksums.txt
- Verifier script verify_v4.py
- Per-rule generation distribution qwen_sft_v4.dist.json
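The byte-level check described above can be sketched as follows. This is not the project's verify_v4.py. It assumes the checksum file uses the common `sha256sum` layout (one `<hex digest>  <filename>` pair per line), which the page does not confirm. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse '<hex digest>  <filename>' lines (assumed sha256sum-style layout)."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*' on the filename.
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify(checksum_file, base_dir="."):
    """Return {filename: True/False}; a missing file counts as a failure."""
    expected = parse_checksums(Path(checksum_file).read_text(encoding="utf-8"))
    results = {}
    for name, digest in expected.items():
        target = Path(base_dir) / name
        results[name] = target.is_file() and sha256_of(target) == digest
    return results
```

Under those assumptions, a call such as `verify("v4_checksums.txt", base_dir=".")` would report each listed artifact as matching or not. The real script may differ in layout and behavior.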
Download
- Paper (camera-ready PDF) paper.pdf (added after May 12)
- SyntErr generator (code) github.com/synterr-nlp/synterr
- Adapter weights (29 LoRAs, 5.5 GB) HF: synterr-nlp/bea2026-gec-adapters
- Training data (SyntErr v4, 39,209 examples) HF: synterr-nlp/synterr-v4-sft
- Pipeline diagram (HTML, A4 landscape) pipeline.html
- Cross-reference (raw xlsx) cross_reference.xlsx
- Evidence table (raw xlsx) evidence_table.xlsx
- Per-rule LoRuGEC eval (CSV) per_rule_lorugec.csv
- Reasoning vs zero-shot (CSV, R1.Q2) per_rule_thinking_vs_zs.csv
- GERA reclassification (JSONL) classified_gera.jsonl
- Qwen3.5-9B reasoning chains (JSONL) qwen35_9b_think_lorugec_test.jsonl
- Qwen3.5-35B reasoning chains (JSONL) qwen35_35b_think_lorugec_test.jsonl
- Whole BEA supplementary bundle supplementary.zip
- UNIL AI Day abstract unil_ai_day_abstract.pdf (added when finalized)
- Poster poster.pdf (in progress)
Cite
Paper:
@inproceedings{smirnova2026aggregate,
title = {What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction},
author = {Smirnova, Anna and Kopan, Artyom and Makeev, Vladislav and Chernishev, George},
booktitle = {Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA)},
year = {2026},
url = {https://synterr-nlp.github.io/papers/bea-2026/},
}
Software release:
@software{synterr_2026,
title = {synterr: rule-grounded synthetic error generation for Russian GEC},
author = {Smirnova, Anna and Kopan, Artyom and Makeev, Vladislav and Chernishev, George},
year = {2026},
version = {v1.0.1},
doi = {10.5281/zenodo.20182862},
url = {https://github.com/synterr-nlp/synterr},
}