What Aggregate Scores Hide

Per-Rule Evaluation of Russian Grammatical Error Correction

Anna Smirnova¹, Artyom Kopan², Vladislav Makeev², George Chernishev²
¹HEC Lausanne, University of Lausanne ²Saint Petersburg State University
Accepted at ACL BEA 2026

Models can improve on a benchmark while getting worse at specific error types, and aggregate metrics cannot show it. We built a diagnostic that can: a 98-category error taxonomy grounded in Rozental's reference grammar that lets us evaluate Russian grammatical error correction rule by rule. The diagnostic surfaced what the aggregate hid.

Figure: Aggregate F_0.5 rises with parameter count across all eight fine-tuned models, while accuracy on the subordinate-clause comma rule collapses to 1%, hidden in the average. (Schematic; the per-rule numbers behind it are in /tables/per-rule/.)

For the full argument as a four-panel walkthrough, see the graphical overview.

Across eight models from 0.8B to 12B parameters, fine-tuning on synthetic data raised overall F_0.5 but drove subordinate-clause comma accuracy from 14% to 1%. The cause is directional skew in the training data: SyntErr generates comma-insertion errors 3.6× more often than comma-deletion errors, so models learn to preserve commas rather than remove them. Aggregate F_0.5 rose because spelling gains outweighed punctuation losses. Continuation training on 348 real examples raises accuracy on the affected rules from 1% to 69%; the suppression is reversible, but only once it is diagnosed.
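The masking effect described above can be illustrated with a minimal sketch. It computes F_0.5 per rule and as a micro-average over all rules from true-positive, false-positive, and false-negative edit counts. The rule names and every count below are invented toy values chosen only to show the effect; they are not the paper's data, and the paper's own scorer may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Edit-level counts for one rule: true/false positives, false negatives."""
    tp: int
    fp: int
    fn: int


def f_beta(c: Counts, beta: float = 0.5) -> float:
    """F_beta from counts; beta = 0.5 weights precision above recall."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def micro_average(per_rule: dict[str, Counts]) -> Counts:
    """Pool counts over all rules, as an aggregate score implicitly does."""
    return Counts(
        tp=sum(c.tp for c in per_rule.values()),
        fp=sum(c.fp for c in per_rule.values()),
        fn=sum(c.fn for c in per_rule.values()),
    )


# Hypothetical toy counts (NOT from the paper): a frequent rule improves,
# a rarer rule collapses.
TOY_BEFORE = {
    "spelling": Counts(tp=400, fp=200, fn=400),
    "subordinate_comma": Counts(tp=30, fp=20, fn=70),
}
TOY_AFTER = {
    "spelling": Counts(tp=700, fp=100, fn=100),
    "subordinate_comma": Counts(tp=2, fp=5, fn=98),
}

if __name__ == "__main__":
    for label, table in (("before", TOY_BEFORE), ("after", TOY_AFTER)):
        print(f"{label}: aggregate F0.5 = {f_beta(micro_average(table)):.3f}")
        for rule, counts in table.items():
            print(f"  {rule}: F0.5 = {f_beta(counts):.3f}")
```

With these toy counts the pooled score goes up while the score for the rarer rule falls sharply, which is the pattern a per-rule breakdown exposes and an aggregate conceals.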

Browse the data

Taxonomy cross-reference (98 fine tags × corpus tagsets) browse
Evidence table (213 §§ × 4 corpora) browse
Per-rule LoRuGEC scores (48 × 35 conditions) browse
GERA reclassification (5,988 errors → Rozental L2) browse
Reasoning chains: Qwen3.5-9B think (612 chains) browse
Reasoning chains: Qwen3.5-35B think (612 chains) browse

Training data & reproducibility

v4 SFT data is pinned to a specific generation commit and SHA-256 byte-verified against the file the paper's models trained on. Run scripts/verify_v4.py after cloning to confirm.

v4 SFT training data (39,209 examples, 21 MB): qwen_sft_v4.jsonl
Source-mix corpus (150K sentences, 30 MB): mixed_sources_v4.txt
Full pipeline documentation: V4_DATA_PROVENANCE.md
SHA-256 checksums (all v4 artifacts): v4_checksums.txt
Verifier script: verify_v4.py
Per-rule generation distribution: qwen_sft_v4.dist.json
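For readers who want to see what a byte-level check of this kind involves, here is a minimal sketch of SHA-256 verification against a checksum manifest. It is not the project's verify_v4.py. It assumes, without confirmation from the source, that v4_checksums.txt follows the common `sha256sum` layout of one `<hex digest>  <file name>` pair per line; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str) -> dict[str, str]:
    """Parse '<hex digest>  <file name>' lines (ASSUMED sha256sum layout)."""
    expected: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hex_digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading '*' on the file name.
        expected[name.lstrip("*")] = hex_digest.lower()
    return expected


def verify(manifest: Path, root: Path) -> dict[str, bool]:
    """Map each listed file to True if it exists and its digest matches."""
    expected = parse_manifest(manifest.read_text(encoding="utf-8"))
    return {
        name: (root / name).is_file() and sha256_of(root / name) == digest
        for name, digest in expected.items()
    }


if __name__ == "__main__":
    manifest = Path("v4_checksums.txt")
    if manifest.is_file():
        for name, ok in sorted(verify(manifest, Path(".")).items()):
            print("OK      " if ok else "MISMATCH", name)
    else:
        print("v4_checksums.txt not found; run from the directory that contains it")
```

If the manifest uses a different layout, only `parse_manifest` would need to change. The project's own verifier script remains the authoritative check.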

Download

Paper (camera-ready PDF): paper.pdf (added after May 12)
SyntErr generator (code): github.com/synterr-nlp/synterr
Adapter weights (29 LoRAs, 5.5 GB): Hugging Face, synterr-nlp/bea2026-gec-adapters
Training data (SyntErr v4, 39,209 examples): Hugging Face, synterr-nlp/synterr-v4-sft
Pipeline diagram (HTML, A4 landscape): pipeline.html
Cross-reference (raw xlsx): cross_reference.xlsx
Evidence table (raw xlsx): evidence_table.xlsx
Per-rule LoRuGEC eval (CSV): per_rule_lorugec.csv
Reasoning vs zero-shot (CSV, R1.Q2): per_rule_thinking_vs_zs.csv
GERA reclassification (JSONL): classified_gera.jsonl
Qwen3.5-9B reasoning chains (JSONL): qwen35_9b_think_lorugec_test.jsonl
Qwen3.5-35B reasoning chains (JSONL): qwen35_35b_think_lorugec_test.jsonl
Whole BEA supplementary bundle: supplementary.zip
UNIL AI Day abstract: unil_ai_day_abstract.pdf (added when finalized)
Poster: poster.pdf (in progress)

Cite

Paper:

@inproceedings{smirnova2026aggregate,
  title     = {What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction},
  author    = {Smirnova, Anna and Kopan, Artyom and Makeev, Vladislav and Chernishev, George},
  booktitle = {Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA)},
  year      = {2026},
  url       = {https://synterr-nlp.github.io/papers/bea-2026/},
}

Software release:

@software{synterr_2026,
  title   = {synterr: rule-grounded synthetic error generation for Russian GEC},
  author  = {Smirnova, Anna and Kopan, Artyom and Makeev, Vladislav and Chernishev, George},
  year    = {2026},
  version = {v1.0.1},
  doi     = {10.5281/zenodo.20182862},
  url     = {https://github.com/synterr-nlp/synterr},
}