The argument in four panels
A graphical walk through the paper's claim. Each panel is one beat: aggregate metrics rise, specific rules silently collapse, the cause is a directional bias in the synthetic data, and a brief continuation pass on real examples reverses it.
Aggregate gets better
Across eight open models from 0.8B to 12B parameters, fine-tuning on SyntErr-generated synthetic data raises overall F0.5 on LoRuGEC. The story you would tell from the leaderboard is that synthetic data works.
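For readers unfamiliar with the metric, here is a minimal sketch of F0.5, assuming the usual edit-level definition used in grammatical error correction. The source does not say how edits are extracted or matched, so the counts `tp`, `fp`, and `fn` are taken as given, and the example numbers are invented for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit counts; beta < 1 weights precision above recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented counts: 60 correct edits, 20 spurious edits, 40 missed edits.
example_score = f_beta(tp=60, fp=20, fn=40)
print(round(example_score, 3))
```

Because beta is 0.5, a model that proposes fewer but more reliable edits scores higher than one that proposes many edits with the same number of hits.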
But specific rules collapse
The diagnostic, anchored in 98 rules from Rozental's reference grammar, breaks the aggregate apart. Most rules track the trend up. A handful go the other way: subordinate-clause comma accuracy drops from 14% to 1%, and parenthetical-comma accuracy falls similarly. The aggregate hides this because the rising rules outweigh the collapsing ones.
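The masking effect can be illustrated with a small sketch. It is not the paper's evaluation code: the rule names, item counts, and the use of micro-averaged accuracy as a stand-in for the aggregate score are all assumptions made for illustration.

```python
from collections import defaultdict


def per_rule_accuracy(results):
    """results: iterable of (rule_id, is_correct) pairs, one per test item."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rule_id, is_correct in results:
        totals[rule_id] += 1
        hits[rule_id] += int(is_correct)
    return {rule: hits[rule] / totals[rule] for rule in totals}


def micro_accuracy(results):
    """Aggregate accuracy over all items, ignoring which rule they test."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


def regressed_rules(before, after, min_drop=0.05):
    """Rules whose accuracy fell by at least min_drop after fine-tuning."""
    return sorted(
        rule for rule in before
        if rule in after and before[rule] - after[rule] >= min_drop
    )


def make_results(counts):
    """Expand {rule: (n_correct, n_total)} into per-item (rule, bool) pairs."""
    return [(rule, i < c) for rule, (c, n) in counts.items() for i in range(n)]


# Hypothetical data: one frequent rule improves, one rare rule collapses.
before = make_results({"noun_case": (40, 100), "sub_clause_comma": (4, 20)})
after = make_results({"noun_case": (70, 100), "sub_clause_comma": (0, 20)})

print(micro_accuracy(before), micro_accuracy(after))
print(regressed_rules(per_rule_accuracy(before), per_rule_accuracy(after)))
```

In this toy setup the aggregate rises even though one rule drops to zero, because the frequent rule contributes five times as many items.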
The mechanism is directional skew
The training data is not symmetric: SyntErr generates comma insertion errors 3.6× more often than comma deletion errors. A model trained on this data learns a one-sided rule: "preserve commas". On LoRuGEC test items where the gold edit removes a comma, the model leaves the comma in place.
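A skew of this kind can be measured before training. The sketch below classifies each (erroneous, corrected) pair by whether the correction adds or removes commas, which is one reading of the insertion/deletion split. It is a crude heuristic based on net comma counts, not SyntErr's own bookkeeping, and the English sentence pairs are invented.

```python
from collections import Counter


def comma_direction(erroneous: str, corrected: str) -> str:
    """Label a pair by the net change in comma count made by the correction."""
    diff = corrected.count(",") - erroneous.count(",")
    if diff > 0:
        return "insert"
    if diff < 0:
        return "delete"
    return "none"  # no net change; a moved comma also lands here


def direction_counts(pairs):
    """Count how many pairs fall into each direction."""
    return Counter(comma_direction(e, c) for e, c in pairs)


def skew_ratio(counts):
    """Insert-to-delete ratio; infinite if no pair deletes a comma."""
    if counts["delete"] == 0:
        return float("inf")
    return counts["insert"] / counts["delete"]


# Invented (erroneous, corrected) pairs for illustration.
pairs = [
    ("However we left early.", "However, we left early."),
    ("If it rains we stay in.", "If it rains, we stay in."),
    ("She said, that it was late.", "She said that it was late."),
    ("We left.", "We left."),
]

counts = direction_counts(pairs)
print(counts["insert"], counts["delete"], skew_ratio(counts))
```

A ratio far from 1 suggests the model will see one direction of edit much more often than the other. An alignment-based edit extractor would be more precise than net comma counts.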
A brief continuation reverses it
The suppressed rules can be recovered with a short training pass on real examples that exhibit the missing direction. With just 348 LoRuGEC sentences, subordinate-clause comma accuracy rises from 1% to 69%, while the rules that improved through SyntErr stay where they were.
Takeaway
Aggregate metrics make synthetic-data fine-tuning look like a uniform improvement. Per-rule evaluation makes the directional damage visible. Once visible, it is fixable. The taxonomy is the diagnostic; the diagnostic is the fix.