The argument in four panels

A graphical walk through the paper's claim. Each panel is one beat: aggregate metrics rise, specific rules silently collapse, the cause is a directional bias in the synthetic data, and a brief continuation pass on real examples reverses it.

Aggregate gets better

Across eight open models from 0.8B to 12B parameters, fine-tuning on SyntErr-generated synthetic data raises overall F₀.₅ on LoRuGEC. The story you would tell from the leaderboard is that synthetic data works.

Figure: Aggregate F₀.₅ rises with model size while one specific rule drops. Grey: aggregate F₀.₅. Red: subordinate-clause comma rule (Panel 2).
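For reference, F₀.₅ is the F-measure with β = 0.5, which weights precision more heavily than recall. A minimal sketch, assuming edit-level true-positive, false-positive, and false-negative counts; the function name and the zero-count convention are illustrative and not taken from the paper:

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F_beta from edit counts; beta < 1 favours precision over recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        # Assumed convention: score 0.0 when nothing was predicted correctly.
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision 0.8, recall 0.5.
score = f_beta(tp=8, fp=2, fn=8)
```

Because β² = 0.25 multiplies precision in the denominator, a system that proposes few but correct edits scores higher than one with the same recall and more false alarms.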

But specific rules collapse

The diagnostic, anchored in 98 rules from Rozental's reference grammar, breaks the aggregate apart. Most rules track the trend up. A handful go the other way: subordinate-clause comma accuracy drops from 14% to 1%, and parenthetical-comma accuracy falls similarly. The aggregate hides this because the rising rules outweigh the collapsing ones.

Figure: Eight per-rule F₀.₅ curves across 8 model sizes; six go up, two go down. Grey: improves with fine-tuning. Red: regresses.
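How an aggregate can rise while a single rule collapses can be illustrated with toy numbers. The sketch below assumes the aggregate is micro-averaged over edit counts, which this page does not state; the rule names and all counts are invented purely for illustration.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F_beta from edit counts (0.0 when there are no true positives)."""
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


def per_rule_and_aggregate(counts):
    """counts maps rule -> (tp, fp, fn); returns (per-rule scores, micro-average)."""
    per_rule = {rule: f_beta(*c) for rule, c in counts.items()}
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    return per_rule, f_beta(*totals)


# Hypothetical counts: one frequent rule improves, one rare rule collapses.
before = {"frequent_rule": (40, 40, 60), "rare_rule": (5, 5, 5)}
after = {"frequent_rule": (70, 20, 30), "rare_rule": (0, 1, 10)}

rules_before, agg_before = per_rule_and_aggregate(before)
rules_after, agg_after = per_rule_and_aggregate(after)
```

In this toy setup the aggregate improves even though `rare_rule` falls to zero, because the frequent rule contributes most of the edit counts.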

The mechanism is directional skew

The training data is not symmetric. SyntErr generates comma insertion errors 3.6× more often than comma deletion errors. A model trained on this learns a one-sided rule: "preserve commas". On LoRuGEC test items where the gold edit removes a comma, the model leaves the comma in place.

Figure: Bar chart of comma error directions in SyntErr training data, 78% insertions vs 22% deletions. The 3.6× skew explains the suppression.
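A skew of this kind can be measured directly on any set of (corrupted, clean) sentence pairs. The following is a rough sketch using a token-level diff from Python's standard `difflib`; the tokenizer, function names, and example sentences are assumptions for illustration and do not reproduce SyntErr's own tooling. It reports both directions by what the gold edit does, so mapping them onto "insertion" and "deletion" errors depends on the naming convention in use.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def comma_directions(pairs):
    """Count comma edits by the direction of the gold correction."""
    counts = Counter()
    for corrupted, clean in pairs:
        a, b = tokenize(corrupted), tokenize(clean)
        matcher = SequenceMatcher(a=a, b=b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("delete", "replace"):
                # Comma present in the corrupted text only: gold edit removes it.
                counts["gold_removes_comma"] += a[i1:i2].count(",")
            if tag in ("insert", "replace"):
                # Comma present in the clean text only: gold edit inserts it.
                counts["gold_inserts_comma"] += b[j1:j2].count(",")
    return counts


# Hypothetical pairs, not drawn from SyntErr or LoRuGEC.
pairs = [
    ("Я думаю что он ушёл", "Я думаю, что он ушёл"),
    ("Он, пришёл домой", "Он пришёл домой"),
]
counts = comma_directions(pairs)
```

Dividing one count by the other gives the skew. With the rounded shares quoted above, 78 / 22 is about 3.5, so the 3.6× figure presumably comes from unrounded counts.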

A brief continuation reverses it

The suppressed rules can be recovered by training for a short pass on real examples that exhibit the missing direction. With just 348 LoRuGEC sentences, subordinate-clause comma accuracy goes from 1% back to 69%, while the rules that improved through SyntErr stay where they were.

Figure: Slope chart of subordinate-clause comma accuracy, from 1% to 69% after continuation. Suppressed rules recover with a 348-example continuation pass.
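The claim has two halves: the suppressed rule recovers, and the other rules do not move. One way to check both at once is to compare per-rule scores before and after the continuation pass. This is a sketch only; the function name, the tolerance, and the `other_rule` entry are hypothetical, while the 1% and 69% values are the ones quoted above.

```python
def compare_rules(before, after, tol=0.02):
    """Split rules into recovered, regressed, and stable by score change."""
    shared = set(before) & set(after)
    recovered = {r for r in shared if after[r] - before[r] > tol}
    regressed = {r for r in shared if before[r] - after[r] > tol}
    stable = shared - recovered - regressed
    return recovered, regressed, stable


# Per-rule accuracy after SyntErr fine-tuning vs. after the continuation pass.
after_synthetic = {"subordinate_clause_comma": 0.01, "other_rule": 0.80}
after_continuation = {"subordinate_clause_comma": 0.69, "other_rule": 0.80}

recovered, regressed, stable = compare_rules(after_synthetic, after_continuation)
```

An empty `regressed` set is what "stay where they were" would look like under this check; the tolerance should be chosen to match the noise level of the evaluation.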

Takeaway

Aggregate metrics make synthetic-data fine-tuning look like a uniform improvement. Per-rule evaluation makes the directional damage visible. Once visible, it is fixable. The rule taxonomy is the diagnostic, and the diagnostic points to the fix.
