← BEA 2026 paper

The argument in four panels

A graphical walk through the paper's claim. Each panel is one beat: aggregate metrics rise, specific rules silently collapse, the cause is a directional bias in the synthetic data, and a brief continuation pass on real examples reverses it.

1

Aggregate gets better

Across eight open models from 0.8B to 12B parameters, fine-tuning on SyntErr-generated synthetic data raises overall F0.5 on LoRuGEC. The story you would tell from the leaderboard is that synthetic data works.

Figure: aggregate F0.5 rises with model size while one specific rule drops. Grey: aggregate F0.5. Red: subordinate-clause comma rule (Panel 2).
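For readers unfamiliar with the metric: F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below assumes edit-level true-positive, false-positive, and false-negative counts, the usual setup in grammatical error correction. The function name and the example counts are illustrative and do not come from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from edit counts; beta < 1 favours precision over recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 50 correct edits, 10 spurious edits, 50 missed edits.
example_score = f_beta(tp=50, fp=10, fn=50)
```

Because a single score pools every edit type, it cannot show which rules contribute the gains.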
2

But specific rules collapse

The diagnostic, anchored in 98 rules from Rozental's reference grammar, breaks the aggregate apart. Most rules follow the upward trend. A handful go the other way: subordinate-clause comma accuracy drops from 14% to 1%, and parenthetical-comma accuracy falls similarly. The aggregate hides this because the rising rules outweigh the collapsing ones.

Figure: eight per-rule F0.5 curves across 8 model sizes; six go up, two go down. Grey: improves with fine-tuning. Red: regresses.
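The masking effect can be reproduced with a toy calculation. This sketch uses pooled accuracy rather than F0.5 to keep the arithmetic short. All rule names and item counts are hypothetical; only the 14% and 1% figures for the subordinate-clause comma rule echo the text above.

```python
from dataclasses import dataclass


@dataclass
class RuleCounts:
    n: int               # test items tagged with this rule (hypothetical)
    correct_before: int  # items corrected before fine-tuning
    correct_after: int   # items corrected after fine-tuning


rules = {
    "frequent_rule_a": RuleCounts(n=400, correct_before=160, correct_after=280),
    "frequent_rule_b": RuleCounts(n=300, correct_before=120, correct_after=210),
    "subordinate_clause_comma": RuleCounts(n=100, correct_before=14, correct_after=1),
}


def aggregate_accuracy(rules: dict, after: bool) -> float:
    """Pool all items across rules, as a leaderboard number would."""
    total = sum(r.n for r in rules.values())
    correct = sum(r.correct_after if after else r.correct_before for r in rules.values())
    return correct / total


def regressed_rules(rules: dict) -> list:
    """Names of rules whose accuracy fell after fine-tuning."""
    return sorted(name for name, r in rules.items() if r.correct_after < r.correct_before)


agg_before = aggregate_accuracy(rules, after=False)
agg_after = aggregate_accuracy(rules, after=True)
regressed = regressed_rules(rules)
```

With these made-up counts the pooled score rises even though one rule has almost entirely collapsed; only the per-rule view reports it.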
3

The mechanism is directional skew

The training data is not symmetric: SyntErr generates comma insertion errors 3.6× more often than comma deletion errors. A model trained on this skew learns a one-sided rule: "preserve commas". On LoRuGEC test items where the gold edit removes a comma, the model leaves the comma in place.

Figure: bar chart of comma error directions in SyntErr's output, 78% comma insertions vs 22% comma deletions in the training data. The 3.6× skew explains the suppression.
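A skew of this kind can be measured directly from parallel (erroneous, corrected) sentence pairs. The sketch below is not SyntErr's own tooling. It assumes that direction is named after the correcting edit, so "insert" means the correction adds a comma, and it uses a naive regex tokenizer. The sentence pairs are invented English stand-ins.

```python
import difflib
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def comma_edit_directions(source: str, target: str) -> Counter:
    """Count commas that the source -> target correction inserts or deletes."""
    src, tgt = tokenize(source), tokenize(target)
    counts = Counter()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed = src[i1:i2].count(",")
        added = tgt[j1:j2].count(",")
        counts["insert"] += max(added - removed, 0)
        counts["delete"] += max(removed - added, 0)
    return counts


def direction_totals(pairs: list) -> Counter:
    """Sum comma edit directions over a corpus of (source, target) pairs."""
    totals = Counter()
    for source, target in pairs:
        totals.update(comma_edit_directions(source, target))
    return totals


# Hypothetical pairs: (erroneous sentence, corrected sentence).
pairs = [
    ("I know that he left", "I know, that he left"),
    ("She said that, he left", "She said that he left"),
    ("When it rains we stay", "When it rains, we stay"),
]
totals = direction_totals(pairs)
skew = totals["insert"] / totals["delete"] if totals["delete"] else float("inf")
```

Running the same count on a synthetic corpus and on the real test set would show whether the two directions are balanced in the same way.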
4

A brief continuation reverses it

The suppressed rules can be recovered with a short training pass on real examples that exhibit the missing direction. With just 348 LoRuGEC sentences, subordinate-clause comma accuracy rises from 1% to 69%, while the rules that improved through SyntErr stay where they were.

Figure: slope chart of subordinate-clause comma accuracy, from 1% to 69% after continuation. Suppressed rules recover with a 348-example continuation pass.
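One way to assemble such a continuation set is to filter real sentence pairs for the under-represented direction. This page does not say how the 348 sentences were chosen, so the filter below is only a plausible sketch. The helper names and the toy pairs are hypothetical, and the comma-count comparison is a deliberately crude proxy for "the gold edit removes a comma".

```python
import random


def removes_comma(source: str, target: str) -> bool:
    """True if the corrected sentence has fewer commas than the erroneous one."""
    return source.count(",") > target.count(",")


def select_continuation_set(pairs: list, budget: int = 348, seed: int = 0) -> list:
    """Sample up to `budget` real pairs whose correction removes a comma."""
    candidates = [pair for pair in pairs if removes_comma(*pair)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rng.shuffle(candidates)
    return candidates[:budget]


# Hypothetical pairs: (erroneous sentence, corrected sentence).
real_pairs = [
    ("She said that, he left", "She said that he left"),
    ("I know that he left", "I know, that he left"),
    ("The tall, man waved", "The tall man waved"),
]
continuation = select_continuation_set(real_pairs, budget=2)
```

The resulting pairs would then be used for a brief additional fine-tuning pass on top of the SyntErr-trained model.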

Takeaway

Aggregate metrics make synthetic-data fine-tuning look like a uniform improvement. Per-rule evaluation makes the directional damage visible. Once visible, it is fixable. The taxonomy is the diagnostic, and the diagnostic points to the fix.
