Reproducibility
The paper-release v4 dataset is pinned, checksummed, and verified byte-identical against the trained-on file. To regenerate it from scratch you need:
- The pinned generation commit: 898814d. Later releases include changes that came after this commit, most notably the noun_case arc gate. Generating against any release tag will not produce bit-identical output.
- The exact source corpora: documented in data/V4_DATA_PROVENANCE.md.
- Seed=42 in both build_v4_sources.py and generate_sft.py.
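The seed matters because any random draw is only repeatable when the generator starts from the same state. The sketch below is a generic illustration using Python's standard random module. It is not the code in build_v4_sources.py or generate_sft.py, and the function name sample_lines and the toy corpus are hypothetical.

```python
import random

def sample_lines(lines, k, seed):
    """Draw k items from lines using a private, seeded generator."""
    rng = random.Random(seed)  # isolated RNG; does not touch global state
    return rng.sample(lines, k)

# Toy stand-in for a source corpus (hypothetical data).
corpus = [f"sentence {i}" for i in range(1000)]

first = sample_lines(corpus, 5, seed=42)
second = sample_lines(corpus, 5, seed=42)
print(first == second)  # True: same seed and same input give the same draw
```

The guarantee only holds when the input is identical and in the same order, which is why the exact source corpora are listed alongside the seed.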
Verifying you have the right files
This checks SHA256 against data/v4_checksums.txt. Output should be:
Regenerating from scratch
```bash
git checkout 898814d

# 1. Mine scarce sentences
uv run python scripts/mine_scarce_sents.py …

# 2. Extract clean rublimp pool
uv run python scripts/extract_rublimp_pool.py …

# 3. Build source mix (150K, 60-40 pool/news split)
uv run python scripts/build_v4_sources.py \
  --output data/mixed_sources_v4.txt \
  --total 150000 --seed 42

# 4. Generate SFT
uv run python scripts/generate_sft.py \
  -i data/mixed_sources_v4.txt \
  -o data/qwen_sft_v4.jsonl \
  -n 50000 --seed 42 --depparse \
  --max-input 150000 --batch-size 128 \
  --balance-directions
```
The full step-by-step procedure (with corpus paths and benchmark exclusions) is in data/V4_DATA_PROVENANCE.md.
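Once regeneration finishes, the output can be compared against the pinned checksums. The following is a minimal sketch of such a check, not the repository's own verification command. It assumes data/v4_checksums.txt uses the common sha256sum layout (hex digest, whitespace, relative path), which the source does not confirm. The helper names sha256_of and verify are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(checksum_file, root="."):
    """Compare each listed file against its expected digest.

    Assumes sha256sum-style lines: "<hex digest>  <relative path>".
    Returns a dict mapping path -> True (match) or False (mismatch).
    """
    results = {}
    for line in Path(checksum_file).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        results[name] = sha256_of(Path(root) / name) == expected.lower()
    return results

# Self-contained demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sample.jsonl").write_bytes(b'{"id": 1}\n')
    good = sha256_of(root / "sample.jsonl")
    (root / "checksums.txt").write_text(f"{good}  sample.jsonl\n")
    print(verify(root / "checksums.txt", root))  # {'sample.jsonl': True}
```

Against the real repository, and only if the format assumption holds, the call would be verify("data/v4_checksums.txt") from the directory the listed paths are relative to. Any False entry means the regenerated file is not byte-identical to the pinned one.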
What the v4 dataset is
- 39,209 SFT examples across the synterr handler set
- Source mix: 54,823 scarce-form-mined sentences + 57,106 RuBLiMP pool + 38,071 Taiga news
- No RuBLiMP benchmark contamination (excluded at build time)
- Direction-balanced for split / merge / insert / delete handlers
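The source-mix figures can be cross-checked against the build command with a little arithmetic. Reading the 60-40 pool/news split as applying to the sentences left over after scarce-form mining is an inference from the numbers, not something the source states.

```python
# Figures copied from the source-mix bullet above.
scarce = 54_823  # scarce-form-mined sentences
pool = 57_106    # RuBLiMP pool
news = 38_071    # Taiga news

total = scarce + pool + news
print(total)  # 150000, matching --total 150000

# Assumption: the 60-40 split is taken over the non-scarce remainder.
remainder = pool + news
print(remainder)                   # 95177
print(round(pool / remainder, 3))  # 0.6
print(round(news / remainder, 3))  # 0.4
```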
Citing
When citing the dataset specifically (vs. the synterr tool), reference
the paper and the pinned generation commit 898814d.