We took a small pre-trained language model and taught it to translate English to Japanese by showing it thousands of example translations. The question: does this extra training actually make the translations better?
S0 baseline vs. S1 fine-tuned model on the held-out test set (test_v1):

| Metric | S0 | S1 | Delta |
|---|---|---|---|
| Avg Latency (ms) | 169.8 | 258.6 | +88.8 |
| BLEU (char) | 16.29 | 17.64 | +1.35 |
| chrF++ | 12.76 | 12.91 | +0.15 |
| COMET | 0.7291 | 0.7303 | +0.0012 |
| Terminology Accuracy | 0.1463 | 0.1454 | -0.0009 |

| Metric | S0 Dev | S1 Run A | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 24.23 | +2.17 |
| chrF++ | 17.05 | 17.66 | +0.61 |
| COMET | 0.7807 | 0.7858 | +0.0051 |
| Terminology Accuracy | 0.2162 | 0.2095 | -0.0067 |

| Metric | S0 Dev | S1 Run B | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 23.24 | +1.18 |
| chrF++ | 17.05 | 17.21 | +0.16 |
| COMET | 0.7807 | 0.7816 | +0.0009 |
| Terminology Accuracy | 0.2162 | 0.2141 | -0.0021 |

| Metric | S0 Dev | S1 Run D | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 5.26 | -16.80 |
| chrF++ | 17.05 | 10.62 | -6.43 |
| COMET | 0.7807 | 0.6145 | -0.1662 |
| Terminology Accuracy | 0.2162 | 0.2257 | +0.0095 |
The translations got slightly better at looking like the reference, but not at actually meaning the right thing. Runs A and B improved BLEU (how similar the output looks to the reference), but COMET (whether the meaning is preserved) barely moved: only +0.0012 on test. The model learned to mimic the style of the training data without gaining deeper understanding.
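To see why a surface-overlap metric can rise while meaning stays flat, here is a minimal clipped character n-gram precision, the core ingredient of char-level BLEU, greatly simplified (real BLEU combines several n-gram orders, a brevity penalty, and corpus-level aggregation; this function is an illustration, not the project's scorer):

```python
from collections import Counter

def char_ngram_precision(hypothesis: str, reference: str, n: int = 2) -> float:
    """Clipped character n-gram precision: the fraction of hypothesis
    n-grams that also appear in the reference (counts clipped so a
    repeated n-gram can't be credited more times than it occurs)."""
    hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    if not hyp:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / sum(hyp.values())

# A hypothesis can copy surface patterns from references (high overlap)
# while garbling the meaning; overlap metrics cannot tell the difference.
print(char_ngram_precision("パスワードを入力", "パスワードを入力してください", n=2))
```

A model that learns the register and phrasing of the training data will score better on this kind of overlap even when a meaning-aware metric like COMET sees no improvement.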
The model got worse at using the right technical terms. Even after training on 24K translation pairs, the model didn’t learn to use our approved glossary terms like “パスワード” for “password”. Terminology accuracy actually dropped in Runs A and B. The training data taught general fluency, not domain-specific vocabulary.
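Terminology accuracy of this kind is typically computed as: of the source sentences containing a glossary term, in what fraction does the output use the approved translation? A minimal sketch of such a check (the glossary entries besides "password" and the exact substring-matching rule are illustrative assumptions, not the project's actual scorer):

```python
def terminology_accuracy(pairs, glossary):
    """For each (source, output) pair, count glossary entries whose English
    term occurs in the source; score a hit when the approved Japanese term
    occurs in the output. Returns hits / opportunities."""
    hits = opportunities = 0
    for source, output in pairs:
        for en_term, ja_term in glossary.items():
            if en_term.lower() in source.lower():
                opportunities += 1
                if ja_term in output:
                    hits += 1
    return hits / opportunities if opportunities else 0.0

# "password" -> "パスワード" is from the post; the "server" entry is made up.
glossary = {"password": "パスワード", "server": "サーバー"}
pairs = [
    ("Enter your password", "パスワードを入力してください"),  # hit
    ("Restart the server", "サーバを再起動してください"),      # miss: unapproved variant
]
print(terminology_accuracy(pairs, glossary))  # 0.5
```

The second pair shows the failure mode: the output is fluent Japanese, but uses a variant ("サーバ") instead of the approved glossary form ("サーバー"), so it counts as a miss even though a human would call the translation fine.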
We found we couldn't have both fluency and terminology. In Run D, we changed the training so the model only learned from the Japanese output (the English input was excluded from the loss calculation). This was the only run that improved terminology, but it broke everything else. BLEU dropped from 22 to 5. The model got better at producing Japanese words but forgot how to connect them to the English meaning.
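Assistant-only masking is commonly implemented by setting the label ids of prompt tokens to -100, the index that cross-entropy losses ignore, so gradients flow only from the target-side tokens. A framework-free sketch (the token ids are fake and the -100 convention follows the usual Hugging Face style; this is an assumption about the setup, not the project's exact code):

```python
IGNORE_INDEX = -100  # conventional "ignore" label id for cross-entropy losses

def mask_labels(input_ids, prompt_len, assistant_only):
    """Build training labels from input_ids.

    full-sequence loss: every token contributes (labels == input_ids).
    assistant-only loss: prompt tokens are set to IGNORE_INDEX, so only
    the target-side (Japanese) tokens contribute to the loss.
    """
    labels = list(input_ids)
    if assistant_only:
        for i in range(min(prompt_len, len(labels))):
            labels[i] = IGNORE_INDEX
    return labels

ids = [101, 102, 103, 201, 202]  # 3 prompt tokens, 2 target tokens
print(mask_labels(ids, 3, assistant_only=False))  # [101, 102, 103, 201, 202]
print(mask_labels(ids, 3, assistant_only=True))   # [-100, -100, -100, 201, 202]
```

Note the model still attends to the English prompt during the forward pass; what changes is that none of the loss comes from predicting the English tokens, which shifts all of the learning signal onto producing Japanese.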
Training a model this small is like walking a tightrope. Run B crashed partway through training with NaN errors (the loss turned into not-a-number, making further gradient updates meaningless). With only 500 million parameters, there's very little room for the model to learn new things without destabilising what it already knows.
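A common guard against this kind of failure is to check gradients before each optimiser step: clip their global norm (the run table below uses a clip of 0.1) and skip any step whose gradients are non-finite. A minimal sketch with plain Python lists standing in for gradient tensors (an assumed pattern, not the project's training loop):

```python
import math

def clip_and_check(grads, max_norm=0.1):
    """Return (clipped_grads, ok). ok is False when any gradient is
    non-finite, in which case the optimiser step should be skipped."""
    if any(not math.isfinite(g) for g in grads):
        return grads, False                 # NaN/inf gradients: skip this step
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]  # rescale so the norm is max_norm
    return grads, True

grads, ok = clip_and_check([0.3, 0.4])      # global norm 0.5 > 0.1: clipped
print(ok, sum(g * g for g in grads))
_, ok = clip_and_check([float("nan"), 0.2])
print(ok)  # False
```

Clipping keeps individual updates small, but as Run B showed, a tight clip alone does not guarantee stability once the loss itself goes non-finite.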
Bottom line: fine-tuning a small model gave us small improvements. The real gains came from giving the model access to external knowledge (S2 retrieval) and letting a bigger model make decisions (S3 agentic).
Run configurations:

| Parameter | Run A | Run B | Run D |
|---|---|---|---|
| Learning Rate | 5e-6 | 3e-6 | 5e-6 |
| LoRA Rank / Alpha | 16 / 16 | 16 / 16 | 16 / 16 |
| Loss Masking | full sequence | full sequence | assistant-only |
| Max Steps | 600 | 600 | 300 |
| Epochs | 1 | 1 | 1 |
| Gradient Clipping | 0.1 | 0.1 | 0.1 |
| Result | Best overall | Late NaN | Term ↑ Fluency ↓↓ |

Run log:

| Phase | What happened |
|---|---|
| Setup | Established S0 baseline with Qwen 2.5-0.5B-Instruct. Configured QLoRA (rank 16, NF4, lr 5e-6). Froze splits: train_v2 (24K), dev_v2 (2,668), test_v1 (4,254). |
| Run A | Best dev metrics: BLEU +2.17, COMET +0.0051. Stable training, no NaN. Terminology accuracy declined slightly (-0.0067). Promoted to test_v1 evaluation. |
| Run B | Lowered lr to 3e-6 hoping for stability. Underperformed Run A on all metrics and developed NaN losses at late checkpoints. Rejected. |
| Run D | Switched to assistant-only loss masking. Only run to improve terminology (+0.0095). But BLEU collapsed to 5.26 and COMET to 0.6145. Confirmed the fluency-vs-terminology tradeoff. |
| Conclusion | 0.5B fine-tuning hit a ceiling. A final stable rerun (same base config as Run A) was used as S1. Motivated pivot to retrieval (S2) and agentic (S3) approaches. |
(An additional run configuration existed (finetune_error_id.yaml) but was not part of the locked S0/S1 rerun path.)
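For concreteness, a run configuration in the style the table above implies might look like the following (the key names are hypothetical; only the values come from the Run A column and the setup notes):

```yaml
# Hypothetical run config in the spirit of Run A (key names are illustrative)
model: Qwen/Qwen2.5-0.5B-Instruct
quantization: nf4            # QLoRA: 4-bit base weights
lora:
  rank: 16
  alpha: 16
learning_rate: 5.0e-6
max_steps: 600
epochs: 1
gradient_clipping: 0.1
loss_masking: full_sequence  # Run D used assistant_only
```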