S0 → S1: Fine-Tuning

We took a small pre-trained language model and taught it to translate English to Japanese by showing it thousands of example translations. The question: does this extra training actually make the translations better?

How S1 Was Built from S0
QLoRA Supervised Fine-Tuning Pipeline
S0 — Prompt-Only Baseline
Qwen 2.5-0.5B-Instruct straight out of the box. We just gave it a prompt saying “translate this to Japanese” — no extra training at all. This is our control group.
S1 — QLoRA Fine-Tuned
Same model, but we added a small trainable layer on top (called a LoRA adapter) and trained it on 24,000 English-Japanese translation pairs. Think of it like adding a specialised lens to a camera — the base model stays frozen, we only train the lens.
Training Config (Final Stable Rerun)
LoRA rank 16 · alpha 16 · dropout 0.05 · lr 5e-6 · batch 8 · grad accum 4 · 1 epoch / 400 max steps · max seq 2048 · early stopping patience 1 · eval every 100 steps · 4-bit NF4 · bfloat16 compute · gradient clipping 0.1
LoRA Target Modules
q_proj · k_proj · v_proj · o_proj · gate_proj · up_proj · down_proj
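Collected in one place, the configuration above can be sketched as a plain Python dict. The field names are illustrative rather than any framework's exact API; the one derived number worth noting is the effective batch size (per-device batch × gradient accumulation).

```python
# Hyperparameters from the final stable rerun, gathered for reference.
# Key names are illustrative; they mirror common QLoRA/PEFT settings,
# not a specific library's exact field names.
qlora_config = {
    "lora_rank": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "learning_rate": 5e-6,
    "per_device_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "num_epochs": 1,
    "max_steps": 400,
    "max_seq_length": 2048,
    "eval_every_steps": 100,
    "early_stopping_patience": 1,
    "quantization": "4-bit NF4",
    "compute_dtype": "bfloat16",
    "max_grad_norm": 0.1,  # gradient clipping
}

# The optimiser applies one update per accumulated batch:
effective_batch = (qlora_config["per_device_batch_size"]
                   * qlora_config["gradient_accumulation_steps"])
print(effective_batch)  # 32
```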
Understanding the Benchmarks
BLEU (char)
Does the output look like the reference? Compares the model’s translation character-by-character against the “correct” answer. Higher = closer match. Good for catching obvious mistakes, but a perfectly valid translation that uses different wording will score low.
chrF++
A more forgiving version of BLEU. Instead of requiring exact matches, it gives partial credit for similar characters and word patterns. Especially useful for Japanese where the same meaning can be written in multiple ways (kanji, hiragana, katakana).
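The character n-gram idea behind chrF can be sketched in a few lines. This is a simplification: the real chrF++ uses character n-grams up to order 6, mixes in word unigrams and bigrams, and weights recall more heavily via β = 2, but the partial-credit mechanism is the same.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_f_score(hyp, ref, max_n=3, beta=2.0):
    """Simplified chrF-style score: averaged character n-gram
    precision and recall, combined as an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p = sum(precisions) / max_n
    rc = sum(recalls) / max_n
    if p + rc == 0:
        return 0.0
    return (1 + beta**2) * p * rc / (beta**2 * p + rc)

print(char_f_score("パスワード", "パスワード"))  # identical strings score 1.0
```

Partial overlaps score between 0 and 1, which is why a translation written with different but related wording is not penalised as harshly as under exact-match BLEU.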
COMET
Does the translation actually mean the same thing? An AI model trained on human judgements that scores whether the meaning is preserved, not just the wording. This is our most important metric. A translation can score low on BLEU (different words) but high on COMET (same meaning) — that’s actually fine.
Terminology Accuracy
Did it use the right technical terms? We have a glossary saying e.g. “password” must be translated as “パスワード”, not “暗証番号”. This metric checks if the model used the approved term. A translation can sound natural but use the wrong industry term — that’s a real problem in professional localisation.
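A glossary-compliance check of this kind can be sketched as follows. The `terminology_accuracy` helper and its one-entry glossary are hypothetical illustrations, not the project's actual scorer: whenever a glossary source term appears in the English input, the approved Japanese term must appear in the output.

```python
# Approved term from the glossary example above: "password" must become
# パスワード, not 暗証番号. (One-entry glossary for illustration only.)
GLOSSARY = {"password": "パスワード"}

def terminology_accuracy(pairs, glossary=GLOSSARY):
    """pairs: list of (english_source, japanese_output) tuples."""
    hits = total = 0
    for src, out in pairs:
        for en_term, ja_term in glossary.items():
            if en_term in src.lower():
                total += 1
                hits += ja_term in out  # approved term present?
    return hits / total if total else 1.0

pairs = [
    ("Enter your password.", "パスワードを入力してください。"),    # approved term
    ("Reset your password.", "暗証番号をリセットしてください。"),  # wrong term
]
print(terminology_accuracy(pairs))  # 0.5
```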
S0 vs S1 Benchmark — test_v1
4,254 samples · Qwen 2.5-0.5B-Instruct · test_v1.jsonl
| Metric | S0 | S1 | Delta |
|---|---|---|---|
| Avg Latency (ms) | 169.8 | 258.6 | +88.8 |
| BLEU (char) | 16.29 | 17.64 | +1.35 |
| chrF++ | 12.76 | 12.91 | +0.15 |
| COMET | 0.7291 | 0.7303 | +0.0012 |
| Terminology Accuracy | 0.1463 | 0.1454 | -0.0009 |
S0 vs S1 Side by Side
Normalized scores · higher is better
Training Runs — dev_v2
Run A — 2,668 dev samples · best run

| Metric | S0 Dev | S1 Run A | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 24.23 | +2.17 |
| chrF++ | 17.05 | 17.66 | +0.61 |
| COMET | 0.7807 | 0.7858 | +0.0051 |
| Terminology Accuracy | 0.2162 | 0.2095 | -0.0067 |
Run B — 2,668 dev samples · lr 3e-6

| Metric | S0 Dev | S1 Run B | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 23.24 | +1.18 |
| chrF++ | 17.05 | 17.21 | +0.16 |
| COMET | 0.7807 | 0.7816 | +0.0009 |
| Terminology Accuracy | 0.2162 | 0.2141 | -0.0021 |
Run D — 2,668 dev samples · assistant-only loss

| Metric | S0 Dev | S1 Run D | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 5.26 | -16.80 |
| chrF++ | 17.05 | 10.62 | -6.43 |
| COMET | 0.7807 | 0.6145 | -0.1662 |
| Terminology Accuracy | 0.2162 | 0.2257 | +0.0095 |
What We Learned

The translations got slightly better at looking like the reference, but not at actually meaning the right thing. Runs A and B improved BLEU (how similar the output looks to the reference), but COMET (whether the meaning is correct) barely moved — only +0.0012 on test. The model learned to mimic the style of the training data without deeper understanding.

The model got worse at using the right technical terms. Even after training on 24K translation pairs, the model didn’t learn to use our approved glossary terms like “パスワード” for “password”. Terminology accuracy actually dropped in Runs A and B. The training data taught general fluency, not domain-specific vocabulary.

We found we couldn't have both fluency and terminology accuracy at once. In Run D, we changed the training so the model only learned from the Japanese output (ignoring the English input during loss calculation). This was the only run that improved terminology — but it broke everything else. BLEU dropped from 22 to 5. The model got better at producing Japanese words but forgot how to connect them to the English meaning.

Training a model this small is like walking a tightrope. Run B crashed partway through training with NaN errors (the math literally broke). With only 500 million parameters, there’s very little room for the model to learn new things without destabilising what it already knows.

Bottom line: fine-tuning a small model gave us small improvements. The real gains came from giving the model access to external knowledge (S2 retrieval) and letting a bigger model make decisions (S3 agentic).

What Changed Between Runs
Hyperparameter differences across training runs
| Parameter | Run A | Run B | Run D |
|---|---|---|---|
| Learning Rate | 5e-6 | 3e-6 | 5e-6 |
| LoRA Rank / Alpha | 16 / 16 | 16 / 16 | 16 / 16 |
| Loss Masking | full sequence | full sequence | assistant-only |
| Max Steps | 600 | 600 | 300 |
| Epochs | 1 | 1 | 1 |
| Gradient Clipping | 0.1 | 0.1 | 0.1 |
| Result | Best overall | Late NaN | Term ↑ Fluency ↓↓ |
Run A — baseline S1 config
  • Changed: nothing — this was the first stable run using the base config
  • Result: best dev metrics (BLEU +2.17, COMET +0.0051), stable training, no NaN
  • But: terminology accuracy dropped -0.0067
  • A separate final stable rerun (using the same base config) was used for the test_v1 S1 evaluation
Run B — lower learning rate
  • Changed: learning rate 5e-6 → 3e-6 (how fast the model learns per step — lower = slower, more cautious updates)
  • Why: Run A had worked, but earlier attempts had crashed mid-training. We thought slowing down the learning might prevent that
  • Result: worse on every metric (BLEU +1.18 vs +2.17) and still crashed with NaN anyway
  • Takeaway: learning slower didn’t help — the model just absorbed less from the training data in the same number of steps
Run D — assistant-only loss masking
  • Changed: how the model learns from each example. Normally it learns from the entire conversation (English input + Japanese output). In Run D, we told it to only learn from the Japanese output and ignore the English side during training
  • Why: we thought focusing purely on the Japanese would teach it better Japanese vocabulary and terminology
  • Win: it worked for terminology — the only run where the model used more correct glossary terms (+0.0095)
  • Cost: but the translations became nearly unusable. BLEU dropped from 22 to 5 and COMET from 0.78 to 0.61
  • Takeaway: the model could still see the English text, but since it wasn’t being trained on that connection, it lost the ability to translate coherently. It learned better Japanese words but forgot what they were supposed to mean in context
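Assistant-only masking as described above can be sketched in a few lines, assuming the common Hugging Face convention where a label of -100 is ignored by the cross-entropy loss. The prompt tokens still appear in the model's input; they just contribute nothing to the gradient.

```python
# Label value conventionally skipped by the loss function (assumption:
# the standard Hugging Face / PyTorch ignore_index convention).
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids to labels, blanking out the prompt span so only
    the response (Japanese side) is trained on."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

# Toy example: 4 prompt tokens (English side), 3 response tokens
# (Japanese side). Token ids are made up for illustration.
tokens = [101, 102, 103, 104, 201, 202, 203]
labels = mask_prompt_labels(tokens, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 201, 202, 203]
```

With full-sequence loss (Runs A and B), `labels` would simply equal `tokens`, so the model is also trained to reproduce the English prompt — which is what preserved the English-to-Japanese connection that Run D lost.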
What is NaN instability?
  • What is NaN? It stands for “Not a Number”. During training, the model does millions of math operations. If one of those calculations produces an impossible result (like dividing by zero), it outputs NaN instead of a real number
  • Why is it a problem? Once NaN appears in one calculation, it spreads like a virus. Every subsequent calculation that touches it also becomes NaN. The model’s learned knowledge gets corrupted and it starts producing nonsense
  • What happened in our runs? Training would look completely normal for most of the run — loss going down, metrics improving. Then suddenly, without warning, the loss would spike to NaN and the model would be broken
  • Why is this tricky? Because the early checkpoints (snapshots we save during training) look fine. You think the run succeeded until you check the later checkpoints and realise they’re corrupted
  • Why does this happen with small models? We compressed the model to 4-bit precision to fit it on our hardware. This means every number is stored with less accuracy — like rounding to fewer decimal places. With less room for precision, small errors can snowball into NaN
  • How did we deal with it? We saved checkpoints frequently (every 100 steps), set a strict early-stopping rule, limited training to 300-600 steps, and used gradient clipping (capping how big each update can be). If a run hit NaN, we threw away the broken checkpoints and kept the last good one
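The checkpoint-and-rollback strategy can be sketched with fabricated loss values (the function and numbers below are illustrative, not the project's training loop):

```python
import math

def train_with_nan_guard(losses, checkpoint_every=2):
    """Walk through per-step losses; checkpoint periodically and, on the
    first NaN, roll back to the last healthy checkpoint step."""
    last_good = None
    for step, loss in enumerate(losses, start=1):
        if math.isnan(loss):
            return last_good  # discard everything after the last snapshot
        if step % checkpoint_every == 0:
            last_good = step  # pretend we saved a checkpoint here
    return last_good

# Loss looks healthy for most of the run, then spikes to NaN mid-training.
losses = [2.1, 1.8, 1.6, 1.5, float("nan"), float("nan")]
print(train_with_nan_guard(losses))  # 4  (last checkpoint before the NaN)
```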
Why other hyperparameters were held constant
  • LoRA rank (16): rank controls how much the adapter can learn. On a small 0.5B model, making the adapter too big (high rank) risks the model just memorising the training data instead of learning general patterns
  • Batch size (32 effective): this is how many examples the model sees before updating its weights. 32 was the most our hardware could handle — going higher would have required a bigger GPU
  • Max sequence length (2048): the maximum length of text the model can process at once. Most of our translation pairs were short (~100 characters), but some longer ones needed up to 500. We set a generous ceiling to be safe
  • Training length (1 epoch, 300-600 steps): we deliberately kept training short because longer runs crashed with NaN. One pass through half the data was safer than multiple passes through all of it
  • Dropout (0.05), clipping (0.1): these are safety mechanisms — dropout randomly disables parts of the model during training to prevent memorisation, clipping limits how big each weight update can be. Both were already set conservatively
  • Our approach: we changed one or two things per run so we could tell exactly what caused each result, rather than changing everything at once and guessing
Training Timeline
| Phase | What happened |
|---|---|
| Setup | Established S0 baseline with Qwen 2.5-0.5B-Instruct. Configured QLoRA (rank 16, NF4, lr 5e-6). Froze splits: train_v2 (24K), dev_v2 (2,668), test_v1 (4,254). |
| Run A | Best dev metrics: BLEU +2.17, COMET +0.0051. Stable training, no NaN. Terminology accuracy declined slightly (-0.0067). Promoted to test_v1 evaluation. |
| Run B | Lowered lr to 3e-6 hoping for stability. Underperformed Run A on all metrics. Developed late NaN at later checkpoints. Rejected. |
| Run D | Switched to assistant-only loss masking. Only run to improve terminology (+0.0095). But BLEU collapsed to 5.26 and COMET to 0.6145. Confirmed the fluency-vs-terminology tradeoff. |
| Conclusion | 0.5B fine-tuning hit a ceiling. A final stable rerun (same base config as Run A) was used as S1. Motivated pivot to retrieval (S2) and agentic (S3) approaches. |
Key Findings
Small but real gains
Fine-tuning improved BLEU and chrF++ consistently, but COMET gains were marginal (+0.0012 on test).
Terminology gap persists
Fine-tuning did not improve glossary compliance. Accuracy dropped slightly in every run (-0.0009 to -0.0067).
Fluency vs terminology tradeoff
Run D (assistant-only loss) improved terminology +0.0095 but destroyed fluency (BLEU -16.8). In our experiments, optimizing for both pulled in opposite directions.
Motivated the pivot to S2/S3
Fine-tuning alone hit a ceiling. Retrieval (S2) and agentic methods (S3) were needed to achieve real quality gains: COMET jumped from 0.73 to 0.95.
Frequently Asked Questions
Why 0.5B instead of the recommended 7B model?
  • The project kickoff recommended 7B-8B, but we switched to 0.5B because it was faster to iterate on and fit our local GPU constraints
  • Our available VPS did not have enough VRAM for 7B QLoRA training with the full dataset
  • The tradeoff is acknowledged in the report — 0.5B has less capacity for learning, which likely contributed to the flat S0 vs S1 results
Why QLoRA instead of full fine-tuning?
  • QLoRA was the intended S1 method from the project specification
  • It uses 4-bit quantization of the base model and only trains small adapter weights, reducing memory requirements significantly
  • We did not run a head-to-head full-fine-tuning baseline, so we cannot claim QLoRA was proven better — it was chosen for practical feasibility
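A back-of-envelope estimate shows why the 4-bit quantization matters even at 0.5B scale. This counts frozen base weights only and ignores activations, optimiser state, and quantisation overhead such as stored scales:

```python
def base_weight_gb(n_params, bits):
    """Memory for model weights alone, in GB (1e9 bytes)."""
    return n_params * bits / 8 / 1e9

half_precision = base_weight_gb(0.5e9, 16)  # bf16 baseline
nf4 = base_weight_gb(0.5e9, 4)              # 4-bit NF4
print(half_precision, nf4)  # 1.0 0.25
```

The same 4x reduction is what made 7B-scale QLoRA standard practice on consumer GPUs; in our case it left headroom for the adapter, gradients, and optimiser state on modest hardware.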
Why only 1 epoch with 300-600 step caps?
  • The 0.5B model showed late NaN instability on longer runs — training would look healthy and then suddenly collapse
  • Shorter, conservative runs with frequent evaluation (every 100 steps) and early stopping (patience 1) were more reliable
  • With 24K training examples and effective batch size 32, one full epoch is ~750 steps. Capping at 300-600 meant we used roughly half the data per run, but this was a deliberate stability tradeoff
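The step arithmetic above checks out as follows:

```python
# 24K training examples at an effective batch of 32
# (8 per device x 4 gradient accumulation steps).
examples = 24_000
effective_batch = 8 * 4
steps_per_epoch = examples // effective_batch
print(steps_per_epoch)  # 750

# The 300- and 600-step caps therefore cover 40%-80% of one epoch.
print(300 / steps_per_epoch, 600 / steps_per_epoch)  # 0.4 0.8
```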
Why not increase LoRA rank beyond 16?
  • On a 0.5B model, the adapter at rank 16 already represents a relatively large fraction of the model’s total parameters
  • Increasing rank risks overfitting to the training data, especially with only 24K examples
  • We did not run a rank-ablation study, so this is based on standard QLoRA guidance rather than empirical proof from our runs
Why not try a different base model (Llama, Gemma)?
  • The S0/S1 comparison was designed as a controlled experiment with one backbone model to isolate the effect of fine-tuning
  • Qwen 2.5 was chosen because it has strong multilingual support including Japanese
  • Testing other base models was out of scope for this rerun — we cannot claim Qwen was proven superior, only that it was a reasonable choice for EN-JA translation
Why no separate error-identification adapter?
  • The project spec recommended separate translation and error-ID adapters “if time allows”
  • We prioritised getting the translation adapter working and stable first
  • A config file for error-ID training exists (finetune_error_id.yaml) but was not part of the locked S0/S1 rerun path
What would you do differently with more time?
  • Rerun at the originally recommended 7B/8B scale on proper GPU hardware — larger models have more capacity for learning translation patterns
  • Build the separate error-identification adapter that was deferred
  • Run a LoRA rank ablation study (rank 8 vs 16 vs 32 vs 64) to find the optimal adapter size
  • Continue investing in retrieval and agentic approaches, since the results show those delivered the largest quality gains
How does S1 compare to just using ChatGPT or Claude directly?
  • We did not run a direct prompt-only ChatGPT or Claude baseline for S0/S1, so we cannot make a clean comparison
  • The closest evidence is S3, which uses Claude Sonnet 4.6 in an agentic RAG pipeline and achieves much higher quality (COMET 0.9552 vs S1’s 0.7303)
  • However, S3 is not apples-to-apples — it uses additional tools, retrieval, and self-audit that a plain ChatGPT/Claude prompt would not have
  • The honest answer: a SOTA model with good prompting would likely outperform a fine-tuned 0.5B model, but the point of S0/S1 was to study what fine-tuning alone can do at small scale