S0 → S1: Fine-Tuning

We took a small pre-trained language model and taught it to translate English to Japanese by showing it thousands of example translations. The question: does this extra training actually make the translations better?

How S1 Was Built from S0
QLoRA Supervised Fine-Tuning Pipeline
S0 — Prompt-Only Baseline
Qwen 2.5-0.5B-Instruct straight out of the box. We just gave it a prompt saying “translate this to Japanese” — no extra training at all. This is our control group.
S1 — QLoRA Fine-Tuned
Same model, but we added a small trainable layer on top (called a LoRA adapter) and trained it on 24,000 English-Japanese translation pairs. Think of it like adding a specialised lens to a camera — the base model stays frozen, we only train the lens.
Training Config (Final Stable Rerun)
LoRA rank 16 · alpha 16 · dropout 0.05 · lr 5e-6 · batch 8 · grad accum 4 · 1 epoch / 400 max steps · max seq 2048 · early stopping patience 1 · eval every 100 steps · 4-bit NF4 · bfloat16 compute · gradient clipping 0.1
LoRA Target Modules
q_proj · k_proj · v_proj · o_proj · gate_proj · up_proj · down_proj
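Collected in one place, the configuration above can be sketched as a plain Python dict. The field names are illustrative rather than any framework's exact API; the one derived number worth noting is the effective batch size (per-device batch × gradient accumulation).

```python
# Hyperparameters from the final stable rerun, gathered for reference.
# Key names are illustrative; they mirror common QLoRA/PEFT settings,
# not a specific library's exact field names.
qlora_config = {
    "lora_rank": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "learning_rate": 5e-6,
    "per_device_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "num_epochs": 1,
    "max_steps": 400,
    "max_seq_length": 2048,
    "eval_every_steps": 100,
    "early_stopping_patience": 1,
    "quantization": "4-bit NF4",
    "compute_dtype": "bfloat16",
    "max_grad_norm": 0.1,  # gradient clipping
}

# The optimiser applies one update per accumulated batch:
effective_batch = (qlora_config["per_device_batch_size"]
                   * qlora_config["gradient_accumulation_steps"])
print(effective_batch)  # 32
```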
Understanding the Benchmarks
BLEU (char)
Does the output look like the reference? Compares the model’s translation character-by-character against the “correct” answer. Higher = closer match. Good for catching obvious mistakes, but a perfectly valid translation that uses different wording will score low.
chrF++
A more forgiving version of BLEU. Instead of requiring exact matches, it gives partial credit for similar characters and word patterns. Especially useful for Japanese where the same meaning can be written in multiple ways (kanji, hiragana, katakana).
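The character n-gram idea behind chrF can be sketched in a few lines. This is a simplification: the real chrF++ uses character n-grams up to order 6, mixes in word unigrams and bigrams, and weights recall more heavily via β = 2, but the partial-credit mechanism is the same.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_f_score(hyp, ref, max_n=3, beta=2.0):
    """Simplified chrF-style score: averaged character n-gram
    precision and recall, combined as an F-beta score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p = sum(precisions) / max_n
    rc = sum(recalls) / max_n
    if p + rc == 0:
        return 0.0
    return (1 + beta**2) * p * rc / (beta**2 * p + rc)

print(char_f_score("パスワード", "パスワード"))  # identical strings score 1.0
```

Partial overlaps score between 0 and 1, which is why a translation written with different but related wording is not penalised as harshly as under exact-match BLEU.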
COMET
Does the translation actually mean the same thing? An AI model trained on human judgements that scores whether the meaning is preserved, not just the wording. This is our most important metric. A translation can score low on BLEU (different words) but high on COMET (same meaning) — that’s actually fine.
Terminology Accuracy
Did it use the right technical terms? We have a glossary saying e.g. “password” must be translated as “パスワード”, not “暗証番号”. This metric checks if the model used the approved term. A translation can sound natural but use the wrong industry term — that’s a real problem in professional localisation.
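A glossary-compliance check of this kind can be sketched as follows. The `terminology_accuracy` helper and its one-entry glossary are hypothetical illustrations, not the project's actual scorer: whenever a glossary source term appears in the English input, the approved Japanese term must appear in the output.

```python
# Approved term from the glossary example above: "password" must become
# パスワード, not 暗証番号. (One-entry glossary for illustration only.)
GLOSSARY = {"password": "パスワード"}

def terminology_accuracy(pairs, glossary=GLOSSARY):
    """pairs: list of (english_source, japanese_output) tuples."""
    hits = total = 0
    for src, out in pairs:
        for en_term, ja_term in glossary.items():
            if en_term in src.lower():
                total += 1
                hits += ja_term in out  # approved term present?
    return hits / total if total else 1.0

pairs = [
    ("Enter your password.", "パスワードを入力してください。"),    # approved term
    ("Reset your password.", "暗証番号をリセットしてください。"),  # wrong term
]
print(terminology_accuracy(pairs))  # 0.5
```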
S0 vs S1 Benchmark — test_v1
4,254 samples · Qwen 2.5-0.5B-Instruct · test_v1.jsonl
| Metric | S0 | S1 | Delta |
|---|---|---|---|
| Avg Latency (ms) | 169.8 | 258.6 | +88.8 |
| BLEU (char) | 16.29 | 17.64 | +1.35 |
| chrF++ | 12.76 | 12.91 | +0.15 |
| COMET | 0.7291 | 0.7303 | +0.0012 |
| Terminology Accuracy | 0.1463 | 0.1454 | -0.0009 |
S0 vs S1 Side by Side
Normalized scores · higher is better
Training Runs — dev_v2
Run A — 2,668 dev samples · best run

| Metric | S0 Dev | S1 Run A | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 24.23 | +2.17 |
| chrF++ | 17.05 | 17.66 | +0.61 |
| COMET | 0.7807 | 0.7858 | +0.0051 |
| Terminology Accuracy | 0.2162 | 0.2095 | -0.0067 |
Run B — 2,668 dev samples · lr 3e-6

| Metric | S0 Dev | S1 Run B | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 23.24 | +1.18 |
| chrF++ | 17.05 | 17.21 | +0.16 |
| COMET | 0.7807 | 0.7816 | +0.0009 |
| Terminology Accuracy | 0.2162 | 0.2141 | -0.0021 |
Run D — 2,668 dev samples · assistant-only loss

| Metric | S0 Dev | S1 Run D | Delta |
|---|---|---|---|
| BLEU (char) | 22.06 | 5.26 | -16.80 |
| chrF++ | 17.05 | 10.62 | -6.43 |
| COMET | 0.7807 | 0.6145 | -0.1662 |
| Terminology Accuracy | 0.2162 | 0.2257 | +0.0095 |
What We Learned

The translations got slightly better at looking like the reference, but not at actually meaning the right thing. Runs A and B improved BLEU (how similar the output looks to the reference), but COMET (whether the meaning is correct) barely moved — only +0.0012 on test. The model learned to mimic the style of the training data without deeper understanding.

The model got worse at using the right technical terms. Even after training on 24K translation pairs, the model didn’t learn to use our approved glossary terms like “パスワード” for “password”. Terminology accuracy actually dropped in Runs A and B. The training data taught general fluency, not domain-specific vocabulary.

We found we couldn't have both fluency and terminology accuracy at once. In Run D, we changed the training so the model only learned from the Japanese output (ignoring the English input during loss calculation). This was the only run that improved terminology — but it broke everything else. BLEU dropped from 22 to 5. The model got better at producing Japanese words but forgot how to connect them to the English meaning.

Training a model this small is like walking a tightrope. Run B crashed partway through training with NaN errors (the math literally broke). With only 500 million parameters, there’s very little room for the model to learn new things without destabilising what it already knows.

Bottom line: fine-tuning a small model gave us small improvements. The real gains came from giving the model access to external knowledge (S2 retrieval) and letting a bigger model make decisions (S3 agentic).

What Changed Between Runs
Hyperparameter differences across training runs
| Parameter | Run A | Run B | Run D |
|---|---|---|---|
| Learning Rate | 5e-6 | 3e-6 | 5e-6 |
| LoRA Rank / Alpha | 16 / 16 | 16 / 16 | 16 / 16 |
| Loss Masking | full sequence | full sequence | assistant-only |
| Max Steps | 600 | 600 | 300 |
| Epochs | 1 | 1 | 1 |
| Gradient Clipping | 0.1 | 0.1 | 0.1 |
| Result | Best overall | Late NaN | Term ↑ Fluency ↓↓ |
Run A — baseline S1 config
  • Changed: nothing — this was the first stable run using the base config
  • Result: best dev metrics (BLEU +2.17, COMET +0.0051), stable training, no NaN
  • But: terminology accuracy dropped -0.0067
  • A separate final stable rerun (using the same base config) was used for the test_v1 S1 evaluation
Run B — lower learning rate
  • Changed: learning rate 5e-6 → 3e-6 (how fast the model learns per step — lower = slower, more cautious updates)
  • Why: Run A had worked, but earlier attempts had crashed mid-training. We thought slowing down the learning might prevent that
  • Result: worse on every metric (BLEU +1.18 vs +2.17) and still crashed with NaN anyway
  • Takeaway: learning slower didn’t help — the model just absorbed less from the training data in the same number of steps
Run D — assistant-only loss masking
  • Changed: how the model learns from each example. Normally it learns from the entire conversation (English input + Japanese output). In Run D, we told it to only learn from the Japanese output and ignore the English side during training
  • Why: we thought focusing purely on the Japanese would teach it better Japanese vocabulary and terminology
  • Win: it worked for terminology — the only run where the model used more correct glossary terms (+0.0095)
  • Cost: but the translations became nearly unusable. BLEU dropped from 22 to 5 and COMET from 0.78 to 0.61
  • Takeaway: the model could still see the English text, but since it wasn’t being trained on that connection, it lost the ability to translate coherently. It learned better Japanese words but forgot what they were supposed to mean in context
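Assistant-only masking as described above can be sketched in a few lines, assuming the common Hugging Face convention where a label of -100 is ignored by the cross-entropy loss. The prompt tokens still appear in the model's input; they just contribute nothing to the gradient.

```python
# Label value conventionally skipped by the loss function (assumption:
# the standard Hugging Face / PyTorch ignore_index convention).
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids to labels, blanking out the prompt span so only
    the response (Japanese side) is trained on."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

# Toy example: 4 prompt tokens (English side), 3 response tokens
# (Japanese side). Token ids are made up for illustration.
tokens = [101, 102, 103, 104, 201, 202, 203]
labels = mask_prompt_labels(tokens, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 201, 202, 203]
```

With full-sequence loss (Runs A and B), `labels` would simply equal `tokens`, so the model is also trained to reproduce the English prompt — which is what preserved the English-to-Japanese connection that Run D lost.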
What is NaN instability?
  • What is NaN? It stands for “Not a Number”. During training, the model does millions of math operations. If one of those calculations produces an impossible result (like dividing by zero), it outputs NaN instead of a real number
  • Why is it a problem? Once NaN appears in one calculation, it spreads like a virus. Every subsequent calculation that touches it also becomes NaN. The model’s learned knowledge gets corrupted and it starts producing nonsense
  • What happened in our runs? Training would look completely normal for most of the run — loss going down, metrics improving. Then suddenly, without warning, the loss would spike to NaN and the model would be broken
  • Why is this tricky? Because the early checkpoints (snapshots we save during training) look fine. You think the run succeeded until you check the later checkpoints and realise they’re corrupted
  • Why does this happen with small models? We compressed the model to 4-bit precision to fit it on our hardware. This means every number is stored with less accuracy — like rounding to fewer decimal places. With less room for precision, small errors can snowball into NaN
  • How did we deal with it? We saved checkpoints frequently (every 100 steps), set a strict early-stopping rule, limited training to 300-600 steps, and used gradient clipping (capping how big each update can be). If a run hit NaN, we threw away the broken checkpoints and kept the last good one
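The checkpoint-and-rollback strategy can be sketched with fabricated loss values (the function and numbers below are illustrative, not the project's training loop):

```python
import math

def train_with_nan_guard(losses, checkpoint_every=2):
    """Walk through per-step losses; checkpoint periodically and, on the
    first NaN, roll back to the last healthy checkpoint step."""
    last_good = None
    for step, loss in enumerate(losses, start=1):
        if math.isnan(loss):
            return last_good  # discard everything after the last snapshot
        if step % checkpoint_every == 0:
            last_good = step  # pretend we saved a checkpoint here
    return last_good

# Loss looks healthy for most of the run, then spikes to NaN mid-training.
losses = [2.1, 1.8, 1.6, 1.5, float("nan"), float("nan")]
print(train_with_nan_guard(losses))  # 4  (last checkpoint before the NaN)
```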
Why other hyperparameters were held constant
  • LoRA rank (16): rank controls how much the adapter can learn. On a small 0.5B model, making the adapter too big (high rank) risks the model just memorising the training data instead of learning general patterns
  • Batch size (32 effective): this is how many examples the model sees before updating its weights. 32 was the most our hardware could handle — going higher would have required a bigger GPU
  • Max sequence length (2048): the maximum length of text the model can process at once. Most of our translation pairs were short (~100 characters), but some longer ones needed up to 500. We set a generous ceiling to be safe
  • Training length (1 epoch, 300-600 steps): we deliberately kept training short because longer runs crashed with NaN. One pass through half the data was safer than multiple passes through all of it
  • Dropout (0.05), clipping (0.1): these are safety mechanisms — dropout randomly disables parts of the model during training to prevent memorisation, clipping limits how big each weight update can be. Both were already set conservatively
  • Our approach: we changed one or two things per run so we could tell exactly what caused each result, rather than changing everything at once and guessing
Training Timeline
| Phase | What happened |
|---|---|
| Setup | Established S0 baseline with Qwen 2.5-0.5B-Instruct. Configured QLoRA (rank 16, NF4, lr 5e-6). Froze splits: train_v2 (24K), dev_v2 (2,668), test_v1 (4,254). |
| Run A | Best dev metrics: BLEU +2.17, COMET +0.0051. Stable training, no NaN. Terminology accuracy declined slightly (-0.0067). Promoted to test_v1 evaluation. |
| Run B | Lowered lr to 3e-6 hoping for stability. Underperformed Run A on all metrics. Developed late NaN at later checkpoints. Rejected. |
| Run D | Switched to assistant-only loss masking. Only run to improve terminology (+0.0095). But BLEU collapsed to 5.26 and COMET to 0.6145. Confirmed the fluency-vs-terminology tradeoff. |
| Conclusion | 0.5B fine-tuning hit a ceiling. A final stable rerun (same base config as Run A) was used as S1. Motivated pivot to retrieval (S2) and agentic (S3) approaches. |
Key Findings
Small but real gains
Fine-tuning improved BLEU and chrF++ consistently, but COMET gains were marginal (+0.0012 on test).
Terminology gap persists
Fine-tuning did not improve glossary compliance. Accuracy dropped slightly in every run (-0.0009 to -0.0067).
Fluency vs terminology tradeoff
Run D (assistant-only loss) improved terminology +0.0095 but destroyed fluency (BLEU -16.8). In our experiments, optimizing for both pulled in opposite directions.
Motivated the pivot to S2/S3
Fine-tuning alone hit a ceiling. Retrieval (S2) and agentic methods (S3) were needed to achieve real quality gains: COMET jumped from 0.73 to 0.95.
Frequently Asked Questions
Why 0.5B instead of the recommended 7B model?
  • The project kickoff recommended 7B-8B, but we switched to 0.5B because it was faster to iterate on and fit our local GPU constraints
  • Our available VPS did not have enough VRAM for 7B QLoRA training with the full dataset
  • The tradeoff is acknowledged in the report — 0.5B has less capacity for learning, which likely contributed to the flat S0 vs S1 results
Why QLoRA instead of full fine-tuning?
  • QLoRA was the intended S1 method from the project specification
  • It uses 4-bit quantization of the base model and only trains small adapter weights, reducing memory requirements significantly
  • We did not run a head-to-head full-fine-tuning baseline, so we cannot claim QLoRA was proven better — it was chosen for practical feasibility
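A back-of-envelope estimate shows why the 4-bit quantization matters even at 0.5B scale. This counts frozen base weights only and ignores activations, optimiser state, and quantisation overhead such as stored scales:

```python
def base_weight_gb(n_params, bits):
    """Memory for model weights alone, in GB (1e9 bytes)."""
    return n_params * bits / 8 / 1e9

half_precision = base_weight_gb(0.5e9, 16)  # bf16 baseline
nf4 = base_weight_gb(0.5e9, 4)              # 4-bit NF4
print(half_precision, nf4)  # 1.0 0.25
```

The same 4x reduction is what made 7B-scale QLoRA standard practice on consumer GPUs; in our case it left headroom for the adapter, gradients, and optimiser state on modest hardware.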
Why only 1 epoch with 300-600 step caps?
  • The 0.5B model showed late NaN instability on longer runs — training would look healthy and then suddenly collapse
  • Shorter, conservative runs with frequent evaluation (every 100 steps) and early stopping (patience 1) were more reliable
  • With 24K training examples and effective batch size 32, one full epoch is ~750 steps. Capping at 300-600 meant we used roughly half the data per run, but this was a deliberate stability tradeoff
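The step arithmetic above checks out as follows:

```python
# 24K training examples at an effective batch of 32
# (8 per device x 4 gradient accumulation steps).
examples = 24_000
effective_batch = 8 * 4
steps_per_epoch = examples // effective_batch
print(steps_per_epoch)  # 750

# The 300- and 600-step caps therefore cover 40%-80% of one epoch.
print(300 / steps_per_epoch, 600 / steps_per_epoch)  # 0.4 0.8
```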
Why not increase LoRA rank beyond 16?
  • On a 0.5B model, the adapter at rank 16 already represents a relatively large fraction of the model’s total parameters
  • Increasing rank risks overfitting to the training data, especially with only 24K examples
  • We did not run a rank-ablation study, so this is based on standard QLoRA guidance rather than empirical proof from our runs
Why not try a different base model (Llama, Gemma)?
  • The S0/S1 comparison was designed as a controlled experiment with one backbone model to isolate the effect of fine-tuning
  • Qwen 2.5 was chosen because it has strong multilingual support including Japanese
  • Testing other base models was out of scope for this rerun — we cannot claim Qwen was proven superior, only that it was a reasonable choice for EN-JA translation
Why no separate error-identification adapter?
  • The project spec recommended separate translation and error-ID adapters “if time allows”
  • We prioritised getting the translation adapter working and stable first
  • A config file for error-ID training exists (finetune_error_id.yaml) but was not part of the locked S0/S1 rerun path
What would you do differently with more time?
  • Rerun at the originally recommended 7B/8B scale on proper GPU hardware — larger models have more capacity for learning translation patterns
  • Build the separate error-identification adapter that was deferred
  • Run a LoRA rank ablation study (rank 8 vs 16 vs 32 vs 64) to find the optimal adapter size
  • Continue investing in retrieval and agentic approaches, since the results show those delivered the largest quality gains
How does S1 compare to just using ChatGPT or Claude directly?
  • We did not run a direct prompt-only ChatGPT or Claude baseline for S0/S1, so we cannot make a clean comparison
  • The closest evidence is S3, which uses Claude Sonnet 4.6 in an agentic RAG pipeline and achieves much higher quality (COMET 0.9552 vs S1’s 0.7303)
  • However, S3 is not apples-to-apples — it uses additional tools, retrieval, and self-audit that a plain ChatGPT/Claude prompt would not have
  • The honest answer: a SOTA model with good prompting would likely outperform a fine-tuned 0.5B model, but the point of S0/S1 was to study what fine-tuning alone can do at small scale