Evaluation Metrics

Quantitative comparison across all four system variants — S0 baseline through S3 agentic RAG.

- Best COMET: 0.9416 (S2 RAG)
- Best BLEU: 64.93 (S2 RAG)
- Best Term Accuracy: 0.6092 (S3 Agentic)
- Best chrF++: 67.81 (S2 RAG)
Translation Quality
[Chart] BLEU, chrF++, and COMET (×100) across all variants: S0 Baseline, S1 Fine-Tuned, S2 RAG, S3 Agentic.

COMET Progression
[Chart] Semantic quality score (wmt22-comet-da), scale 0.6 to 1.0.

Terminology Accuracy
[Chart] Glossary term compliance rate.
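Terminology accuracy is reported as a glossary term compliance rate. A minimal sketch of one plausible definition, assuming substring matching; the function name and matching heuristic are illustrative, not the project's actual code:

```python
# Sketch of a glossary-compliance metric (illustrative, with substring
# matching as a simplifying assumption): count every glossary source term
# that appears in a source sentence, and score a hit when the approved
# target-side form appears in the translation.

def term_accuracy(samples, glossary):
    """samples: list of (source, translation); glossary: {src_term: tgt_term}."""
    hits = total = 0
    for source, translation in samples:
        for src_term, tgt_term in glossary.items():
            if src_term in source:           # glossary term occurs in the source
                total += 1
                if tgt_term in translation:  # approved form used in the output
                    hits += 1
    return hits / total if total else 0.0

glossary = {"password": "パスワード", "login": "ログイン"}
samples = [
    ("Enter your password to login.", "ログインするにはパスワードを入力してください。"),
    ("Reset your password.", "暗証番号をリセットしてください。"),  # non-approved rendering
]
print(term_accuracy(samples, glossary))  # 2 of 3 term occurrences use the approved form
```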
Retrieval Metrics — S2 vs S3

| Metric | S2 RAG (499 samples) | S3 Agentic (250 samples) |
|---|---|---|
| Hit@K | 0.998 | 0.2209 |
| Recall@K | 0.999 | 0.2239 |
| Coverage | 0.6121 | 0.8294 |
| Retrieval Latency | 2,241 ms | 566.6 ms |
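Hit@K and Recall@K are assumed here to follow the standard retrieval definitions: Hit@K is the fraction of queries whose top-K list contains at least one relevant item, and Recall@K is the fraction of each query's relevant items found in the top K, averaged over queries. A pure-Python sketch with illustrative data:

```python
# Assumed standard definitions of the retrieval metrics reported above.
# retrieved: one ranked list of candidate IDs per query;
# relevant: one set of gold relevant IDs per query.

def hit_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant item in the top K."""
    hits = sum(1 for ranked, rel in zip(retrieved, relevant) if set(ranked[:k]) & rel)
    return hits / len(retrieved)

def recall_at_k(retrieved, relevant, k):
    """Per-query recall of relevant items in the top K, macro-averaged."""
    recalls = [len(set(ranked[:k]) & rel) / len(rel)
               for ranked, rel in zip(retrieved, relevant) if rel]
    return sum(recalls) / len(recalls)

retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d9"}]
print(hit_at_k(retrieved, relevant, k=3))     # 0.5
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
```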
Error Detection — S2 vs S3

| Metric | S2 | S3 |
|---|---|---|
| Binary F1 | 0.5254 | 0.3502 |
| Category Macro F1 | 0.0 | 0.0866 |
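Binary F1 scores the yes/no "does this sentence contain an error?" decision, while category macro F1 averages per-category F1 over all error categories, so rare categories count as much as common ones. A stdlib-only sketch under those standard definitions; the label names and data are illustrative, and the project may compute the scores differently:

```python
# Standard F1 definitions (assumed), with no sklearn dependency.

def f1_binary(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-category F1, so rare categories count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_binary(y_true, y_pred, positive=c) for c in labels) / len(labels)

# Binary: 1 = "sentence contains an error".
print(f1_binary([1, 0, 1, 0], [1, 1, 1, 0]))  # 0.8

# Macro: averaged over error categories (hypothetical labels).
y_true = ["omission", "none", "terminology", "none"]
y_pred = ["omission", "terminology", "terminology", "none"]
print(f1_macro(y_true, y_pred))
```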
Latency & Speed
[Chart] Average latency per sample (ms); lower is better.
Complete Results Table

| Metric | S0 | S1 | S2 | S3 |
|---|---|---|---|---|
| Num Samples | 4,254 | 4,254 | 499 | 250 |
| BLEU (char) | 16.29 | 17.64 | 64.93 | 46.45 |
| chrF++ | 12.76 | 12.91 | 67.81 | 36.8 |
| COMET | 0.7291 | 0.7303 | 0.9416 | 0.9284 |
| Term Accuracy | 0.2125 | 0.2257 | 0.1033 | 0.6092 |
| Avg Retrieval (ms) | n/a | n/a | 2241 | 566.6 |
| Coverage Score | n/a | n/a | 0.6121 | 0.8294 |
| Retrieval Hit@K | n/a | n/a | 0.998 | 0.2209 |
| Retrieval Recall@K | n/a | n/a | 0.999 | 0.2239 |
| Error Binary F1 | n/a | n/a | 0.5254 | 0.3502 |
| Error Cat Macro F1 | n/a | n/a | 0.0 | 0.0866 |
Sample size caveat: S0 and S1 were evaluated on 4,254 samples (test_v1), S2 on 499 samples, and S3 on 250 samples. Cross-column comparisons should be interpreted cautiously due to different evaluation set sizes and compositions.
Why do terminology sample counts vary so much? The glossary CSV originally contained 526 entries, but ~200 of those were contaminated rows with garbage Japanese values scraped from web pages (e.g. “sign”, “live”, “business” mapped to meaningless strings). After filtering to only clean, domain-specific tech terms (password, login, account, settings, etc.), the glossary dropped to ~320 curated entries. These real terms appear far less often in general-domain test sentences (Tatoeba, JParaCrawl), so fewer sentences qualify as terminology samples. The lower count is the honest number — the higher counts in earlier runs were inflated by noise from the contaminated glossary.
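The cleanup described above can be sketched as a two-part filter: keep a glossary row only if its source term is on a curated tech-term allowlist and its target side actually contains Japanese script. The allowlist contents, the Unicode-range heuristic, and the function name are assumptions for illustration, not the project's exact filter:

```python
# Hedged sketch of the glossary cleanup: drop contaminated rows whose target
# side has no Japanese script, and keep only curated domain-specific terms.

import csv
import io
import re

JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # hiragana, katakana, kanji
TECH_TERMS = {"password", "login", "account", "settings", "download"}  # illustrative

def clean_glossary(csv_text):
    rows = csv.reader(io.StringIO(csv_text))
    return {src: tgt for src, tgt in rows
            if src.lower() in TECH_TERMS and JAPANESE.search(tgt)}

raw = "password,パスワード\nlogin,ログイン\nsign,click here now\nbusiness,12345\n"
print(clean_glossary(raw))  # only the two clean, domain-specific entries survive
```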
Domain mismatch note: Our glossary is built around tech/UI terminology (password, settings, download, etc.), but the test set draws from general-domain corpora. This means many test sentences simply don’t contain any glossary terms. Despite this mismatch, S3’s terminology accuracy (0.6092) is still the highest across all variants — when a glossary term does appear, the agentic pipeline uses the approved form more reliably than any other system.
Conclusion
- S0/S1: Fine-tuning a 0.5B model gave marginal gains (COMET +0.0012). Not enough.
- S2: Retrieval was the biggest jump (COMET 0.73 → 0.94). Showing good examples beats memorisation.
- S3: The agentic loop solved terminology, with roughly 6× the term accuracy of S2. Slower, but more reliable.
Retrieval and agency outperformed fine-tuning by a wide margin. The ideal production system would combine all three: fine-tuned model for fluency, retrieval for domain knowledge, agentic loop for quality assurance.
With more time
- 7B+ fine-tuning
- a combined S1+S3 pipeline
- domain-matched eval sets
- an error-ID adapter
- S3 latency optimisation