Evaluation Metrics

Quantitative comparison across all four system variants — S0 baseline through S3 agentic RAG.

- Best COMET: 0.9416 (S2 RAG)
- Best BLEU: 64.93 (S2 RAG)
- Best Term Accuracy: 0.6092 (S3 Agentic)
- Best chrF++: 67.81 (S2 RAG)
Translation Quality
[Chart] BLEU, chrF++, and COMET (×100) across all variants: S0 Baseline, S1 Fine-Tuned, S2 RAG, S3 Agentic.

COMET Progression
[Chart] Semantic quality score (wmt22-comet-da), scale 0.6 to 1.0.

Terminology Accuracy
[Chart] Glossary term compliance rate.
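Terminology accuracy is reported as a glossary term compliance rate. A minimal sketch of one plausible definition, assuming substring matching; the function name and matching heuristic are illustrative, not the project's actual code:

```python
# Sketch of a glossary-compliance metric (illustrative, with substring
# matching as a simplifying assumption): count every glossary source term
# that appears in a source sentence, and score a hit when the approved
# target-side form appears in the translation.

def term_accuracy(samples, glossary):
    """samples: list of (source, translation); glossary: {src_term: tgt_term}."""
    hits = total = 0
    for source, translation in samples:
        for src_term, tgt_term in glossary.items():
            if src_term in source:           # glossary term occurs in the source
                total += 1
                if tgt_term in translation:  # approved form used in the output
                    hits += 1
    return hits / total if total else 0.0

glossary = {"password": "パスワード", "login": "ログイン"}
samples = [
    ("Enter your password to login.", "ログインするにはパスワードを入力してください。"),
    ("Reset your password.", "暗証番号をリセットしてください。"),  # non-approved rendering
]
print(term_accuracy(samples, glossary))  # 2 of 3 term occurrences use the approved form
```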
Retrieval Metrics — S2 vs S3

| Metric | S2 RAG (499 samples) | S3 Agentic (250 samples) |
|---|---|---|
| Hit@K | 0.998 | 0.2209 |
| Recall@K | 0.999 | 0.2239 |
| Coverage | 0.6121 | 0.8294 |
| Retrieval Latency | 2,241 ms | 566.6 ms |
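Hit@K and Recall@K are assumed here to follow the standard retrieval definitions: Hit@K is the fraction of queries whose top-K list contains at least one relevant item, and Recall@K is the fraction of each query's relevant items found in the top K, averaged over queries. A pure-Python sketch with illustrative data:

```python
# Assumed standard definitions of the retrieval metrics reported above.
# retrieved: one ranked list of candidate IDs per query;
# relevant: one set of gold relevant IDs per query.

def hit_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant item in the top K."""
    hits = sum(1 for ranked, rel in zip(retrieved, relevant) if set(ranked[:k]) & rel)
    return hits / len(retrieved)

def recall_at_k(retrieved, relevant, k):
    """Per-query recall of relevant items in the top K, macro-averaged."""
    recalls = [len(set(ranked[:k]) & rel) / len(rel)
               for ranked, rel in zip(retrieved, relevant) if rel]
    return sum(recalls) / len(recalls)

retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2"}, {"d9"}]
print(hit_at_k(retrieved, relevant, k=3))     # 0.5
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
```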
Error Detection — S2 vs S3

| Metric | S2 | S3 |
|---|---|---|
| Binary F1 | 0.5254 | 0.3502 |
| Category Macro F1 | 0.0 | 0.0866 |
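Binary F1 scores the yes/no "does this sentence contain an error?" decision, while category macro F1 averages per-category F1 over all error categories, so rare categories count as much as common ones. A stdlib-only sketch under those standard definitions; the label names and data are illustrative, and the project may compute the scores differently:

```python
# Standard F1 definitions (assumed), with no sklearn dependency.

def f1_binary(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def f1_macro(y_true, y_pred):
    """Unweighted mean of per-category F1, so rare categories count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_binary(y_true, y_pred, positive=c) for c in labels) / len(labels)

# Binary: 1 = "sentence contains an error".
print(f1_binary([1, 0, 1, 0], [1, 1, 1, 0]))  # 0.8

# Macro: averaged over error categories (hypothetical labels).
y_true = ["omission", "none", "terminology", "none"]
y_pred = ["omission", "terminology", "terminology", "none"]
print(f1_macro(y_true, y_pred))
```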
Latency & Speed
[Chart] Average latency per sample (ms); lower is better.
Complete Results Table

| Metric | S0 | S1 | S2 | S3 |
|---|---|---|---|---|
| Num Samples | 4,254 | 4,254 | 499 | 250 |
| BLEU (char) | 16.29 | 17.64 | 64.93 | 46.45 |
| chrF++ | 12.76 | 12.91 | 67.81 | 36.8 |
| COMET | 0.7291 | 0.7303 | 0.9416 | 0.9284 |
| Term Accuracy | 0.2125 | 0.2257 | 0.1033 | 0.6092 |
| Avg Retrieval (ms) | n/a | n/a | 2241 | 566.6 |
| Coverage Score | n/a | n/a | 0.6121 | 0.8294 |
| Retrieval Hit@K | n/a | n/a | 0.998 | 0.2209 |
| Retrieval Recall@K | n/a | n/a | 0.999 | 0.2239 |
| Error Binary F1 | n/a | n/a | 0.5254 | 0.3502 |
| Error Cat Macro F1 | n/a | n/a | 0.0 | 0.0866 |
Sample size caveat: S0 and S1 were evaluated on 4,254 samples (test_v1), S2 on 499 samples, and S3 on 250 samples. Cross-column comparisons should be interpreted cautiously due to different evaluation set sizes and compositions.
Why do terminology sample counts vary so much? The glossary CSV originally contained 526 entries, but ~200 of those were contaminated rows with garbage Japanese values scraped from web pages (e.g. “sign”, “live”, “business” mapped to meaningless strings). After filtering to only clean, domain-specific tech terms (password, login, account, settings, etc.), the glossary dropped to ~320 curated entries. These real terms appear far less often in general-domain test sentences (Tatoeba, JParaCrawl), so fewer sentences qualify as terminology samples. The lower count is the honest number — the higher counts in earlier runs were inflated by noise from the contaminated glossary.
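The cleanup described above can be sketched as a two-part filter: keep a glossary row only if its source term is on a curated tech-term allowlist and its target side actually contains Japanese script. The allowlist contents, the Unicode-range heuristic, and the function name are assumptions for illustration, not the project's exact filter:

```python
# Hedged sketch of the glossary cleanup: drop contaminated rows whose target
# side has no Japanese script, and keep only curated domain-specific terms.

import csv
import io
import re

JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # hiragana, katakana, kanji
TECH_TERMS = {"password", "login", "account", "settings", "download"}  # illustrative

def clean_glossary(csv_text):
    rows = csv.reader(io.StringIO(csv_text))
    return {src: tgt for src, tgt in rows
            if src.lower() in TECH_TERMS and JAPANESE.search(tgt)}

raw = "password,パスワード\nlogin,ログイン\nsign,click here now\nbusiness,12345\n"
print(clean_glossary(raw))  # only the two clean, domain-specific entries survive
```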
Domain mismatch note: Our glossary is built around tech/UI terminology (password, settings, download, etc.), but the test set draws from general-domain corpora. This means many test sentences simply don’t contain any glossary terms. Despite this mismatch, S3’s terminology accuracy (0.6092) is still the highest across all variants — when a glossary term does appear, the agentic pipeline uses the approved form more reliably than any other system.
Conclusion
- S0/S1: Fine-tuning a 0.5B model gave marginal gains (COMET +0.0012). Not enough.
- S2: Retrieval was the biggest jump (COMET 0.73 → 0.94). Showing good examples beats memorisation.
- S3: The agentic loop solved terminology, with roughly 6× the term accuracy of S2. Slower, but more reliable.
Retrieval and agency outperformed fine-tuning by a wide margin. The ideal production system would combine all three: fine-tuned model for fluency, retrieval for domain knowledge, agentic loop for quality assurance.
With more time
- 7B+ fine-tuning
- a combined S1+S3 pipeline
- domain-matched eval sets
- an error-ID adapter
- S3 latency optimisation