Complete Results Table
| Metric | S0 | S1 | S2 | S3 |
|---|---|---|---|---|
| Num Samples | 4254 | 4254 | 499 | 250 |
| BLEU (char) | 16.29 | 17.64 | 64.93 | 46.45 |
| chrF++ | 12.76 | 12.91 | 67.81 | 36.8 |
| COMET | 0.7291 | 0.7303 | 0.9416 | 0.9284 |
| Term Accuracy | 0.2125 | 0.2257 | 0.1033 | 0.6092 |
| Avg Retrieval (ms) | — | — | 2241 | 566.6 |
| Coverage Score | — | — | 0.6121 | 0.8294 |
| Retrieval Hit@K | — | — | 0.998 | 0.2209 |
| Retrieval Recall@K | — | — | 0.999 | 0.2239 |
| Error Binary F1 | — | — | 0.5254 | 0.3502 |
| Error Cat Macro F1 | — | — | 0.0 | 0.0866 |
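The Retrieval Hit@K and Recall@K rows can be read as follows — a minimal sketch, assuming each query carries a ranked list of retrieved IDs and a set of gold IDs (the function and variable names here are illustrative, not taken from the pipeline):

```python
def hit_at_k(retrieved, gold, k):
    """Hit@K: 1 if any gold item appears in the top-k retrieved list."""
    return 1.0 if any(r in gold for r in retrieved[:k]) else 0.0

def recall_at_k(retrieved, gold, k):
    """Recall@K: fraction of gold items found in the top-k retrieved list."""
    if not gold:
        return 0.0
    return len(set(retrieved[:k]) & set(gold)) / len(gold)

# Toy corpus: (ranked retrieved IDs, gold ID set) per query.
queries = [
    (["d3", "d7", "d1"], {"d7"}),        # gold found at rank 2
    (["d2", "d4", "d9"], {"d5", "d9"}),  # one of two gold items in top-3
]
k = 3
hit = sum(hit_at_k(r, g, k) for r, g in queries) / len(queries)
rec = sum(recall_at_k(r, g, k) for r, g in queries) / len(queries)
print(hit, rec)  # 1.0 0.75
```

Averaging these per-query scores over the evaluation set yields the table's Hit@K and Recall@K values.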
⚠ Sample size caveat: S0 and S1 were evaluated on 4,254 samples (test_v1), S2 on 499 samples, and S3 on 250 samples. Cross-column comparisons should be interpreted cautiously because the evaluation sets differ in both size and composition.
ⓘ Why do terminology sample counts vary so much? The glossary CSV originally contained 526 entries, but ~200 of those were contaminated rows with garbage Japanese values scraped from web pages (e.g. “sign”, “live”, “business” mapped to meaningless strings). After filtering to only clean, domain-specific tech terms (password, login, account, settings, etc.), the glossary dropped to ~320 curated entries. These real terms appear far less often in general-domain test sentences (Tatoeba, JParaCrawl), so fewer sentences qualify as terminology samples. The lower count is the honest number; the higher counts in earlier runs were inflated by noise from the contaminated glossary.
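The filtering step described above can be sketched as follows — an illustrative heuristic, not necessarily the exact rule used: drop any row whose Japanese value contains no hiragana, katakana, or CJK ideographs, since such rows are almost certainly scraped garbage. The `en`/`ja` field names are assumptions about the CSV layout.

```python
import re

# Hiragana, katakana, and CJK unified ideographs: a "Japanese" value
# containing none of these characters is likely web-scrape noise.
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def clean_glossary(rows):
    """Keep only rows whose target value actually contains Japanese text."""
    return [r for r in rows if JA_CHARS.search(r["ja"])]

rows = [
    {"en": "password", "ja": "パスワード"},
    {"en": "sign", "ja": "a1b2c3"},  # contaminated: no Japanese characters
    {"en": "settings", "ja": "設定"},
]
print(len(clean_glossary(rows)))  # 2
```

A real pipeline would likely combine this with a manual pass, since a regex cannot distinguish a valid but unusual translation from noise.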
ⓘ Domain mismatch note: Our glossary is built around tech/UI terminology (password, settings, download, etc.), but the test set draws from general-domain corpora. This means many test sentences simply don’t contain any glossary terms. Despite this mismatch, S3’s terminology accuracy (0.6092) is still the highest across all variants: when a glossary term does appear, the agentic pipeline uses the approved form more reliably than any other system.
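The terminology-accuracy numbers above can be sketched as follows — a simplified illustration assuming plain substring matching (the actual metric may use tokenization or morphological matching; the names here are hypothetical):

```python
def term_accuracy(pairs, glossary):
    """For each (source, hypothesis) pair, count glossary terms whose
    source form appears in the source sentence, and score a hit when the
    approved target form appears in the hypothesis."""
    matched = total = 0
    for src, hyp in pairs:
        src_lower = src.lower()
        for en_term, ja_term in glossary.items():
            if en_term in src_lower:
                total += 1
                if ja_term in hyp:
                    matched += 1
    return matched / total if total else 0.0

glossary = {"password": "パスワード", "settings": "設定"}
pairs = [
    ("Enter your password.", "パスワードを入力してください。"),  # approved form used
    ("Open the settings menu.", "セッティングメニューを開く。"),  # approved form missed
]
print(term_accuracy(pairs, glossary))  # 0.5
```

Only sentences containing at least one glossary term contribute to the denominator, which is why the terminology sample count shrinks when the glossary is cleaned.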