Weighted multi-strata retrieval with cross-encoder reranking. Queries four knowledge sources in parallel, reranks for precision, then generates with high-signal context.
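The pipeline can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the source names, weights, stores, and the lexical-overlap stand-in for the cross-encoder are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-source weights and toy document stores (illustrative only).
SOURCE_WEIGHTS = {"parallel_corpus": 0.4, "glossary": 0.3,
                  "style_guide": 0.2, "translation_memory": 0.1}
STORES = {
    "parallel_corpus": ["reset your password", "enter the password again"],
    "glossary": ["password -> パスワード"],
    "style_guide": ["use polite form"],
    "translation_memory": ["change password screen"],
}

def simple_overlap(q, d):
    """Toy first-stage relevance: word overlap between query and document."""
    qs, ds = set(q.lower().split()), set(d.lower().split())
    return len(qs & ds) / (len(qs) or 1)

def retrieve(source, query, k=20):
    """Stand-in for a per-source retriever returning its top-k documents."""
    scored = [(doc, simple_overlap(query, doc)) for doc in STORES[source]]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

def cross_encoder_score(query, doc):
    # Placeholder for a real cross-encoder (a sentence-pair relevance model);
    # here it just reuses lexical overlap so the sketch stays self-contained.
    return simple_overlap(query, doc)

def weighted_multi_strata(query, k_final=5):
    # 1. Query all four knowledge sources in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {s: pool.submit(retrieve, s, query) for s in SOURCE_WEIGHTS}
        candidates = []
        for source, fut in futures.items():
            for doc, score in fut.result():
                # 2. Weight first-stage scores by source.
                candidates.append((doc, score * SOURCE_WEIGHTS[source]))
    # 3. Rerank the pooled candidates with the cross-encoder, keep top-k.
    reranked = sorted(candidates,
                      key=lambda p: cross_encoder_score(query, p[0]),
                      reverse=True)
    return [doc for doc, _ in reranked[:k_final]]

top = weighted_multi_strata("reset password")
```

The key design point is that every source is queried on every request and reranking happens over the pooled candidates, which is what drives the near-perfect recall (and the latency) reported below.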
| Metric | Value |
|---|---|
| Avg Retrieval Latency (ms) | 2,241 |
| Avg Coverage Score | 0.6121 |
| Terminology Samples | 311 |
| Terminology Terms Total | 552 |
| Terminology Correct Terms | 57 |
| Terminology Accuracy | 0.1033 |
| Retrieval Recall@K | 0.999 |
| Error Binary F1 | 0.5254 |
| Error Category Macro F1 | 0.0 |
Best surface-level scores across all variants. BLEU 64.93 and chrF++ 67.81 mean the translations closely match the reference wording. The reranker filters out noise from the large parallel corpus, keeping only the most relevant examples as context for the model.
Strong meaning preservation. COMET reaches 0.9416, a large jump from S0/S1's ~0.73. The model isn't just copying surface patterns; it's producing translations that preserve the original meaning well.
Near-perfect retrieval. Hit@K 0.998 and Recall@K 0.999 — the weighted multi-strata approach queries all four knowledge sources every time, so it almost always finds relevant context. The tradeoff is that this brute-force approach is slower than selective retrieval.
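Hit@K and Recall@K can be computed as in this minimal sketch (the function names and toy data are illustrative, not the evaluation harness itself):

```python
def hit_at_k(retrieved, relevant, k):
    """1 if any relevant doc appears in the top-k results, else 0."""
    return int(any(doc in relevant for doc in retrieved[:k]))

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Toy evaluation set: (ranked retrieval results, gold relevant docs).
queries = [
    (["d1", "d3", "d7"], {"d3", "d9"}),
    (["d2", "d9"], {"d2"}),
]
hits = sum(hit_at_k(r, rel, k=3) for r, rel in queries) / len(queries)
recall = sum(recall_at_k(r, rel, k=3) for r, rel in queries) / len(queries)
# hits == 1.0, recall == 0.75
```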
Terminology accuracy is still low at 0.1033. Even though the retrieval finds glossary entries, the model doesn’t always use the approved terms. It sees “password → パスワード” in the context but might still choose a different word. Retrieval gets the right information to the model, but doesn’t force compliance.
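A glossary-compliance metric like this one can be sketched as below; the function and the sample data are hypothetical, chosen only to mirror the password → パスワード example above:

```python
def terminology_accuracy(samples):
    """samples: list of (translation, glossary) pairs, where glossary maps
    a source term to the approved target term. A term counts as correct
    if the approved target term appears in the translation."""
    total = correct = 0
    for translation, glossary in samples:
        for required in glossary.values():
            total += 1
            if required in translation:
                correct += 1
    return correct / total if total else 0.0

samples = [
    ("パスワードを再設定してください", {"password": "パスワード"}),
    ("暗証番号を入力", {"password": "パスワード"}),  # synonym used instead of the approved term
]
# terminology_accuracy(samples) == 0.5
```

This also shows why retrieval alone can't lift the score: the glossary entry can be present in context, but the metric only rewards the approved term actually appearing in the output.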
Error detection is a weakness. Error Category Macro F1 is 0.0: the system's error classifier never agrees with the gold labels on a per-category basis. A Binary F1 of 0.5254 shows it can often tell that *some* error exists, but the category it assigns doesn't match the gold label.
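This split, where binary detection scores reasonably while per-category macro F1 collapses to zero, is easy to reproduce. The metric implementations and labels below are a sketch, not the actual evaluation code:

```python
def binary_f1(gold, pred):
    """F1 for the binary task 'an error exists' (label != 'none')."""
    tp = sum(1 for g, p in zip(gold, pred) if g != "none" and p != "none")
    fp = sum(1 for g, p in zip(gold, pred) if g == "none" and p != "none")
    fn = sum(1 for g, p in zip(gold, pred) if g != "none" and p == "none")
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(gold, pred, categories):
    """Unweighted mean of per-category F1 scores."""
    scores = []
    for c in categories:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        scores.append(0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

# Toy labels: the classifier often flags that an error exists,
# but never assigns the correct category.
gold = ["terminology", "omission", "none", "mistranslation"]
pred = ["omission", "none", "terminology", "terminology"]
cats = ["terminology", "omission", "mistranslation"]
# binary_f1(gold, pred) ≈ 0.667, macro_f1(gold, pred, cats) == 0.0
```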