S2: Advanced RAG

Weighted multi-strata retrieval with cross-encoder reranking. Queries four knowledge sources in parallel, reranks for precision, then generates with high-signal context.

Pipeline
1. Query
2. Retrieve
3. Rerank
4. Generate
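
A minimal sketch of that flow in Python; every helper name and signature below is a hypothetical stand-in for the real component, not the project's actual API:

```python
# Minimal sketch of the Query -> Retrieve -> Rerank -> Generate flow.
# All helpers are hypothetical stubs, not the project's API.

def retrieve_all_strata(query: str) -> list[str]:
    """Step 2: fan out to the four knowledge strata (stubbed here)."""
    return ["parallel-corpus chunk", "style-guide chunk"]

def rerank(query: str, chunks: list[str]) -> list[str]:
    """Step 3: order candidates by cross-encoder relevance (stubbed)."""
    return chunks  # the real version sorts by reranker score

def generate(query: str, context: list[str]) -> str:
    """Step 4: translate with the reranked context in the prompt (stubbed)."""
    return f"<translation of {query!r} with {len(context)} context chunks>"

def translate(query: str) -> str:
    candidates = retrieve_all_strata(query)  # step 2: retrieve
    context = rerank(query, candidates)      # step 3: rerank
    return generate(query, context)          # step 4: generate

print(translate("Enter your password."))
```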
Retrieval Strata
4 parallel vector queries from AWS S3 Vectors:

- EN-JA Parallel Corpus: 46,616 chunks · top-3
- Gemini Annotated Examples: 499 chunks · top-2
- Grammar Guide: 81 chunks · top-1
- JTF Style Guide: 181 chunks · top-1
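
A sketch of how the fan-out might look, using the per-stratum top-k values listed above; `query_stratum` is a hypothetical wrapper around a single S3 Vectors index query, and its parameters and return shape are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# Per-stratum retrieval depths, matching the weights listed above.
STRATA_TOP_K = {
    "en-ja-parallel-corpus":     3,
    "gemini-annotated-examples": 2,
    "grammar-guide":             1,
    "jtf-style-guide":           1,
}

def query_stratum(index: str, query_vec: list[float], top_k: int) -> list[dict]:
    """Hypothetical wrapper around one S3 Vectors index query; the real
    call, parameters, and return shape are assumptions."""
    return [{"stratum": index, "rank": r, "text": "..."} for r in range(top_k)]

def retrieve_all_strata(query_vec: list[float]) -> list[dict]:
    """Fire all four vector queries in parallel and merge their hits."""
    with ThreadPoolExecutor(max_workers=len(STRATA_TOP_K)) as pool:
        futures = [
            pool.submit(query_stratum, index, query_vec, k)
            for index, k in STRATA_TOP_K.items()
        ]
        return [hit for f in futures for hit in f.result()]

hits = retrieve_all_strata([0.1] * 1024)  # 7 candidates: 3 + 2 + 1 + 1
```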
Evaluation Results
- BLEU: 64.93 (char tokenized)
- chrF++: 67.81 (word order 2)
- COMET: 0.9416 (wmt22-da)
- Hit@K: 0.998 (retrieval)

499 samples · reranker: BAAI bge-reranker-v2-m3
| Metric | Value |
|---|---|
| Avg Retrieval (ms) | 2,241 |
| Avg Coverage Score | 0.6121 |
| Terminology Samples | 311 |
| Terminology Terms Total | 552 |
| Terminology Correct Terms | 57 |
| Terminology Accuracy | 0.1033 |
| Retrieval Recall@K | 0.999 |
| Error Binary F1 | 0.5254 |
| Error Category Macro F1 | 0.0 |
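
The metric labels suggest the standard tooling: sacrebleu for BLEU and chrF++, and Unbabel's COMET for wmt22-da. A sketch of scoring with those settings; the hypothesis/reference strings are placeholders, and the project's exact configuration is an assumption:

```python
from sacrebleu.metrics import BLEU, CHRF

# Placeholder hypothesis and reference; refs is one reference stream.
hyps = ["パスワードを入力してください。"]
refs = [["パスワードを入力してください。"]]

bleu = BLEU(tokenize="char")   # "char tokenized", as labeled above
chrf = CHRF(word_order=2)      # chrF++ is chrF with word_order=2
print(bleu.corpus_score(hyps, refs).score)
print(chrf.corpus_score(hyps, refs).score)

# COMET (wmt22-da) needs the unbabel-comet package and a model download:
# from comet import download_model, load_from_checkpoint
# model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
# data = [{"src": "Enter your password.", "mt": hyps[0], "ref": refs[0][0]}]
# print(model.predict(data, batch_size=8).system_score)
```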
Strengths & Limitations

Best surface-level scores across all variants. BLEU 64.93 and chrF++ 67.81 mean the translations closely match the reference wording. The reranker filters out noise from the large parallel corpus, keeping only the most relevant examples as context for the model.
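
A minimal reranking sketch with the model named in the results above (BAAI/bge-reranker-v2-m3) via the FlagEmbedding package; the candidate passages here are placeholders:

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "Enter your password."
candidates = [
    "Enter your password. → パスワードを入力してください。",
    "The weather is nice today. → 今日はいい天気です。",
]

# Score each (query, passage) pair jointly; higher means more relevant.
scores = reranker.compute_score([[query, c] for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for text, score in ranked:
    print(f"{score:+.2f}  {text}")
```

Unlike the bi-encoder embeddings used for the initial vector search, a cross-encoder reads the query and passage jointly, which is what buys the extra precision.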

Strong meaning preservation. COMET reaches 0.9416, a big jump from S0/S1's ~0.73. The model isn't just copying surface patterns; it's producing translations that preserve the original meaning well.

Near-perfect retrieval. Hit@K 0.998 and Recall@K 0.999: the weighted multi-strata approach queries all four knowledge sources every time, so it almost always finds relevant context. The tradeoff is speed; brute-force retrieval over every stratum averages 2,241 ms, slower than selective retrieval.
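
For reference, a sketch of Hit@K as typically defined; whether the project counts hits exactly this way is an assumption:

```python
def hit_at_k(retrieved: list[list[str]], gold: list[str]) -> float:
    """Fraction of samples whose gold chunk appears in the top-K results."""
    hits = sum(gold_id in topk for topk, gold_id in zip(retrieved, gold))
    return hits / len(gold)

# 0.998 over 499 samples means the gold chunk was missed roughly once.
print(hit_at_k([["a", "b"], ["c", "d"]], ["b", "x"]))  # -> 0.5
```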

Terminology accuracy is still low at 0.1033. Even though the retrieval finds glossary entries, the model doesn’t always use the approved terms. It sees “password → パスワード” in the context but might still choose a different word. Retrieval gets the right information to the model, but doesn’t force compliance.
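
A sketch of what such a terminology-compliance check looks like: scan each output for the approved target term from every triggered glossary entry. The function name and data shapes are assumptions, not the project's API; the reported figures are 57 correct out of 552 terms:

```python
# Count how many approved glossary terms actually appear in the outputs.
def terminology_accuracy(outputs: list[str], expected_terms: list[list[str]]) -> float:
    total = correct = 0
    for output, terms in zip(outputs, expected_terms):
        for term in terms:
            total += 1
            correct += term in output  # naive substring match
    return correct / total if total else 0.0

# Toy example with the glossary entry from the text.
print(terminology_accuracy(
    ["パスワードを入力してください。"],  # output uses the approved term
    [["パスワード"]],
))  # -> 1.0 here; the evaluation reports 57/552 ≈ 0.1033
```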

Error detection is a weakness. Error Category Macro F1 is 0.0: the system's error classifier doesn't agree with the gold labels on any category. Binary F1 of 0.5254 shows it can often tell that an error is present, but it assigns the wrong category.
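
A sketch of both scores with scikit-learn on toy labels. Macro F1 averages the per-category F1 values equally, so a classifier that detects errors but systematically assigns the wrong category collapses toward 0.0 even while binary F1 stays moderate:

```python
from sklearn.metrics import f1_score

# Toy labels: did the classifier flag an error at all? (binary F1)
gold_binary = [1, 0, 1, 1, 0]
pred_binary = [1, 0, 0, 1, 1]
print(f1_score(gold_binary, pred_binary))

# Which category was the error? (macro F1 averages per-category F1)
gold_cat = ["terminology", "none", "omission", "grammar", "none"]
pred_cat = ["grammar", "none", "grammar", "terminology", "omission"]
print(f1_score(gold_cat, pred_cat, average="macro"))
```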

Try the S3 Agentic Translator