S2: Advanced RAG

Weighted multi-strata retrieval with cross-encoder reranking. Queries four knowledge sources in parallel, reranks for precision, then generates with high-signal context.

Pipeline
1. Query
2. Retrieve
3. Rerank
4. Generate
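
A minimal sketch of that flow in Python; every helper name and signature below is a hypothetical stand-in for the real component, not the project's actual API:

```python
# Minimal sketch of the Query -> Retrieve -> Rerank -> Generate flow.
# All helpers are hypothetical stubs, not the project's API.

def retrieve_all_strata(query: str) -> list[str]:
    """Step 2: fan out to the four knowledge strata (stubbed here)."""
    return ["parallel-corpus chunk", "style-guide chunk"]

def rerank(query: str, chunks: list[str]) -> list[str]:
    """Step 3: order candidates by cross-encoder relevance (stubbed)."""
    return chunks  # the real version sorts by reranker score

def generate(query: str, context: list[str]) -> str:
    """Step 4: translate with the reranked context in the prompt (stubbed)."""
    return f"<translation of {query!r} with {len(context)} context chunks>"

def translate(query: str) -> str:
    candidates = retrieve_all_strata(query)  # step 2: retrieve
    context = rerank(query, candidates)      # step 3: rerank
    return generate(query, context)          # step 4: generate

print(translate("Enter your password."))
```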
Retrieval Strata
4 parallel vector queries from AWS S3 Vectors:

- EN-JA Parallel Corpus: 46,616 chunks · top-3
- Gemini Annotated Examples: 499 chunks · top-2
- Grammar Guide: 81 chunks · top-1
- JTF Style Guide: 181 chunks · top-1
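
A sketch of how the fan-out might look, using the per-stratum top-k values listed above; `query_stratum` is a hypothetical wrapper around a single S3 Vectors index query, and its parameters and return shape are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# Per-stratum retrieval depths, matching the weights listed above.
STRATA_TOP_K = {
    "en-ja-parallel-corpus":     3,
    "gemini-annotated-examples": 2,
    "grammar-guide":             1,
    "jtf-style-guide":           1,
}

def query_stratum(index: str, query_vec: list[float], top_k: int) -> list[dict]:
    """Hypothetical wrapper around one S3 Vectors index query; the real
    call, parameters, and return shape are assumptions."""
    return [{"stratum": index, "rank": r, "text": "..."} for r in range(top_k)]

def retrieve_all_strata(query_vec: list[float]) -> list[dict]:
    """Fire all four vector queries in parallel and merge their hits."""
    with ThreadPoolExecutor(max_workers=len(STRATA_TOP_K)) as pool:
        futures = [
            pool.submit(query_stratum, index, query_vec, k)
            for index, k in STRATA_TOP_K.items()
        ]
        return [hit for f in futures for hit in f.result()]

hits = retrieve_all_strata([0.1] * 1024)  # 7 candidates: 3 + 2 + 1 + 1
```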
Evaluation Results
- BLEU: 64.93 (char tokenized)
- chrF++: 67.81 (word order 2)
- COMET: 0.9416 (wmt22-da)
- Hit@K: 0.998 (retrieval)

499 samples · reranker: BAAI bge-reranker-v2-m3
| Metric | Value |
|---|---|
| Avg Retrieval (ms) | 2,241 |
| Avg Coverage Score | 0.6121 |
| Terminology Samples | 311 |
| Terminology Terms Total | 552 |
| Terminology Correct Terms | 57 |
| Terminology Accuracy | 0.1033 |
| Retrieval Recall@K | 0.999 |
| Error Binary F1 | 0.5254 |
| Error Category Macro F1 | 0.0 |
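
The metric labels suggest the standard tooling: sacrebleu for BLEU and chrF++, and Unbabel's COMET for wmt22-da. A sketch of scoring with those settings; the hypothesis/reference strings are placeholders, and the project's exact configuration is an assumption:

```python
from sacrebleu.metrics import BLEU, CHRF

# Placeholder hypothesis and reference; refs is one reference stream.
hyps = ["パスワードを入力してください。"]
refs = [["パスワードを入力してください。"]]

bleu = BLEU(tokenize="char")   # "char tokenized", as labeled above
chrf = CHRF(word_order=2)      # chrF++ is chrF with word_order=2
print(bleu.corpus_score(hyps, refs).score)
print(chrf.corpus_score(hyps, refs).score)

# COMET (wmt22-da) needs the unbabel-comet package and a model download:
# from comet import download_model, load_from_checkpoint
# model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
# data = [{"src": "Enter your password.", "mt": hyps[0], "ref": refs[0][0]}]
# print(model.predict(data, batch_size=8).system_score)
```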
Strengths & Limitations

Best surface-level scores across all variants. BLEU 64.93 and chrF++ 67.81 mean the translations closely match the reference wording. The reranker filters out noise from the large parallel corpus, keeping only the most relevant examples as context for the model.
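
A minimal reranking sketch with the model named in the results above (BAAI/bge-reranker-v2-m3) via the FlagEmbedding package; the candidate passages here are placeholders:

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "Enter your password."
candidates = [
    "Enter your password. → パスワードを入力してください。",
    "The weather is nice today. → 今日はいい天気です。",
]

# Score each (query, passage) pair jointly; higher means more relevant.
scores = reranker.compute_score([[query, c] for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for text, score in ranked:
    print(f"{score:+.2f}  {text}")
```

Unlike the bi-encoder embeddings used for the initial vector search, a cross-encoder reads the query and passage jointly, which is what buys the extra precision.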

Strong meaning preservation. COMET reaches 0.9416, a big jump from S0/S1's ~0.73. The model isn't just copying surface patterns; it's producing translations that preserve the original meaning well.

Near-perfect retrieval. Hit@K 0.998 and Recall@K 0.999: the weighted multi-strata approach queries all four knowledge sources every time, so it almost always finds relevant context. The tradeoff is speed; brute-force retrieval over every stratum averages 2,241 ms, slower than selective retrieval.
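
For reference, a sketch of Hit@K as typically defined; whether the project counts hits exactly this way is an assumption:

```python
def hit_at_k(retrieved: list[list[str]], gold: list[str]) -> float:
    """Fraction of samples whose gold chunk appears in the top-K results."""
    hits = sum(gold_id in topk for topk, gold_id in zip(retrieved, gold))
    return hits / len(gold)

# 0.998 over 499 samples means the gold chunk was missed roughly once.
print(hit_at_k([["a", "b"], ["c", "d"]], ["b", "x"]))  # -> 0.5
```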

Terminology accuracy is still low at 0.1033. Even though the retrieval finds glossary entries, the model doesn’t always use the approved terms. It sees “password → パスワード” in the context but might still choose a different word. Retrieval gets the right information to the model, but doesn’t force compliance.
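
A sketch of what such a terminology-compliance check looks like: scan each output for the approved target term from every triggered glossary entry. The function name and data shapes are assumptions, not the project's API; the reported figures are 57 correct out of 552 terms:

```python
# Count how many approved glossary terms actually appear in the outputs.
def terminology_accuracy(outputs: list[str], expected_terms: list[list[str]]) -> float:
    total = correct = 0
    for output, terms in zip(outputs, expected_terms):
        for term in terms:
            total += 1
            correct += term in output  # naive substring match
    return correct / total if total else 0.0

# Toy example with the glossary entry from the text.
print(terminology_accuracy(
    ["パスワードを入力してください。"],  # output uses the approved term
    [["パスワード"]],
))  # -> 1.0 here; the evaluation reports 57/552 ≈ 0.1033
```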

Error detection is a weakness. Error Category Macro F1 is 0.0: the system's error classifier doesn't agree with the gold labels on any category. Binary F1 of 0.5254 shows it can often tell that an error is present, but it assigns the wrong category.
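
A sketch of both scores with scikit-learn on toy labels. Macro F1 averages the per-category F1 values equally, so a classifier that detects errors but systematically assigns the wrong category collapses toward 0.0 even while binary F1 stays moderate:

```python
from sklearn.metrics import f1_score

# Toy labels: did the classifier flag an error at all? (binary F1)
gold_binary = [1, 0, 1, 1, 0]
pred_binary = [1, 0, 0, 1, 1]
print(f1_score(gold_binary, pred_binary))

# Which category was the error? (macro F1 averages per-category F1)
gold_cat = ["terminology", "none", "omission", "grammar", "none"]
pred_cat = ["grammar", "none", "grammar", "terminology", "omission"]
print(f1_score(gold_cat, pred_cat, average="macro"))
```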

Try the S3 Agentic Translator