[Question]: Cross-language Retrieval Issue in RAGFlow
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Describe your problem
We're experiencing a cross-language retrieval issue in RAGFlow: When the knowledge base contains only English documents, cross-language queries work properly and can successfully retrieve relevant English documents. However, after adding Chinese documents (which contain the relevant keywords) to the same knowledge base, the system can no longer find the same documents when asking identical questions. The retrieval scores for English documents are significantly lower than those for Chinese documents in test queries. What adjustments should we make to resolve this cross-language retrieval performance issue?
The main cause of your cross-language retrieval issue is likely the embedding model configuration. RAGFlow requires a multilingual embedding model for effective retrieval across languages; if you use a monolingual model (e.g., English-only), adding Chinese documents will degrade retrieval for English queries and vice versa. All documents in a knowledge base must use the same embedding model, and this model cannot be changed once chunks exist—so mixing languages with a monolingual model leads to poor cross-language performance [source].
To resolve this:
- Check your current embedding model: In the RAGFlow UI or via API, verify which embedding model (`embd_id`) your knowledge base uses. If it's not a multilingual model (like BAAI/BGE-m3, Qwen/text-embedding-v4, Cohere embed-multilingual, Voyage voyage-multilingual, or intfloat multilingual-e5), you will need to migrate.
- Switch to a multilingual embedding model: If your knowledge base uses a monolingual model, you must delete all existing chunks and re-ingest your documents using a multilingual embedding model. This ensures all embeddings are in the same vector space and support cross-language queries [source].
- Recommended models: Use models such as `BAAI/BGE-m3`, `text-embedding-v4` (Qwen), or other supported multilingual models listed in your RAGFlow configuration [source].
- Environment and analyzer setup: If you use Elasticsearch, configure it with a Chinese analyzer (IK or smartcn) to ensure proper segmentation for Chinese text. Also, set the `LANG` environment variable to `zh_CN.UTF-8` before creating the knowledge base and restart the service. Changing `LANG` after creation does not update the KB language retroactively [source].
- Tune retrieval parameters: Adjust similarity thresholds and vector similarity weights in retrieval settings to optimize results for your multilingual dataset.
- Verify embedding effectiveness: Use the `/check_embedding` API endpoint to confirm your embedding model is producing effective vectors for your knowledge base. An average cosine similarity above 0.99 indicates good embedding quality [source].
- Upgrade if needed: Recent RAGFlow releases (v0.22.1+) include important fixes for embedding model availability and cross-language search. Make sure you are on the latest version [source].
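To make the 0.99 figure concrete, here is a minimal sketch of an average cosine similarity computed over pairs of vectors. It does not call RAGFlow. The assumption is that the check compares pairs of vectors (for example, stored chunk vectors against freshly computed ones). The function names and the sample vectors are hypothetical and only illustrate the arithmetic.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_cosine_similarity(pairs):
    """Mean cosine similarity over a non-empty list of (vector, vector) pairs."""
    if not pairs:
        raise ValueError("need at least one vector pair")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical sample: each pair is (stored vector, recomputed vector).
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
    ([0.6, 0.8, 0.0], [0.6, 0.8, 0.0]),
]
avg = average_cosine_similarity(pairs)
passes = avg > 0.99  # threshold quoted in the answer above
print(f"average cosine similarity: {avg:.4f}, passes: {passes}")
```

Identical vectors in every pair give an average close to 1.0. A noticeably lower value would suggest the two sets of vectors were not produced by the same model.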
Note: Even with a multilingual embedding model, true cross-language retrieval (e.g., querying in Chinese to retrieve English documents) is not fully supported in all scenarios as of v0.22.1; queries may still favor documents in the same language [source].
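One way to see why same-language documents can win, and why the vector similarity weight matters, is a toy hybrid score. The sketch below assumes the final score is a weighted sum of a keyword similarity and a vector similarity, filtered by a threshold. This is an illustrative model, not RAGFlow's actual implementation, and every name and number in it is hypothetical.

```python
def hybrid_score(keyword_sim, vector_sim, vector_weight):
    """Weighted sum of keyword and vector similarity (assumed scoring model)."""
    if not 0.0 <= vector_weight <= 1.0:
        raise ValueError("vector_weight must be between 0 and 1")
    return (1.0 - vector_weight) * keyword_sim + vector_weight * vector_sim


def rank(candidates, vector_weight, threshold):
    """Score (name, keyword_sim, vector_sim) tuples, drop those below threshold, sort descending."""
    scored = [
        (name, hybrid_score(kw, vec, vector_weight))
        for name, kw, vec in candidates
    ]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical scores for a Chinese query: the English document shares no
# keywords with the query (keyword_sim = 0.0) but is semantically closest.
candidates = [
    ("english_doc", 0.0, 0.75),
    ("chinese_doc", 0.60, 0.65),
]

low = rank(candidates, vector_weight=0.3, threshold=0.3)
high = rank(candidates, vector_weight=0.9, threshold=0.3)
print("vector_weight=0.3:", low)
print("vector_weight=0.9:", high)
```

With these made-up numbers, a low vector weight drops the English document below the threshold, while a high vector weight ranks it first. If the real scoring behaves similarly, raising the vector similarity weight and lowering the similarity threshold would be the first settings to try.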
For more details on embedding model configuration and multilingual support, see the official documentation.
What about selecting both?
Thank you for your answer. I tried selecting both Chinese and English for cross-lingual search, but it still didn't work. My knowledge base primarily consists of Chinese documents with a small number of English documents. Are there any effective solutions to this problem? And would setting up prompt words be effective?
@yanzhen1031 What's your version or commit ID? Can you share the document with me via [email protected]?