[Question]: The parsing language of the uploaded document is inaccurate
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Describe your problem
My dataset is in Chinese, but there is no document language option in the configuration. I searched earlier issues and saw that the language is supposed to be selected automatically based on the content of the document. However, the document language shown in the backend log output is English, and the content of the knowledge graph I constructed is also in English. How can I solve this problem?
Hello @0zxq0! I'm Dosu, your friendly bot here to help you solve bugs, answer questions, and get you on your way to becoming a contributor while we wait for a human maintainer. Let's tackle your issue together! 🤖
I found a similar open issue titled "[Question]: the language preference of knowledge graph chunk method" which might be relevant to your problem [1]. This issue has no comments or references, but it indicates that language preferences in knowledge graph methods are a topic of interest.
To address your problem, RAGFlow uses server-side language detection with os.getenv("LANG") to set the default language. If the environment variable LANG contains 'zh_CN', the default language is set to 'Chinese'; otherwise, it defaults to 'English'. Additionally, RAGFlow dynamically determines the language by detecting the content itself during processing, which means it can handle multilingual datasets automatically without requiring manual language specification [2].
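The default-language rule described above can be sketched roughly as follows. This is only an illustration of the behavior stated in this comment, not RAGFlow's actual code: the function name `default_language` and the `env` parameter are assumptions added so the logic can be tried without changing the real environment.

```python
import os
from typing import Mapping, Optional


def default_language(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a default language from the LANG environment variable.

    Hypothetical helper: mirrors the rule described in this thread
    ('zh_CN' in LANG means Chinese, anything else means English).
    """
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    lang = env.get("LANG", "")
    return "Chinese" if "zh_CN" in lang else "English"


if __name__ == "__main__":
    # Shows what the current server environment would resolve to.
    print(default_language())
```

Running this on the server that hosts the backend is a quick way to check what `LANG` would resolve to under that rule.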
For embedding model selection, using models like BAAI/bge-large-zh-v1.5@BAAI or maidalun1020/bce-embedding-base_v1@Youdao is recommended for Chinese language datasets. These models ensure that files in a specific knowledge base are converted to embeddings using the same model, which is crucial for comparing them in the same embedding space [3][4].
Ensure your server's environment variable LANG is set to 'zh_CN' to default to Chinese processing, and update the knowledge configuration to specify 'Chinese' as the document language. This setup will help in correctly processing your Chinese dataset while the log document output and knowledge graph content remain in English [2].
Many of the prompts are hardcoded in individual Python files. You could edit them and translate them into Chinese to solve this problem.
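One way to picture this kind of change, as a rough sketch only: keep a translated copy of each hardcoded prompt and select it by language. The prompt strings, the `PROMPTS` dictionary, and the `pick_prompt` function below are all hypothetical and are not taken from RAGFlow's source files.

```python
# Hypothetical prompt texts for illustration; not copied from RAGFlow.
PROMPTS = {
    "English": "Extract the entities and relations from the text below.",
    "Chinese": "请从下面的文本中提取实体和关系。",
}


def pick_prompt(language: str) -> str:
    """Return the prompt for the given language, defaulting to English."""
    return PROMPTS.get(language, PROMPTS["English"])


if __name__ == "__main__":
    print(pick_prompt("Chinese"))
```

Editing the prompt string in place, as suggested above, is the simpler route; a lookup like this only matters if both languages need to keep working.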
I have the same problem; the automatically generated questions are sometimes in Chinese and sometimes in English.
Could you let me know which prompt the picture parser uses to have the LLM describe pictures? I found some prompts in prompts.py, but I'm not sure which one it is.
Translate the prompts below.