[Bug]: Auto-question in Spanish when knowledge is in English
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
RAGFlow workspace code commit ID
a5cf6fc
RAGFlow image version
v0.15.0 slim
Other environment information
Running in WSL2 on Docker, with CUDA.
Actual behavior
I uploaded about 200 documents, all letters from the same organization, all in the English language, in Word format. The organization is in the United States. I asked it to create 4 auto-keywords and 1 auto-question per chunk. The keywords for each chunk were created in English, but the questions were all auto-created in Spanish. I can't see the whole question in the UI, but I can read enough to know that the question is relevant - it's just not in English! Is there any way to fix this? I have checked multiple files in this knowledge base, and every chunk I have looked at has a question in Spanish. All my localizations also appear to be set to English, wherever I can find them.
Expected behavior
I expected the auto-generated questions to be in English, or the same language the documents are in.
Steps to reproduce
Make a new dataset.
Make the selections as above - 4 auto-keywords, 1 auto-question per chunk.
Upload several Word .docx files, all in English.
Select the files and bulk parse them.
Check any one of the files and double click on a chunk. The question will be in Spanish, while everything else is in English.
Additional information
No response
What LLM did you choose? It primarily depends on LLM.
OpenAI’s ChatGPT 4o. It’s set to English too, though, as far as I can determine.
On Mon, Dec 23, 2024 at 9:08 PM Kevin Hu @.***> wrote:
What LLM did you choose? It primarily depends on LLM.
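Since the output language is said to depend mainly on the LLM, one possible mitigation is to state the desired language explicitly in the prompt that asks for the questions. The following is only a minimal sketch under that assumption: `buildQuestionPrompt`, its options, and the prompt wording are hypothetical and are not taken from RAGFlow's code.

```typescript
// Hypothetical helper for illustration; this is NOT RAGFlow's actual prompt or API.
interface QuestionPromptOptions {
  chunk: string; // the chunk text the questions should be about
  count: number; // how many questions to ask for
  language?: string; // e.g. 'English'; when omitted, ask the model to match the chunk
}

function buildQuestionPrompt({ chunk, count, language }: QuestionPromptOptions): string {
  // Pin the output language explicitly instead of leaving it to the model.
  const languageRule = language
    ? `Write every question in ${language}.`
    : 'Write every question in the same language as the text.';

  return [
    `Propose ${count} question(s) that the text below answers.`,
    languageRule,
    'Return one question per line and nothing else.',
    '',
    'Text:',
    chunk,
  ].join('\n');
}

// Example: one English question for an English chunk.
console.log(
  buildQuestionPrompt({
    chunk: 'Dear member, your renewal is due in March.',
    count: 1,
    language: 'English',
  }),
);
```

Whether a model follows such an instruction still varies by model, so this narrows the problem rather than guaranteeing the result.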
I am no programmer, but I looked at the code for anything having to do with "question" and found this: https://github.com/infiniflow/ragflow/blob/main/web/src/pages/flow/form/rewrite-question-form/index.tsx
It has two mentions of "useTranslate" in it. Could that be what is causing this?
import LLMSelect from '@/components/llm-select';
import { useTranslate } from '@/hooks/common-hooks';
import { Form, InputNumber } from 'antd';
import { IOperatorForm } from '../../interface';

const RewriteQuestionForm = ({ onValuesChange, form }: IOperatorForm) => {
  const { t } = useTranslate('chat');

  return (
    <Form
      name="basic"
      labelCol={{ span: 4 }}
      wrapperCol={{ span: 20 }}
      onValuesChange={onValuesChange}
      autoComplete="off"
      form={form}
    >
      <Form.Item
        name={'llm_id'}
        label={t('model', { keyPrefix: 'chat' })}
        tooltip={t('modelTip', { keyPrefix: 'chat' })}
      >
        <LLMSelect></LLMSelect>
      </Form.Item>
      <Form.Item
        label={t('loop', { keyPrefix: 'flow' })}
        name="loop"
        initialValue={1}
      >
        <InputNumber />
      </Form.Item>
    </Form>
  );
};

export default RewriteQuestionForm;
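For what it's worth, in the component pasted above `t(...)` is only passed to the `label` and `tooltip` props, so it appears to pick the text shown next to the form fields rather than anything sent to the LLM. A rough sketch of what such a lookup usually does, assuming a plain key-to-string table: `makeTranslate` and the `resources` table are made up for illustration and are not RAGFlow's real hook or translation files.

```typescript
// Hypothetical, simplified stand-in for an i18n lookup; NOT RAGFlow's actual useTranslate hook.
type Resources = Record<string, Record<string, Record<string, string>>>;

// Illustrative label tables: locale -> namespace (keyPrefix) -> key -> label.
const resources: Resources = {
  en: { chat: { model: 'Model' }, flow: { loop: 'Loop' } },
  es: { chat: { model: 'Modelo' }, flow: { loop: 'Bucle' } },
};

function makeTranslate(locale: string, defaultPrefix: string) {
  // Returns a t() that maps a key to a static UI label for the given locale.
  return (key: string, opts?: { keyPrefix?: string }): string => {
    const prefix = opts?.keyPrefix ?? defaultPrefix;
    // Fall back to the key itself when no label is defined.
    return resources[locale]?.[prefix]?.[key] ?? key;
  };
}

const t = makeTranslate('en', 'chat');
console.log(t('model', { keyPrefix: 'chat' })); // label for the model selector
console.log(t('loop', { keyPrefix: 'flow' })); // label for the loop counter
```

If the real hook works along these lines, it would change the language of the form labels only, which would make it an unlikely source of Spanish questions in parsed chunks.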
I looked at some newer chunks I generated from a different dataset and found that one of them also had Spanish keywords generated, though most were English. https://github.com/infiniflow/ragflow/blob/main/web/src/pages/flow/form/keyword-extract-form/index.tsx also has a mention of "useTranslate" in it.
Hi, I'm seeing some weird behaviour that might be connected. I'm running v0.16.
When creating a KB, if I use either type of OCR (DeepDoc or an OpenAI model) on one particular PDF that is in English, I get gibberish:
ThMee diDciarlei crste osrp ofnos ri:b le EnsuarlCilPn SgpO o liacnipder so cedauswr eea lsla, p plilcaawibsnl cel OundtianRrgei gou l enacptuerds tuoaS ntta taurtaeed ,h etroie tndh oep eraotfti hpoern e m Revieawnisdni gg,na inan ngn udaelc laorfha itsi/orhnee srp ons Ensutrhipanatgt ireenctoa rredess t abalnimdsa hiendt aaicnceudrl,ae tgeic,bol mep,lf eotleal consifsotremmnaette l,te gisrleaqtuiivreea mnaeddn htetsro,te h C eP SMOe diRceaclop rodl Revieuwpidnagta,in indmg p,l emoefOn tHiPpn ogl iacnipder so ceidtnuh rfeeo sl laowrie o Admini
If I use the "Text" type, it picks up the correct text from the PDF.
The KB setting is English.
- Another issue: it sometimes randomly breaks up words by inserting a space, and it doesn't handle hyphens correctly when a word was split across lines in the original document. Is there a way to work around this?
Thanks! Alex
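On the split-word question, one possible workaround is to post-process the extracted text and rejoin words that were hyphenated at a line break. A minimal sketch, assuming plain ASCII English text: `joinHyphenatedLineBreaks` is a hypothetical helper and not part of RAGFlow.

```typescript
// Hypothetical post-processing helper for illustration; NOT part of RAGFlow.
function joinHyphenatedLineBreaks(text: string): string {
  // Match: a letter, a hyphen, optional spaces, a line break, optional spaces,
  // then a lowercase letter. Treat that as one word split across two lines
  // and drop the hyphen and the break.
  return text.replace(/([A-Za-z])-[ \t]*\r?\n[ \t]*([a-z])/g, '$1$2');
}

console.log(
  joinHyphenatedLineBreaks('All applicable regula-\ntions are reviewed annually.'),
);
// prints "All applicable regulations are reviewed annually."
```

A rule like this cannot tell a line-break hyphen from a real one, so a compound such as "well-known" that happens to break at the hyphen would be joined into "wellknown". It also does nothing for the stray spaces inserted inside words.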