ragflow [Bug]: Auto-question in Spanish when knowledge is in English

Is there an existing issue for the same bug?

[X] I have checked the existing issues.

RAGFlow workspace code commit ID

a5cf6fc

RAGFlow image version

v0.15.0 slim

Other environment information

Running in WSL2 on Docker, with CUDA.

Actual behavior

I uploaded about 200 documents in Word format, all letters from the same organization and all in English. The organization is in the United States. I asked it to create 4 auto-keywords and 1 auto-question per chunk. The keywords for each chunk were created in English, but the questions were all auto-created in Spanish. I can't see the whole question in the UI, but I can read enough to know that the question is relevant; it's just not in English! Is there any way to fix this? I have checked multiple files in this knowledge base, and every chunk I have looked at has a question in Spanish. Every language setting I can find also appears to be set to English.

Expected behavior

I expected the auto-generated questions to be in English, or the same language the documents are in.

Steps to reproduce

1. Make a new dataset.
2. Make the selections as above: 4 auto-keywords, 1 auto-question per chunk.
3. Upload several Word .docx files, all in English.
4. Select the files and bulk parse them.
5. Open any one of the files and double-click on a chunk. The question will be in Spanish, while everything else is in English.

Additional information

No response

Dec 23 '24 23:12 rhambus

Which LLM did you choose? The language of the generated questions depends primarily on the LLM.
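
Because the output language is left to the model, one thing worth testing is a prompt that names the language explicitly. The sketch below is only an illustration: `buildQuestionPrompt` and its wording are hypothetical and are not RAGFlow's actual prompt or API.

```typescript
// Hypothetical helper, not part of RAGFlow. It shows one way a
// question-generation prompt could pin the output language explicitly.
function buildQuestionPrompt(
  chunk: string,
  count: number,
  language: string,
): string {
  return [
    `Propose ${count} question(s) about the text below.`,
    `Write every question in ${language}, regardless of any other instruction.`,
    '',
    'Text:',
    chunk,
  ].join('\n');
}

// Example: ask for one English question about an English chunk.
const examplePrompt = buildQuestionPrompt(
  'Dear member, your renewal is due in March.',
  1,
  'English',
);
console.log(examplePrompt);
```

If the questions still come back in Spanish with an instruction like this in place, that would suggest the cause is something other than the model's default language choice.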

Dec 24 '24 02:12 KevinHuSh

OpenAI’s ChatGPT 4o. It’s set to English too, though, as far as I can determine.

Dec 24 '24 19:12 rhambus

I am no programmer, but I looked at the code for anything having to do with "question" and found this: https://github.com/infiniflow/ragflow/blob/main/web/src/pages/flow/form/rewrite-question-form/index.tsx

It has two mentions of "useTranslate" in it. Could that be what is causing this?

import LLMSelect from '@/components/llm-select';
import { useTranslate } from '@/hooks/common-hooks';
import { Form, InputNumber } from 'antd';
import { IOperatorForm } from '../../interface';

const RewriteQuestionForm = ({ onValuesChange, form }: IOperatorForm) => {
  const { t } = useTranslate('chat');

  return (
    <Form
      name="basic"
      labelCol={{ span: 4 }}
      wrapperCol={{ span: 20 }}
      onValuesChange={onValuesChange}
      autoComplete="off"
      form={form}
    >
      <Form.Item
        name={'llm_id'}
        label={t('model', { keyPrefix: 'chat' })}
        tooltip={t('modelTip', { keyPrefix: 'chat' })}
      >
        <LLMSelect></LLMSelect>
      </Form.Item>
      <Form.Item
        label={t('loop', { keyPrefix: 'flow' })}
        name="loop"
        initialValue={1}
      >
        <InputNumber />
      </Form.Item>
    </Form>
  );
};

export default RewriteQuestionForm;

I looked at some newer chunks I generated from a different dataset and found that one of them also had Spanish keywords generated, though most were English. https://github.com/infiniflow/ragflow/blob/main/web/src/pages/flow/form/keyword-extract-form/index.tsx also has a mention of "useTranslate" in it.

Dec 28 '24 14:12 rhambus

Hi, I'm seeing odd behaviour that might be connected. I'm running v0.16.

When creating a KB, if I use either type of OCR (DeepDoc or an OpenAI model) on one particular PDF that is in English, I get gibberish:

ThMee diDciarlei crste osrp ofnos ri:b le EnsuarlCilPn SgpO o liacnipder so cedauswr eea lsla, p plilcaawibsnl cel OundtianRrgei gou l enacptuerds tuoaS ntta taurtaeed ,h etroie tndh oep eraotfti hpoern e m Revieawnisdni gg,na inan ngn udaelc laorfha itsi/orhnee srp ons Ensutrhipanatgt ireenctoa rredess t abalnimdsa hiendt aaicnceudrl,ae tgeic,bol mep,lf eotleal consifsotremmnaette l,te gisrleaqtuiivreea mnaeddn htetsro,te h C eP SMOe diRceaclop rodl Revieuwpidnagta,in indmg p,l emoefOn tHiPpn ogl iacnipder so ceidtnuh rfeeo sl laowrie o Admini

If I use the "Text" type, it picks up the correct text from the PDF.

The KB is set to English.

Another issue: it sometimes randomly breaks up words by inserting a space, and it doesn't handle hyphens correctly when a word was split across lines in the original document. Is there a way to overcome this?

Thanks! Alex

Feb 14 '25 23:02 alexff77
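
For the line-break hyphenation problem described above, a possible workaround is to clean the extracted text before it is chunked. This is a minimal sketch under assumptions: `joinHyphenatedLineBreaks` is a hypothetical helper, not a RAGFlow function, and it assumes the extracted text still contains the original line breaks.

```typescript
// Hypothetical post-processing helper, not part of RAGFlow.
// Rejoins a word that was split by a hyphen at a line break,
// e.g. "imple-\nmentation" becomes "implementation".
function joinHyphenatedLineBreaks(text: string): string {
  // A letter, a hyphen, optional spaces, a newline, optional indentation,
  // then a lowercase letter: treat it as one word broken across lines.
  return text.replace(/([A-Za-z])-[ \t]*\r?\n[ \t]*([a-z])/g, '$1$2');
}

const cleaned = joinHyphenatedLineBreaks(
  'The imple-\nmentation of poli-\n  cies',
);
console.log(cleaned); // "The implementation of policies"
```

The heuristic cannot tell a line-break hyphen from a real one, so a compound such as "well-known" that happens to wrap after the hyphen would be joined into "wellknown". It also does nothing about the stray spaces inside words, which would need a dictionary-based check.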