
Problems indexing knowledge and accentuation documents in PT-BR

Open dersonbsb2022 opened this issue 1 year ago • 4 comments

Self Checks

  • [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.9.1-fix1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

When I insert a document in Brazilian Portuguese, the entire indexing process runs without errors. The result in question-and-answer format is almost entirely correct; the only problem is in how the keywords are formed.

Words with accents in Portuguese are broken, for example:

  • portões - port es
  • horários - hor rios

This creates wrong keywords that hinder the search. After I correct these words manually, the results improve considerably.

To convert the document, I use:

  • Model: openai-4o-mini
  • Index mode: segmenting in Question & Answer format
  • Segment using: Portuguese
  • Embedding model: text-embedding-3-small
  • Retrieval: hybrid search, top k 6
  • Rerank model: cohere free - multilingual - v3.0

The question about the bug is:

Is it a bug in Dify? Is it a bug in OpenAI? Is it a bug in Cohere?

I don't know what steps I can take to try to solve the problem. No further information appears in the logs, even in DEBUG mode.

[Two screenshots attached: 2024-10-13 19:07:07 and 2024-10-13 19:05:26]

✔️ Expected Behavior

The keywords should be generated correctly.

Examples: <Portões> <Horários>

or, at least, without the accents: <Portoes> <Horarios>

and not broken.

❌ Actual Behavior

The generated keywords are <port es> <ho rios>

Removing important letters, adding a space in place of the accented letter and leaving the keywords meaningless.
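
The symptom described above (each accented letter replaced by a space) is what an ASCII-only cleanup step would produce. The following is a minimal sketch of that mechanism, not Dify's actual code: the function names `ascii_only_cleanup` and `unicode_aware_cleanup` are hypothetical, and whether anything like this runs in the keyword pipeline is an assumption that the thread does not confirm.

```python
import re


def ascii_only_cleanup(text: str) -> str:
    """Hypothetical cleanup: anything outside ASCII letters/digits becomes a space."""
    spaced = re.sub(r"[^A-Za-z0-9]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()


def unicode_aware_cleanup(text: str) -> str:
    """Same idea, but \\w on a str pattern also matches accented letters."""
    spaced = re.sub(r"[^\w]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()


if __name__ == "__main__":
    for word in ("portões", "horários"):
        print(word, "->", repr(ascii_only_cleanup(word)), "vs", repr(unicode_aware_cleanup(word)))
```

With precomposed (NFC) input, the ASCII-only variant turns "portões" into "port es" and "horários" into "hor rios", matching the broken keywords reported here, while the Unicode-aware variant leaves both words intact.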

dersonbsb2022, Oct 13 '24 22:10

Hello @dersonbsb2022! I'm a bot that can help you resolve bugs, answer questions, and contribute. I'm here to help while you wait for a human maintainer.

I found a similar discussion regarding issues with accented characters in keyword generation, specifically in French documents [1]. This might indicate a broader issue with handling accented characters in Dify.

For your specific case with Portuguese documents, the problem could be related to the text-embedding-3-small model or the rerank-multilingual-v3.0 model, as these are involved in the indexing and retrieval process. Unfortunately, there isn't specific information available about how these models handle accented characters.

To address this issue, you might consider:

  1. Checking Model Documentation: Look into the documentation for the models you're using to see if there are known limitations with accented characters.
  2. Preprocessing Text: As a temporary workaround, preprocess your text to normalize accented characters before indexing.
  3. Alternative Models: Experiment with different models that might handle accented characters better.
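
The "Preprocessing Text" suggestion above can be sketched with Python's standard `unicodedata` module. This is only an illustration under assumptions: the helper names `to_nfc` and `strip_accents` are hypothetical, and whether preprocessing helps depends on where in the pipeline the accented characters are lost.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Compose base letters and combining marks into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text: str) -> str:
    """Drop combining marks after decomposing, e.g. 'õ' becomes 'o'."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks)


if __name__ == "__main__":
    print(strip_accents("Portões"))   # Portoes
    print(strip_accents("Horários"))  # Horarios
```

`to_nfc` keeps the accents but guarantees one code point per accented letter, which rules out decomposed input as a cause. `strip_accents` is the more aggressive option and changes the text that gets indexed.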

If the issue persists, it might be worth reporting it as a potential bug in the Dify repository, specifying the models and configurations you're using.

To continue talking to Dosu, mention @dosu.

dosubot[bot], Oct 13 '24 22:10

https://github.com/langgenius/dify/blob/9275760599abf6df762b6f8babf7eba54554119d/docker/.env.example#L505-L508

What did you set?

If you set ETL_TYPE to dify, then I guess this will happen.
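
To answer the "What did you set?" question, one option is to read the value straight from the `.env` file used by the Docker deployment. The sketch below is a hypothetical helper (`read_env_value` is not part of Dify), and the sample content is made up for illustration. It assumes a plain `KEY=value` file with `#` comments.

```python
from typing import Optional


def read_env_value(env_text: str, key: str) -> Optional[str]:
    """Return the last value assigned to `key` in .env-style text, or None."""
    found = None
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            found = value.strip().strip('"').strip("'")
    return found


if __name__ == "__main__":
    # Illustrative content only; in practice, read the real file, e.g.
    # open("docker/.env", encoding="utf-8").read()
    sample = "# ETL type\nETL_TYPE=dify\n"
    print(read_env_value(sample, "ETL_TYPE"))
```

If this prints `dify`, file extraction is handled by Dify's built-in extractor, which is the case the comment above is asking about.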

crazywoola, Oct 14 '24 00:10

First of all, thank you for the wonderful project and for your response. @crazywoola Sorry for the delay in responding, but during this time I changed every setting I could. I even forked the EasyPanel project to configure Dify with Unstructured, but without success. Although the service started and I uploaded several files (docx, pdf), everything continued to be processed by Dify. I don't know what I did wrong.

I will follow your suggestion and pre-process the documents before sending them.

Thank you very much.

dersonbsb2022, Oct 15 '24 21:10

Just to document: even though I pre-process the document before uploading it to the knowledge base, the task, when executed (I believe it is OpenAI), puts all the accents back, and the problem continues.

dersonbsb2022, Oct 19 '24 12:10

Hi, @dersonbsb2022. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • The issue involves incorrect handling of accented characters in Brazilian Portuguese documents in Dify version 0.9.1-fix1.
  • Suggested solutions included checking model documentation, preprocessing text, and trying alternative models.
  • @crazywoola mentioned the ETL_TYPE setting as a potential factor.
  • Despite preprocessing attempts, the issue persists, possibly due to OpenAI's processing.

Next Steps

  • Please confirm if this issue is still relevant to the latest version of the Dify repository by commenting here.
  • If there is no further activity, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot], Nov 19 '24 16:11

I used the paid unstructured.io API and generated the documents again and even so the keywords are broken. The problem apparently is not in this process but when saving the tags in the database. I will check inside the database how they are being saved.
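
For the database check mentioned above, one possible approach is to compare each stored keyword against the segment text it came from and flag keywords whose words never occur in that text. The sketch below assumes the keywords and the source text have already been read out of the database as plain strings. The query itself is omitted because the table layout is not given in this thread, and `find_broken_keywords` is a hypothetical helper.

```python
import re
import unicodedata
from typing import Iterable, List


def _words(text: str) -> List[str]:
    """Lowercased, NFC-normalized word tokens; \\w keeps accented letters."""
    normalized = unicodedata.normalize("NFC", text).casefold()
    return re.findall(r"\w+", normalized)


def find_broken_keywords(source_text: str, keywords: Iterable[str]) -> List[str]:
    """Return keywords containing at least one word absent from the source text."""
    vocabulary = set(_words(source_text))
    return [
        keyword
        for keyword in keywords
        if any(word not in vocabulary for word in _words(keyword))
    ]


if __name__ == "__main__":
    text = "Os portões abrem às 8h. Consulte os horários."
    stored = ["portões", "port es", "horários", "hor rios"]
    print(find_broken_keywords(text, stored))  # ['port es', 'hor rios']
```

If the flagged keywords are already broken in the stored rows, the damage happens before or during the write. If the stored values are intact, the display or retrieval path would be the next place to look.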

dersonbsb2022, Nov 21 '24 15:11

Hi, @dersonbsb2022. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The issue involves incorrect handling of accented characters in Brazilian Portuguese documents using Dify version 0.9.1-fix1 in a self-hosted Docker setup.
  • Attempts to preprocess documents and change settings have not resolved the issue, possibly due to OpenAI's processing.
  • @crazywoola suggested checking the ETL_TYPE setting, and I provided guidance on possible solutions, including checking model documentation and trying alternative models.
  • You plan to investigate how tags are saved in the database, as the issue might be related to that.

Next Steps:

  • Please let us know if this issue is still relevant to the latest version of the Dify repository by commenting on this issue.
  • If there are no updates, the issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot], Dec 22 '24 16:12