Problems indexing knowledge and accentuation documents in PT-BR
Self Checks
- [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [X] Please do not modify this template :) and fill in all the required fields.
Dify version
0.9.1-fix1
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
When inserting a document in Brazilian Portuguese, the entire indexing process goes smoothly. The result in question and answer format is almost entirely correct; the only problem is with the formation of the keywords.
Words with accents in Portuguese are broken apart, for example:
- portões becomes port es
- horários becomes hor rios
This creates wrong keywords that hinder the search. After I correct these words manually, the results improve considerably.
To convert the document, I use openai-4o-mini with the following settings:
- Index mode: segmenting in Question & Answer format, segment using Portuguese
- Embedding model: text-embedding-3-small
- Retrieval: hybrid search, top k 6
- Rerank model: cohere free, multilingual v3.0
My question about the bug is: is it a bug in Dify, in OpenAI, or in Cohere?
I don't know what steps I can take to try to solve the problem. No further information appears in the logs even when put in DEBUG mode.
✔️ Expected Behavior
The keywords should be generated correctly.
Examples: <Portões> <Horários>
or, at least, <Portoes> <Horarios>
and not broken apart.
❌ Actual Behavior
The generated keywords are port es ho rios
Important letters are removed and a space is inserted in place of the accented letter, leaving the keywords meaningless.
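The symptom described above can be reproduced outside Dify with a short sketch. It assumes, purely as a hypothesis that this thread does not confirm, that some step in keyword extraction matches words with an ASCII-only pattern. The names ASCII_WORD and ascii_only_tokens are made up for illustration and are not part of Dify, OpenAI, or Cohere.

```python
import re
import unicodedata

# Hypothetical word pattern: ASCII letters and digits only.
# This is an assumption for illustration, not Dify's confirmed behavior.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")


def ascii_only_tokens(word: str) -> list[str]:
    """Split a word the way an ASCII-only pattern would see it."""
    # Normalize to NFC so each accented letter is a single code point.
    composed = unicodedata.normalize("NFC", word)
    return ASCII_WORD.findall(composed)


if __name__ == "__main__":
    for word in ("portões", "horários"):
        # The accented letter is skipped, leaving two fragments.
        print(word, "->", " ".join(ascii_only_tokens(word)))
```

If the fragments printed here match the broken keywords in the knowledge base, that would point toward tokenization rather than the embedding or rerank models, although it would not prove it.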
Hello @dersonbsb2022! I'm a bot that can help you solve bugs, answer questions, and contribute. I'm here to help while you wait for a human maintainer.
I found a similar discussion regarding issues with accented characters in keyword generation, specifically in French documents [1]. This might indicate a broader issue with handling accented characters in Dify.
For your specific case with Portuguese documents, the problem could be related to the text-embedding-3-small model or the rerank-multilingual-v3.0 model, as these are involved in the indexing and retrieval process. Unfortunately, there isn't specific information available about how these models handle accented characters.
To address this issue, you might consider:
- Checking Model Documentation: Look into the documentation for the models you're using to see if there are known limitations with accented characters.
- Preprocessing Text: As a temporary workaround, preprocess your text to normalize accented characters before indexing.
- Alternative Models: Experiment with different models that might handle accented characters better.
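The preprocessing workaround in the list above could be sketched as follows, using only the Python standard library. The function name strip_accents is hypothetical, and the sketch assumes that unaccented keywords such as Portoes and Horarios are acceptable, as the reporter suggests under Expected Behavior.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for word in ("portões", "horários"):
        print(word, "->", strip_accents(word))
```

This only changes the text that is uploaded. Any later step that rewrites the text, such as an LLM generating question and answer pairs, may reintroduce accents.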
If the issue persists, it might be worth reporting it as a potential bug in the Dify repository, specifying the models and configurations you're using.
https://github.com/langgenius/dify/blob/9275760599abf6df762b6f8babf7eba54554119d/docker/.env.example#L505-L508
What did you set?
If you set ETL_TYPE to dify, then I guess this will happen.
First of all, thank you for the wonderful project and for your response. @crazywoola, sorry for the delay in responding. During this time I changed all possible settings, and I even forked the EasyPanel project to configure Dify with Unstructured, but without success. Although the service started and I uploaded several files (docx, pdf), everything continued to be processed by Dify. I don't know what I did wrong.
I will follow your suggestion and pre-process the documents before sending them.
Thank you very much.
Just to document: even though I pre-process the document before uploading it to the knowledge base, the indexing task (I believe it is the OpenAI step) returns all the accents, and the problem continues.
Hi, @dersonbsb2022. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.
Issue Summary
- The issue involves incorrect handling of accented characters in Brazilian Portuguese documents in Dify version 0.9.1-fix1.
- Suggested solutions included checking model documentation, preprocessing text, and trying alternative models.
- @crazywoola mentioned the ETL_TYPE setting as a potential factor.
- Despite preprocessing attempts, the issue persists, possibly due to OpenAI's processing.
Next Steps
- Please confirm if this issue is still relevant to the latest version of the Dify repository by commenting here.
- If there is no further activity, this issue will be automatically closed in 15 days.
Thank you for your understanding and contribution!
I used the paid unstructured.io API and generated the documents again, and even so the keywords are broken. The problem apparently is not in this process but in how the tags are saved in the database. I will check inside the database how they are being saved.
Hi, @dersonbsb2022. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.
Issue Summary:
- The issue involves incorrect handling of accented characters in Brazilian Portuguese documents using Dify version 0.9.1-fix1 in a self-hosted Docker setup.
- Attempts to preprocess documents and change settings have not resolved the issue, possibly due to OpenAI's processing.
- @crazywoola suggested checking the ETL_TYPE setting, and I provided guidance on possible solutions, including checking model documentation and trying alternative models.
- You plan to investigate how tags are saved in the database, as the issue might be related to that.
Next Steps:
- Please let us know if this issue is still relevant to the latest version of the Dify repository by commenting on this issue.
- If there are no updates, the issue will be automatically closed in 15 days.
Thank you for your understanding and contribution!