azure-search-openai-demo

Is there a way to search documents in languages other than English?

Open andycaho opened this issue 1 year ago • 2 comments

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

I uploaded some Traditional Chinese document files and followed the steps to parse them with prepdocs, but when I ask questions in Chinese about those documents, the app cannot answer any of them.

Any log messages given by the failure

Expected/desired behavior

It can answer questions regardless of the document language.

OS and Version?

Windows 11

Versions

Mention any other details that might be useful

I have tried to build my own index and run an indexer with a Chinese analyzer, but it's not working (I already set the env variable AZURE_SEARCH_INDEX to the new index). The default content analyzer in gptkbindex is English.


Thanks! We'll be in touch soon.

andycaho avatar Mar 23 '23 00:03 andycaho

In ./scripts/prepdocs.py, change the function create_search_index() so that it creates an index that can be searched in Chinese or other languages. By default, the content field uses an English analyzer:

SearchableField(name="content", type="Edm.String", analyzer_name="en.microsoft"),

I suggest changing the code and setting analyzer_name="standard.lucene", which seems to work properly for common languages. For more information on the available languages, refer to the docs: https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.models.searchfield?view=azure-python
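As a rough sketch of the change described here, the snippet below builds the content field as a plain dict in the shape the Azure AI Search REST API uses for index fields, so it runs without the SDK. The ANALYZERS table and the content_field helper are hypothetical names used only for illustration; in prepdocs.py the equivalent change is the analyzer_name argument of SearchableField. The zh-Hant.microsoft and zh-Hans.microsoft names come from the service's language analyzer list and may suit Chinese better than standard.lucene, though that is untested here.

```python
# Hypothetical mapping from a document language tag to an Azure AI Search analyzer name.
ANALYZERS = {
    "en": "en.microsoft",
    "zh-Hant": "zh-Hant.microsoft",  # Traditional Chinese
    "zh-Hans": "zh-Hans.microsoft",  # Simplified Chinese
}


def content_field(language: str) -> dict:
    """Return a REST-style definition of the searchable `content` field.

    Languages missing from ANALYZERS fall back to standard.lucene.
    """
    return {
        "name": "content",
        "type": "Edm.String",
        "searchable": True,
        "analyzer": ANALYZERS.get(language, "standard.lucene"),
    }


# Field definition for an index holding Traditional Chinese documents.
field = content_field("zh-Hant")
```

Note that the analyzer of an existing field cannot be changed in place, so the index most likely needs to be deleted and recreated (and the documents re-ingested) after this change.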

Hope this works for you.

gonzalorecio avatar Mar 23 '23 15:03 gonzalorecio

You can try modifying the prompt in app\backend\approaches\chatreadretrieveread.py by changing "If the question is not in English, translate the question to English before generating the search query." to "Please search in the language of the original input of the question, never try to translate it into English."
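A minimal sketch of that prompt edit, assuming the query-generation prompt is held in a plain string. QUERY_PROMPT below is a shortened, hypothetical stand-in, not the actual template from the repository; only the two instruction sentences are taken from this thread, and keep_original_language is an illustrative helper name.

```python
OLD = ("If the question is not in English, translate the question to English "
       "before generating the search query.")
NEW = ("Please search in the language of the original input of the question, "
       "never try to translate it into English.")

# Hypothetical, shortened stand-in for the query-generation prompt.
QUERY_PROMPT = (
    "Generate a search query based on the conversation and the new question.\n"
    + OLD
)


def keep_original_language(prompt: str) -> str:
    """Swap the translate-to-English instruction for the keep-original-language one."""
    if OLD not in prompt:
        # Fail loudly rather than silently leaving the prompt unchanged.
        raise ValueError("translation instruction not found; the template may have changed")
    return prompt.replace(OLD, NEW)


patched = keep_original_language(QUERY_PROMPT)
```

Raising an error when the sentence is missing is a deliberate choice here: if the template wording changes upstream, a silent no-op would leave the translation behavior in place without any warning.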

XunLi-Nick avatar Mar 29 '23 15:03 XunLi-Nick

> If the question is not in English, translate the question to English before generating the search query.

It would be great if this were not part of the default template. Many customers who struggle with Azure OpenAI on your data and non-English documents will have a look at this accelerator.

iMicknl avatar Nov 29 '23 22:11 iMicknl

Hm, good point. We could alter the prompt based on the query_language parameter? That would presumably reflect the language of the documents in the search index.

I can also flesh out our section about query_language into a whole doc, and suggest tweaking this part of the prompt.
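One way the query_language idea could look, as a sketch only: translation_instruction is a hypothetical helper, and treating any value that starts with "en" as English is an assumption about how query_language values are formatted (for example "en-us"). For non-English indexes the sketch simply drops the translation sentence instead of inventing replacement wording.

```python
from typing import Optional

TRANSLATE = ("If the question is not in English, translate the question to English "
             "before generating the search query.")


def translation_instruction(query_language: Optional[str]) -> str:
    """Return the translate-to-English instruction only when the index is English."""
    # Assumption: an unset value means the default English configuration.
    lang = (query_language or "en").lower()
    if lang.startswith("en"):
        return TRANSLATE
    # Non-English index: omit the instruction so queries stay in the question's language.
    return ""
```

The returned string could then be appended to the rest of the query prompt, so an index configured for Chinese never receives the translate-to-English instruction.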

pamelafox avatar Nov 29 '23 22:11 pamelafox