azure-search-openai-demo
I would like the app to accept more than just PDF documents.
Please provide us with the following information:
This issue is for a: (mark with an x)
- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)
Minimal steps to reproduce
none
Any log messages given by the failure
none
Expected/desired behavior
What code do I need to modify, add, or write to make this capability possible?
OS and Version?
Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)
azd version?
Run `azd version` and copy and paste the output here.
Versions
Mention any other details that might be useful
Thanks! We'll be in touch soon.
Which document type are you specifically looking for? We're merging two PRs very soon that should help, but it depends on the type.
Are HTML document types on the table? That would be huge for our organization
Great question! Apparently the new Document Intelligence version does support parsing HTML, but I neglected to add it to the list. You can add it here: https://github.com/Azure-Samples/azure-search-openai-demo/blob/2e79777f01f355f6e01e5b288ff290e39921bf16/scripts/prepdocs.py#L71
I'll send a PR to add it shortly.
See https://github.com/Azure-Samples/azure-search-openai-demo/pull/1325
From what I understood, Document Intelligence needs to support the document type, but you also need a parser written in Python for each type of document. Is that correct? We would like to handle email, Slack conversations, Confluence documents, etc. Organizations rely on these data sources, and this is where the added value is.
@hammadilyes You only need a local Python parser for HTML if you decide not to use Azure Document Intelligence. There is a local HTMLParser in another repo that we could bring in, if folks want a non-DocIntel option: https://github.com/microsoft/sample-app-aoai-chatGPT/blob/16c260c5abd420ea29379c9be7e52c8069c63901/scripts/data_utils.py#L315
And yes, if you're trying to parse other document types, you'll need to write your own parser that's similar to the existing local parsers.
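To make "write your own parser" a bit more concrete, here is a minimal sketch of what a local HTML parser could look like using only the Python standard library. This is not the repo's actual parser interface: `parse_html`, `parse_file`, and the `LOCAL_PARSERS` table are hypothetical names chosen for illustration, and a real parser would need to match whatever the existing local parsers in `prepdocs.py` expect.

```python
import os
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def parse_html(content: str) -> str:
    """Hypothetical local parser: HTML string in, plain text out."""
    extractor = _TextExtractor()
    extractor.feed(content)
    extractor.close()
    return extractor.text()


# Hypothetical extension-to-parser table (an assumption, not the repo's list).
LOCAL_PARSERS = {
    ".html": parse_html,
    ".txt": lambda content: content,
}


def parse_file(filename: str, content: str) -> str:
    """Pick a parser by file extension; fail loudly for unsupported types."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in LOCAL_PARSERS:
        raise ValueError(f"No local parser registered for '{ext}'")
    return LOCAL_PARSERS[ext](content)
```

Supporting another source (email, Slack exports, and so on) would then mean writing one more function that turns the raw content into text and registering it under its extension.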
I am confused. So you don't need the Python parser files when using the Document Intelligence service?