azure-search-openai-demo

I would like the app to be able to take more than just PDF documents.

Open oracle-code opened this issue 1 year ago • 1 comments

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

none

Any log messages given by the failure

none

Expected/desired behavior

What code needs to be modified, added, or written to make this capability possible?

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

oracle-code avatar Feb 01 '24 19:02 oracle-code

Which document type are you specifically looking for? We're merging two PRs very soon that should help, but it depends on the type.

pamelafox avatar Feb 02 '24 00:02 pamelafox

Are HTML document types on the table? That would be huge for our organization.

guestC-eskerSA avatar Feb 26 '24 20:02 guestC-eskerSA

Great question! Apparently the new Document Intelligence version does support parsing HTML, but I neglected to add it to the list. You can add it here: https://github.com/Azure-Samples/azure-search-openai-demo/blob/2e79777f01f355f6e01e5b288ff290e39921bf16/scripts/prepdocs.py#L71

I'll send a PR to add it shortly.
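
For readers following along, here is a minimal sketch of the general idea of an extension-to-parser list. It is not the code at the linked line: the class names, the `PARSERS_BY_EXTENSION` dictionary, and `pick_parser` are all hypothetical stand-ins used only for illustration.

```python
from pathlib import Path


# Hypothetical stand-ins; the real repository defines its own parser classes.
class DocumentIntelligenceParser:
    name = "document-intelligence"


class LocalTextParser:
    name = "local-text"


# Extension -> parser lookup. Supporting a new format means adding one entry.
PARSERS_BY_EXTENSION = {
    ".pdf": DocumentIntelligenceParser(),
    ".html": DocumentIntelligenceParser(),  # the kind of addition discussed above
    ".txt": LocalTextParser(),
}


def pick_parser(filename: str):
    """Return the parser registered for the file's extension, or None."""
    return PARSERS_BY_EXTENSION.get(Path(filename).suffix.lower())
```

With a lookup like this, a file whose extension is not listed is simply skipped (or reported) rather than parsed.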

pamelafox avatar Feb 26 '24 21:02 pamelafox

See https://github.com/Azure-Samples/azure-search-openai-demo/pull/1325

pamelafox avatar Feb 27 '24 14:02 pamelafox

From what I understand, Document Intelligence needs to support the document type, but you also need a parser written in Python for each type of document. Is that correct? We would like to handle email, Slack conversations, Confluence documents, etc. Organizations rely on these data sources, and this is where the added value is.

oracle-code avatar Feb 27 '24 16:02 oracle-code

@hammadilyes You only need a local Python parser for HTML if you decide not to use Azure Document Intelligence. There is a local HTMLParser in another repo that we could bring in, if folks want a non-DocIntel option: https://github.com/microsoft/sample-app-aoai-chatGPT/blob/16c260c5abd420ea29379c9be7e52c8069c63901/scripts/data_utils.py#L315

And yes, if you're trying to parse other document types, you'll need to write your own parser that's similar to the existing local parsers.
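
As a rough illustration of what "writing your own parser" could look like for email, here is a minimal sketch using only the Python standard library. The `Page` dataclass and the `EmailParser` class are hypothetical and are not the repository's actual interface; a real parser would need to match whatever base class and return type the existing local parsers use.

```python
from dataclasses import dataclass
from email import message_from_bytes, policy
from typing import IO, Iterator


@dataclass
class Page:
    """Hypothetical unit of extracted text; not the repository's own type."""
    page_num: int
    offset: int
    text: str


class EmailParser:
    """Hypothetical local parser for .eml files."""

    def parse(self, content: IO[bytes]) -> Iterator[Page]:
        # policy.default yields an EmailMessage, which provides get_body().
        message = message_from_bytes(content.read(), policy=policy.default)
        body_part = message.get_body(preferencelist=("plain",))
        body = body_part.get_content() if body_part is not None else ""
        subject = message["subject"] or ""
        # Treat the whole message as a single "page" of text.
        text = f"Subject: {subject}\n\n{body.strip()}"
        yield Page(page_num=0, offset=0, text=text)
```

Slack exports or Confluence pages would follow the same shape: read the raw file, extract plain text, and yield it in the unit the rest of the ingestion pipeline expects.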

pamelafox avatar Feb 27 '24 18:02 pamelafox

I am confused. So you don't need the Python files when using the Document Intelligence service?

oracle-code avatar Feb 27 '24 18:02 oracle-code