azure-search-openai-demo I would like to make the app be able to take more than just pdf documents.

Open oracle-code opened this issue 1 year ago • 1 comments

Please provide us with the following information:

This issue is for a: (mark with an `x`)

- [ ] bug report -> please search issues before submitting
- [x] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

none

Any log messages given by the failure

none

Expected/desired behavior

What code needs to be modified, added, or written to make this capability possible?

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful

Thanks! We'll be in touch soon.

Feb 01 '24 19:02 oracle-code

Which document type are you specifically looking for? We're merging two PRs very soon that should help, but it depends on the type.

Feb 02 '24 00:02 pamelafox

Are HTML document types on the table? That would be huge for our organization.

Feb 26 '24 20:02 guestC-eskerSA

Great question! Apparently the new Document Intelligence version does support parsing HTML, but I neglected to add it to the list. You can add it here: https://github.com/Azure-Samples/azure-search-openai-demo/blob/2e79777f01f355f6e01e5b288ff290e39921bf16/scripts/prepdocs.py#L71

I'll send a PR to add it shortly.
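For readers following along, adding a file type to a list of supported extensions might look roughly like the sketch below. This is only an illustration, not the code at the linked line of `prepdocs.py`: the set contents and the helper names `register_extension` and `is_supported` are hypothetical.

```python
from pathlib import Path

# Hypothetical example set; the real list lives in scripts/prepdocs.py
# and may contain different entries.
EXAMPLE_EXTENSIONS = frozenset({".pdf", ".docx"})


def register_extension(extensions, ext):
    """Return a new set that also contains `ext`, normalized to '.ext' lowercase."""
    normalized = ext.lower()
    if not normalized.startswith("."):
        normalized = "." + normalized
    return frozenset(extensions) | {normalized}


def is_supported(filename, extensions):
    """Check whether the file's suffix (case-insensitive) is in the set."""
    return Path(filename).suffix.lower() in extensions


with_html = register_extension(EXAMPLE_EXTENSIONS, "HTML")
print(is_supported("handbook.html", with_html))
```

Whether a given extension is actually parsed correctly still depends on the service or parser behind it; the list only controls which files are picked up.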

Feb 26 '24 21:02 pamelafox

See https://github.com/Azure-Samples/azure-search-openai-demo/pull/1325

Feb 27 '24 14:02 pamelafox

From what I understood, Document Intelligence needs to support the document type, but you also need a parser written in Python for each type of document. Is that correct? We would like to handle email, Slack conversations, Confluence documents, etc. These are the data sources organizations actually use, and this is where the added value is.

Feb 27 '24 16:02 oracle-code

@hammadilyes You only need a local Python parser for HTML if you decide not to use Azure Document Intelligence. There is a local HTMLParser in another repo that we could bring in, if folks want a non-DocIntel option: https://github.com/microsoft/sample-app-aoai-chatGPT/blob/16c260c5abd420ea29379c9be7e52c8069c63901/scripts/data_utils.py#L315

And yes, if you're trying to parse other document types, you'll need to write your own parser that's similar to the existing local parsers.
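As a rough idea of what a local, non-DocIntel parser for HTML could involve, here is a minimal sketch built only on Python's standard-library `html.parser`. It is not the parser from either repository, and it does not implement this repo's parser interface: the names `_TextExtractor` and `parse_html` are hypothetical, and a real parser would also need to fit whatever page/offset structure the ingestion script expects.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def parse_html(content: str) -> str:
    """Hypothetical helper: return the visible text of an HTML string."""
    extractor = _TextExtractor()
    extractor.feed(content)
    extractor.close()
    return extractor.text()


sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>"
print(parse_html(sample))
```

This sketch puts each text fragment on its own line, so inline markup such as `<b>` splits a sentence across lines. Parsers for other sources (email, Slack exports, Confluence pages) would follow the same pattern: read the raw format and emit plain text for chunking and indexing.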

Feb 27 '24 18:02 pamelafox

I am confused. So you don't need the Python parser files when using the Document Intelligence service?

Feb 27 '24 18:02 oracle-code
