langchain Docugami DataLoader

Adds a document loader for Docugami

Specifically:

Adds a data loader that talks to the Docugami API to download processed documents as semantic XML
Parses the semantic XML into chunks, with additional metadata capturing chunk semantics
Adds a detailed notebook showing how you can use additional metadata returned by Docugami for techniques like the self-querying retriever
Adds an integration test, and related documentation
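
For readers who want a feel for the chunking step, here is a minimal sketch of turning semantic XML into chunks that carry metadata. It is not the loader's actual implementation. The sample XML, the tag names, the `Chunk` class, and the `chunk_semantic_xml` and `filter_by_tag` helpers are all hypothetical and exist only for illustration; the real Docugami schema and loader API may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical semantic XML. These tag names are illustrative only and are
# NOT taken from Docugami's actual output schema.
SAMPLE_XML = """
<Lease>
  <Parties>
    <Landlord>Birch Street LLC</Landlord>
    <Tenant>Acme Robotics Inc.</Tenant>
  </Parties>
  <Term>
    <CommencementDate>June 1, 2023</CommencementDate>
  </Term>
</Lease>
"""


@dataclass
class Chunk:
    """A piece of text plus metadata describing where it came from."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def chunk_semantic_xml(xml_text: str, source: str) -> List[Chunk]:
    """Emit one chunk per text-bearing leaf element, in document order."""
    root = ET.fromstring(xml_text.strip())
    chunks: List[Chunk] = []

    def walk(node: ET.Element, path: List[str]) -> None:
        current = path + [node.tag]
        text = (node.text or "").strip()
        # Only leaf elements with non-empty text become chunks.
        if len(node) == 0 and text:
            chunks.append(
                Chunk(
                    text=text,
                    metadata={
                        "source": source,
                        "tag": node.tag,  # semantic label of the chunk
                        "xpath": "/" + "/".join(current),  # position in the tree
                    },
                )
            )
        for child in node:
            walk(child, current)

    walk(root, [])
    return chunks


def filter_by_tag(chunks: List[Chunk], tag: str) -> List[Chunk]:
    """Hand-written metadata filter, standing in for a retriever-built one."""
    return [c for c in chunks if c.metadata.get("tag") == tag]


if __name__ == "__main__":
    for chunk in chunk_semantic_xml(SAMPLE_XML, source="lease.xml"):
        print(chunk.metadata["xpath"], "->", chunk.text)
```

The `tag` and `xpath` fields are the kind of per-chunk metadata a self-querying retriever could filter on. In this sketch the filter is written by hand rather than derived from a natural-language query.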

I ran the following locally:

make format
make lint
poetry run pytest tests/integration_tests/document_loaders/test_docugami.py

Here is an example of a result that is not possible without the capabilities added by Docugami (from the notebook):

May 10 '23 19:05 tjaffri

Thank you in advance for your review @dev2049 @eyurtsev. I just marked this draft PR ready for review, so please let me know if I can add any other details or if there is any feedback I can resolve quickly.

May 11 '23 09:05 tjaffri

@dev2049 @eyurtsev I would really appreciate your feedback on this. I'm excited to hear what you think and would love to move quickly to address any feedback you may have.

Also, @hwchase17, I have been watching your recent posts about the self-querying retriever on Twitter, e.g. this one from yesterday: https://twitter.com/hwchase17/status/1656791492579713024. You might want to check out the notebook in this PR, which uses metadata from other parts of the document via the self-querying retriever.

Also, I noticed you will be speaking about retrieval on June 8: https://twitter.com/zilliz_universe/status/1657084479993683969 and we would love to help however we can. Docugami was created by document experts, including Jean Paoli, who is one of the authors of the XML standard and drove the creation of the OpenXML format (i.e. docx), and we would love to chat about the space.

May 12 '23 19:05 tjaffri

@tjaffri thanks for submitting the PR! I'll review on Monday (maybe before if I find enough time). Apologies for the delay!

May 13 '23 14:05 eyurtsev

Merged on separate branch

May 15 '23 14:05 eyurtsev
