Docugami DataLoader
Adds a document loader for Docugami
Specifically:
- Adds a data loader that talks to the Docugami API to download processed documents as semantic XML
- Parses the semantic XML into chunks, with additional metadata capturing chunk semantics
- Adds a detailed notebook showing how you can use additional metadata returned by Docugami for techniques like the self-querying retriever
- Adds an integration test and related documentation
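To make the chunking idea above concrete, here is a minimal sketch of turning semantic XML into text chunks that carry metadata about where they came from. It is not the loader added in this PR: the sample XML, its tag names, the `Chunk` class, and `parse_chunks` are all hypothetical and use only the Python standard library, and real Docugami output will look different.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical semantic XML. These tag names are invented for illustration
# and are not Docugami's actual schema.
SAMPLE_XML = """
<Lease>
  <Parties>
    <Landlord>Acme Properties LLC</Landlord>
    <Tenant>Birch Street Bakery</Tenant>
  </Parties>
  <Terms>
    <RentAmount>2500 USD per month</RentAmount>
  </Terms>
</Lease>
"""


@dataclass
class Chunk:
    """A piece of text plus metadata describing its place in the document."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_chunks(xml_text: str) -> List[Chunk]:
    """Emit one chunk per leaf element, recording its tag and path."""
    root = ET.fromstring(xml_text)
    chunks: List[Chunk] = []

    def walk(node: ET.Element, path: List[str]) -> None:
        current = path + [node.tag]
        children = list(node)
        if not children:
            # Leaf element: its text becomes a chunk if it is non-empty.
            text = (node.text or "").strip()
            if text:
                chunks.append(
                    Chunk(text, {"tag": node.tag, "xpath": "/" + "/".join(current)})
                )
            return
        for child in children:
            walk(child, current)

    walk(root, [])
    return chunks


if __name__ == "__main__":
    for chunk in parse_chunks(SAMPLE_XML):
        print(chunk.metadata["xpath"], "->", chunk.text)
```

Metadata like the `tag` field is what lets a retriever filter on document structure, for example selecting only chunks whose tag is `RentAmount`, rather than relying on text similarity alone.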
I ran the following locally:
- make format
- make lint
- poetry run pytest tests/integration_tests/document_loaders/test_docugami.py
The notebook includes an example of a result that is not possible without the capabilities added by Docugami.
Thank you in advance for your review, @dev2049 @eyurtsev. I just marked this draft PR as ready for review, so please let me know if I can add any other details or if there is any feedback I can resolve quickly.
@dev2049 @eyurtsev, I would really appreciate your feedback on this. I'm excited to hear what you think and would love to move quickly to address any feedback you may have.
Also, @hwchase17, I have been following your recent posts about the self-querying retriever on Twitter, e.g. this one from yesterday: https://twitter.com/hwchase17/status/1656791492579713024. You might want to check out the notebook in this PR, which uses metadata from other parts of the document via the self-querying retriever.
I also noticed you will be speaking about retrieval on June 8: https://twitter.com/zilliz_universe/status/1657084479993683969. We would love to help however we can. Docugami was created by document experts, including Jean Paoli, who is one of the authors of the XML standard and drove the creation of the OpenXML format (i.e. docx), and we would love to chat about the space.
@tjaffri thanks for submitting the PR! I'll review on Monday (maybe before if I find enough time). Apologies for the delay!
Merged on a separate branch.