Docugami DataLoader

Open tjaffri opened this issue 1 year ago • 3 comments

Adds a document loader for Docugami

Specifically:

  1. Adds a data loader that talks to the Docugami API to download processed documents as semantic XML
  2. Parses the semantic XML into chunks, with additional metadata capturing chunk semantics
  3. Adds a detailed notebook showing how you can use additional metadata returned by Docugami for techniques like the self-querying retriever
  4. Adds an integration test, and related documentation
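
To make step 2 concrete, here is a minimal sketch of turning semantic XML into chunks that carry their semantic tag as metadata. It is not the loader's actual implementation: the sample XML, the tag names (`Lease`, `Landlord`, and so on), the `Chunk` class, the `chunk_semantic_xml` helper, and the `tag`/`xpath` metadata keys are all assumptions made up for illustration, and the real Docugami output may look quite different.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical semantic XML. The tag names are invented for this sketch and
# are NOT taken from real Docugami output.
SAMPLE_XML = """
<Lease>
  <Parties>
    <Landlord>Acme Properties LLC</Landlord>
    <Tenant>Widget Co.</Tenant>
  </Parties>
  <Term>
    <CommencementDate>January 1, 2023</CommencementDate>
  </Term>
</Lease>
"""


@dataclass
class Chunk:
    """A piece of text plus metadata describing where it came from."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def chunk_semantic_xml(xml_text: str) -> List[Chunk]:
    """Emit one chunk per leaf element, tagged with its element name and path."""
    root = ET.fromstring(xml_text.strip())
    chunks: List[Chunk] = []

    def walk(node: ET.Element, path: List[str]) -> None:
        current = path + [node.tag]
        text = (node.text or "").strip()
        # Only leaf elements with text become chunks; containers are traversed.
        if len(node) == 0 and text:
            chunks.append(
                Chunk(
                    text=text,
                    metadata={"tag": node.tag, "xpath": "/" + "/".join(current)},
                )
            )
        for child in node:
            walk(child, current)

    walk(root, [])
    return chunks


if __name__ == "__main__":
    for chunk in chunk_semantic_xml(SAMPLE_XML):
        print(chunk.metadata["xpath"], "->", chunk.text)
```

Metadata of this kind is what lets a retriever filter on chunk semantics (for example, only chunks tagged `Landlord`) rather than relying on text similarity alone.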

I ran the following locally:

  • make format
  • make lint
  • poetry run pytest tests/integration_tests/document_loaders/test_docugami.py

Here is an example of a result that is not possible without the capabilities added by Docugami (from the notebook):

[image: example result from the notebook]

tjaffri • May 10 '23 19:05

Thank you in advance for your review @dev2049 @eyurtsev. I just marked this draft PR ready for review, so please let me know if I can add any other details, or any feedback I can resolve quickly.

tjaffri • May 11 '23 09:05

@dev2049 @eyurtsev I would really appreciate your feedback on this. I'm excited to hear what you think and would love to move quickly to address any feedback you may have.

Also, @hwchase17, I have been following your recent posts on Twitter about the self-querying retriever, e.g. this one from yesterday: https://twitter.com/hwchase17/status/1656791492579713024 so you might want to check out the notebook in this PR, which uses metadata from other parts of the document via the self-querying retriever.

Also, I noticed you will be speaking about retrieval on June 8 (https://twitter.com/zilliz_universe/status/1657084479993683969), and we would love to help however we can. Docugami was created by document experts, including Jean Paoli, who is one of the authors of the XML standard and drove the creation of the OpenXML format (i.e. docx). We would love to chat about the space.

tjaffri • May 12 '23 19:05

@tjaffri thanks for submitting the PR! I'll review on Monday (maybe before if I find enough time). Apologies for the delay!

eyurtsev • May 13 '23 14:05

Merged on separate branch

eyurtsev • May 15 '23 14:05