community[feat]: Add Apache Opendal S3 document loader

Open liugddx opened this issue 1 year ago • 2 comments

  • Description: Add Apache Opendal S3 document loader

OpenDAL stands for Open Data Access Layer. Our vision is to access data freely. It can serve as a unifying layer for object storage and file storage. I will add support for other types of storage in subsequent PRs.

https://opendal.apache.org/

@baskaryan, @efriis, @eyurtsev, @ccurme, @vbarda, @hwchase17.

liugddx avatar Aug 25 '24 14:08 liugddx

@baskaryan PTAL

liugddx avatar Sep 10 '24 22:09 liugddx

hey there! sorry for the delay on this one. Not mergeable as-is, and I have 1 question and 1 suggestion

  • question: what is the difference between this and an S3 loader?
  • suggestion: implement this as a BaseBlobLoader instead, which can be used with unstructured as a blobparser

let me know if that works for you!

efriis avatar Dec 11 '24 01:12 efriis

Thank you for taking the time to review my PR. OpenDAL is an Open Data Access Layer that enables seamless interaction with diverse storage services.

In fact, I want to use OpenDAL to unify all my storage access.

liugddx avatar Dec 11 '24 01:12 liugddx

sure but nothing is passed into Operator that's unrelated to unstructured and S3, so I'm curious what this does differently than just loading these files from S3 and passing them to unstructured

answer to the previous question was a bit vague, and without clarity on this I'll close the PR later this week. Let me know if there's something else that I'm missing!

efriis avatar Dec 12 '24 02:12 efriis

  • Metrics, logging, and tracing can be added through OpenDAL's layers: https://opendal.apache.org/docs/vision#4-extensible-architecture
  • No vendor lock-in
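
For readers unfamiliar with the layer idea, here is a minimal stdlib-only sketch of how wrappers can add logging and metrics around a storage backend without changing the caller. Every name in it (`Storage`, `InMemoryStorage`, `LoggingLayer`, `MetricsLayer`) is hypothetical and exists only for illustration; this is not OpenDAL's API, just the general wrapping pattern.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Storage(Protocol):
    """Hypothetical minimal storage interface: read a whole object as bytes."""

    def read(self, path: str) -> bytes: ...


@dataclass
class InMemoryStorage:
    """Stand-in backend so the sketch runs without any cloud credentials."""

    objects: dict[str, bytes]

    def read(self, path: str) -> bytes:
        return self.objects[path]


@dataclass
class LoggingLayer:
    """Wraps any Storage and records each read before delegating to it."""

    inner: Storage
    log: list[str] = field(default_factory=list)

    def read(self, path: str) -> bytes:
        self.log.append(f"read {path}")
        return self.inner.read(path)


@dataclass
class MetricsLayer:
    """Wraps any Storage and counts the bytes returned by reads."""

    inner: Storage
    bytes_read: int = 0

    def read(self, path: str) -> bytes:
        data = self.inner.read(path)
        self.bytes_read += len(data)
        return data


# Stack the layers; the caller only ever sees the Storage interface.
backend = InMemoryStorage({"docs/a.txt": b"hello"})
logged = LoggingLayer(backend)
store = MetricsLayer(logged)
data = store.read("docs/a.txt")
```

Swapping `InMemoryStorage` for another backend with the same `read` method would leave the layers and the calling code unchanged, which is the sense of "no vendor lock-in" in this sketch.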

liugddx avatar Dec 12 '24 03:12 liugddx

We can use OpenDAL to unify all document loaders; that is, one document reader covers all third-party systems.

liugddx avatar Dec 13 '24 09:12 liugddx

Got it. This adds a net-new integration or feature to the community package, and that path has been replaced by dedicated integration packages. I'll close this PR, and would recommend reopening with just docs updates, as well as registering your package in libs/packages.yml! We'll be able to review simple PRs that only modify these two things much faster :)

Here's the guide, and if you have questions, feel free to leave them in the comments on those pages so others can see them! https://python.langchain.com/docs/contributing/how_to/integrations/

Before doing so, I would highly recommend looking into implementing the BaseBlobLoader abstraction instead of requiring unstructured for PDF parsing. From what I understand, OpenDAL should be used to load bytes, and there's no reason it should be tied to unstructured for parsing. Folks can use BaseBlobParser objects with the GenericLoader, which pairs a blob loader with a blob parser.
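
A rough sketch of that separation, using only the standard library so it runs anywhere. The classes below are simplified stand-ins that loosely mirror the shape of the abstractions named above; they are not LangChain's or OpenDAL's classes. `InMemoryStorage`, `StorageBlobLoader`, `Utf8TextParser`, and `PairedLoader` are hypothetical names, and the method names `yield_blobs`, `lazy_parse`, and `lazy_load` are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Blob:
    """Raw bytes plus the path they came from (stand-in type)."""

    path: str
    data: bytes


@dataclass
class Document:
    """Parsed text plus metadata (stand-in type)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class InMemoryStorage:
    """Hypothetical backend exposing only list() and read()."""

    def __init__(self, objects: dict):
        self.objects = dict(objects)

    def list(self, prefix: str) -> list:
        return sorted(p for p in self.objects if p.startswith(prefix))

    def read(self, path: str) -> bytes:
        return self.objects[path]


class StorageBlobLoader:
    """Loads bytes only; knows nothing about how they will be parsed."""

    def __init__(self, storage, prefix: str = ""):
        self.storage = storage
        self.prefix = prefix

    def yield_blobs(self) -> Iterator[Blob]:
        for path in self.storage.list(self.prefix):
            yield Blob(path=path, data=self.storage.read(path))


class Utf8TextParser:
    """Parses bytes only; knows nothing about where they were stored."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(
            page_content=blob.data.decode("utf-8"),
            metadata={"source": blob.path},
        )


class PairedLoader:
    """Pairs one blob loader with one blob parser."""

    def __init__(self, blob_loader, blob_parser):
        self.blob_loader = blob_loader
        self.blob_parser = blob_parser

    def lazy_load(self) -> Iterator[Document]:
        for blob in self.blob_loader.yield_blobs():
            yield from self.blob_parser.lazy_parse(blob)


storage = InMemoryStorage(
    {"docs/a.txt": b"alpha", "docs/b.txt": b"beta", "img/c.png": b"\x89PNG"}
)
loader = PairedLoader(StorageBlobLoader(storage, prefix="docs/"), Utf8TextParser())
docs = list(loader.lazy_load())
```

With this split, a different parser (for example one for PDFs) or a different storage backend can be swapped in independently, since the loader and parser only meet at the `Blob` type.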

efriis avatar Dec 13 '24 22:12 efriis