langchain
community[feat]: Add Apache Opendal S3 document loader
- Description: Add Apache Opendal S3 document loader
OpenDAL stands for Open Data Access Layer. Our vision is to access data freely. It can serve as a unifying layer for object storage and file storage. I will add support for other storage types in subsequent PRs.
https://opendal.apache.org/
@baskaryan, @efriis, @eyurtsev, @ccurme, @vbarda, @hwchase17.
@baskaryan PTAL
hey there! sorry for the delay on this one. Not mergeable as-is, and I have 1 question and 1 suggestion:
- question: what is the difference between this and an S3 loader?
- suggestion: implement this as a `BaseBlobLoader` instead, which can be used with unstructured as a blob parser

let me know if that works for you!
Thank you for taking the time to review my PR. OpenDAL is an Open Data Access Layer that enables seamless interaction with diverse storage services.
In fact, I want to use OpenDAL to unify all my storage access.
sure, but nothing is passed into `Operator` that's unrelated to unstructured and S3, so I'm curious what this does differently than just loading these files from S3 and passing them to unstructured.
The answer to the previous question was a bit vague, and without clarity on this I'll close the PR later this week. Let me know if there's something else that I'm missing!
- Metrics, logging, and tracing can be added through OpenDAL's layers: https://opendal.apache.org/docs/vision#4-extensible-architecture
- No vendor lock-in
We can use OpenDAL to unify all document loaders; that is, one document reader covers all third-party systems.
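
To make the two points above concrete (layers and no vendor lock-in), here is a rough standard-library sketch of the idea. `Backend`, `MemoryBackend`, `LoggingLayer`, and `load_texts` are hypothetical names made up for illustration; they are not OpenDAL's or LangChain's API, and the in-memory store only stands in for a real service such as S3.

```python
from typing import Dict, Iterator, List, Protocol


class Backend(Protocol):
    """Hypothetical minimal storage interface (stand-in for a unified access layer)."""

    def list(self, prefix: str) -> Iterator[str]: ...

    def read(self, path: str) -> bytes: ...


class MemoryBackend:
    """In-memory backend; an S3 or filesystem backend would expose the same two methods."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = dict(files)

    def list(self, prefix: str) -> Iterator[str]:
        # Return matching paths in a stable (sorted) order.
        return iter(sorted(p for p in self._files if p.startswith(prefix)))

    def read(self, path: str) -> bytes:
        return self._files[path]


class LoggingLayer:
    """Wraps any backend and records each call; the reader code does not change."""

    def __init__(self, inner: Backend, log: List[str]) -> None:
        self._inner = inner
        self._log = log

    def list(self, prefix: str) -> Iterator[str]:
        self._log.append(f"list {prefix}")
        return self._inner.list(prefix)

    def read(self, path: str) -> bytes:
        self._log.append(f"read {path}")
        return self._inner.read(path)


def load_texts(backend: Backend, prefix: str = "") -> List[str]:
    """Backend-agnostic reader: depends only on list() and read()."""
    return [backend.read(path).decode("utf-8") for path in backend.list(prefix)]


# Assumed sample data, purely for illustration.
log: List[str] = []
store = LoggingLayer(
    MemoryBackend({"docs/a.txt": b"alpha", "docs/b.txt": b"beta", "img/c.png": b"x"}),
    log,
)
texts = load_texts(store, prefix="docs/")
```

Swapping `MemoryBackend` for another backend, or stacking another layer around it, would leave `load_texts` untouched, which is the property being argued for here.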
Got it. This adds a net-new community integration or feature, which has been replaced by dedicated integration packages. I'll close this PR, and would recommend reopening with just docs updates, as well as registering your package in libs/packages.yml! We'll be able to review simple PRs that only modify these two things much faster :)
Here's the guide, and if you have questions, feel free to leave them in the comments on those pages so others can see them! https://python.langchain.com/docs/contributing/how_to/integrations/
Before doing so, I would highly recommend looking into implementing the `BaseBlobLoader` abstraction instead of requiring unstructured for PDF parsing. From what I understand, OpenDAL should be used to load bytes, and there's no reason it should be tied to unstructured for parsing. Folks can use `BaseBlobParser` objects with the `GenericLoader` that pairs one of each.
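
As a rough illustration of that split, here is a standard-library sketch of "one object loads bytes, another parses them, a third pairs the two". Every class below (`Blob`, `Document`, `BytesBlobLoader`, `TextParser`, `PairedLoader`) is a simplified, hypothetical stand-in written for this comment, not the actual LangChain classes, and the `read` callable is an assumed placeholder for whatever fetches bytes from storage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass
class Blob:
    """Raw bytes plus where they came from (simplified stand-in)."""

    data: bytes
    source: str


@dataclass
class Document:
    """Parsed text plus metadata (simplified stand-in)."""

    page_content: str
    metadata: Dict[str, str]


class BytesBlobLoader:
    """Plays the blob-loader role: yields blobs and knows nothing about parsing."""

    def __init__(self, read: Callable[[str], bytes], paths: List[str]) -> None:
        self._read = read  # assumed: any function mapping a path to bytes
        self._paths = paths

    def yield_blobs(self) -> Iterator[Blob]:
        for path in self._paths:
            yield Blob(data=self._read(path), source=path)


class TextParser:
    """Plays the blob-parser role: turns one blob into documents."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        yield Document(
            page_content=blob.data.decode("utf-8"),
            metadata={"source": blob.source},
        )


class PairedLoader:
    """Plays the generic-loader role: pairs one blob loader with one parser."""

    def __init__(self, blob_loader: BytesBlobLoader, blob_parser: TextParser) -> None:
        self._blob_loader = blob_loader
        self._blob_parser = blob_parser

    def lazy_load(self) -> Iterator[Document]:
        for blob in self._blob_loader.yield_blobs():
            yield from self._blob_parser.lazy_parse(blob)


# Assumed sample storage: a dict lookup stands in for a real byte reader.
fake_storage = {"a.txt": b"hello", "b.txt": b"world"}
loader = PairedLoader(
    BytesBlobLoader(read=fake_storage.__getitem__, paths=["a.txt", "b.txt"]),
    TextParser(),
)
docs = list(loader.lazy_load())
```

With this shape, a PDF parser (unstructured or otherwise) could replace `TextParser` without the byte-loading side changing at all.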