langchain
S3DirectoryLoader throws IsADirectoryError exception when prefix is a directory
System Info
In langchain 0.0.161, if I call S3DirectoryLoader(bucket, prefix) where prefix is a "folder name", e.g. "documents/", you get an IsADirectoryError: [Errno 21] Is a directory exception.
It looks like the code filters the bucket objects by prefix, and the filter returns the folder itself as well as the files inside it. The S3FileLoader then tries to download the folder, and that is what causes the exception:
s3 = boto3.resource("s3")
bucket = s3.Bucket(self.bucket)
docs = []
for obj in bucket.objects.filter(Prefix=self.prefix):
    loader = S3FileLoader(self.bucket, obj.key)
    docs.extend(loader.load())
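To illustrate why a key such as "documents/" can end in this error, here is a minimal sketch that needs neither boto3 nor AWS. It assumes the file loader builds a local path from the object key inside a temporary directory and then writes to it; that is an assumption about the loader's internals, not a quote of the langchain source, and the helper name `try_write_to_key_path` is hypothetical.

```python
import os
import tempfile
from typing import Optional


def try_write_to_key_path(key: str) -> Optional[OSError]:
    """Mimic a download of `key` into a temp directory (assumed loader behaviour).

    Returns the OSError raised while opening the target path, or None on success.
    """
    with tempfile.TemporaryDirectory() as temp_dir:
        # Assumption: the local path is the temp dir joined with the raw S3 key.
        file_path = os.path.join(temp_dir, key)
        # For "documents/" this creates the directory <temp_dir>/documents.
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        try:
            # Writing to a path that is itself a directory fails.
            with open(file_path, "wb") as fh:
                fh.write(b"placeholder content")
        except OSError as err:
            return err
    return None


folder_error = try_write_to_key_path("documents/")
file_error = try_write_to_key_path("documents/report.txt")
print(type(folder_error).__name__ if folder_error else "no error")
print(type(file_error).__name__ if file_error else "no error")
```

On Linux and macOS the folder-style key should fail with IsADirectoryError, while the regular key is written without an error; other platforms may report a different OSError subclass.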
Who can help?
No response
Information
- [ ] The official example notebooks/scripts
- [X] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [X] Document Loaders
- [ ] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
- Call S3DirectoryLoader(bucket, prefix) using a prefix that is a directory
- See the exception being thrown
Expected behavior
The documents inside the folder should be loaded. The folder itself should be ignored.
The filtering comes from the boto3 library, which langchain has nothing to do with. You can check in your loop whether the object has size 0 or its key ends with /.
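A minimal sketch of that check, written so it runs without AWS access: `FakeObject` is a hypothetical stand-in for the objects yielded by `bucket.objects.filter(...)`, using only the `key` and `size` attributes, and the helper names are illustrative rather than part of langchain or boto3.

```python
from collections import namedtuple
from typing import Iterable, Iterator

# Hypothetical stand-in for a boto3 object summary; only .key and .size are used.
FakeObject = namedtuple("FakeObject", ["key", "size"])


def is_folder_placeholder(obj) -> bool:
    """Return True for entries that look like "folders" rather than files."""
    # Folder markers typically have a key ending in "/" and a size of 0.
    return obj.key.endswith("/") or obj.size == 0


def iter_loadable(objects: Iterable) -> Iterator:
    """Yield only the objects that should be handed to a file loader."""
    for obj in objects:
        if not is_folder_placeholder(obj):
            yield obj


listing = [
    FakeObject("documents/", 0),
    FakeObject("documents/a.txt", 12),
    FakeObject("documents/sub/", 0),
    FakeObject("documents/sub/b.pdf", 2048),
]
loadable_keys = [obj.key for obj in iter_loadable(listing)]
print(loadable_keys)
```

Note that the size check also skips genuinely empty files. If those should still be loaded, testing only for the trailing "/" is the narrower option.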
Please see the pull request I generated that mitigates this problem. I can't check in the loop, since I'm calling S3DirectoryLoader and the loop is inside langchain code. If a folder prefix is provided, the loader will throw an exception.
Hi, @markeste! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you reported is related to the S3DirectoryLoader in langchain 0.0.161. It seems that the code is filtering the bucket objects and including the folder itself, which causes an exception when the S3FileLoader tries to download the folder. PawelFaron suggested a solution to check if the objects have size 0 or end with "/", and you have generated a pull request to address the issue. Both sousanunes and jonirautiainenhubble have given a thumbs up reaction to your pull request.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution and your understanding. Let us know if you have any further questions or concerns!
This issue is still relevant today. It is also a duplicate of https://github.com/langchain-ai/langchain/issues/6535.
@baskaryan Could you please help @bensaine with this issue? They mentioned that it is still relevant today. Thank you!
Hi @hwchase17, this issue still exists.
Hi, @markeste,
I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the S3DirectoryLoader in langchain 0.0.161 was throwing an exception when the prefix was a directory, leading to an IsADirectoryError. A pull request has been generated to address this issue and has received positive feedback from other contributors. The LangChain team has marked the issue as stale and is seeking confirmation on its relevance to the latest repository version.
Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.