S3DirectoryLoader throws IsADirectoryError when prefix is a directory

Open markeste opened this issue 2 years ago • 2 comments

System Info

In langchain 0.0.161, if I call S3DirectoryLoader(bucket, prefix) where prefix is a "folder name", e.g. "documents/", I get an IsADirectoryError: [Errno 21] Is a directory exception.

It looks like the code filters the bucket objects by prefix, and the filter returns the folder itself along with the files. The S3FileLoader then tries to download the folder, and that is what causes the exception:

        s3 = boto3.resource("s3")
        bucket = s3.Bucket(self.bucket)
        docs = []
        for obj in bucket.objects.filter(Prefix=self.prefix):
            loader = S3FileLoader(self.bucket, obj.key)
            docs.extend(loader.load())

Who can help?

No response

Information

  • [ ] The official example notebooks/scripts
  • [X] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [X] Document Loaders
  • [ ] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

  1. Call S3DirectoryLoader(bucket, prefix) using a prefix that is a directory
  2. See the exception being thrown

Expected behavior

The documents inside the folder should be loaded. The folder itself should be ignored.

markeste avatar May 08 '23 08:05 markeste

The filtering comes from the boto3 library, which langchain has nothing to do with. You can check in your loop whether the obj has size 0 or its key ends with /.

PawelFaron avatar May 08 '23 09:05 PawelFaron
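The check suggested above can be sketched as follows. This is a minimal illustration, not the langchain implementation: `ObjectInfo`, `is_folder_placeholder`, and `iter_file_objects` are hypothetical names, and the listing is hard-coded sample data standing in for what `bucket.objects.filter(Prefix=...)` might return. The sketch tests only for a trailing "/", on the assumption that a size-0 test alone would also skip legitimately empty files.

```python
from typing import Iterable, Iterator, NamedTuple


class ObjectInfo(NamedTuple):
    """Hypothetical stand-in for an S3 object summary (key and size only)."""

    key: str
    size: int


def is_folder_placeholder(key: str) -> bool:
    """Return True when the key ends with "/", i.e. it names a "folder"."""
    return key.endswith("/")


def iter_file_objects(objects: Iterable[ObjectInfo]) -> Iterator[ObjectInfo]:
    """Yield only the objects whose keys look like real files."""
    for obj in objects:
        if not is_folder_placeholder(obj.key):
            yield obj


# Sample listing (assumed data): the "folder" keys appear next to the files.
listing = [
    ObjectInfo("documents/", 0),
    ObjectInfo("documents/a.txt", 12),
    ObjectInfo("documents/sub/", 0),
    ObjectInfo("documents/sub/b.pdf", 2048),
]

file_keys = [obj.key for obj in iter_file_objects(listing)]
print(file_keys)
```

With this filter in place, only the two file keys would be passed on to a per-file loader.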

Please see the pull request I generated that mitigates this problem. I can't check in the loop, since I'm calling S3DirectoryLoader and the loop is inside langchain code. If a folder prefix is provided, the loader will throw an exception.

markeste avatar May 09 '23 08:05 markeste
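Until a fix lands in the library, one possible caller-side workaround is to bypass S3DirectoryLoader and run the loop yourself. The sketch below is an assumption-laden illustration: `load_files_under_prefix` and `FakeLoader` are hypothetical names, and the loader is injected so the example runs without AWS credentials. In real use, `objects` could be `bucket.objects.filter(Prefix=prefix)` and `make_loader` could be `lambda key: S3FileLoader(bucket_name, key)`, matching the two-argument call shown in the issue.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List


def load_files_under_prefix(
    objects: Iterable[Any],
    make_loader: Callable[[str], Any],
) -> List[Any]:
    """Load every listed object except keys ending in "/".

    objects: items exposing a .key attribute (for example boto3 object summaries).
    make_loader: hypothetical factory returning something with a .load() method.
    """
    docs: List[Any] = []
    for obj in objects:
        if obj.key.endswith("/"):
            # Skip the "folder" entry that would otherwise be downloaded.
            continue
        docs.extend(make_loader(obj.key).load())
    return docs


class FakeLoader:
    """Stand-in for S3FileLoader so the sketch runs offline."""

    def __init__(self, key: str) -> None:
        self.key = key

    def load(self) -> List[str]:
        return [f"doc from {self.key}"]


fake_listing = [
    SimpleNamespace(key="documents/"),
    SimpleNamespace(key="documents/a.txt"),
]
loaded = load_files_under_prefix(fake_listing, FakeLoader)
print(loaded)
```

Injecting the loader keeps the skip logic testable on its own; swapping in the real S3FileLoader should not require changes to the function, assuming its `load()` returns a list of documents as in the excerpt above.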

Hi, @markeste! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is related to the S3DirectoryLoader in langchain 0.0.161. It seems that the code is filtering the bucket objects and including the folder itself, which causes an exception when the S3FileLoader tries to download the folder. PawelFaron suggested a solution to check if the objects have size 0 or end with "/", and you have generated a pull request to address the issue. Both sousanunes and jonirautiainenhubble have given a thumbs up reaction to your pull request.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution and your understanding. Let us know if you have any further questions or concerns!

dosubot[bot] avatar Sep 12 '23 16:09 dosubot[bot]

This issue is still relevant today. It is also a duplicate of https://github.com/langchain-ai/langchain/issues/6535.

bensaine avatar Sep 28 '23 20:09 bensaine

@baskaryan Could you please help @bensaine with this issue? They mentioned that it is still relevant today. Thank you!

dosubot[bot] avatar Sep 28 '23 20:09 dosubot[bot]

Hi @hwchase17, this issue still exists.

VpkPrasanna avatar Oct 25 '23 13:10 VpkPrasanna

Hi, @markeste,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the S3DirectoryLoader in langchain 0.0.161 was throwing an exception when the prefix was a directory, leading to an IsADirectoryError. A pull request has been generated to address this issue and has received positive feedback from other contributors. The LangChain team has marked the issue as stale and is seeking confirmation on its relevance to the latest repository version.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

I

dosubot[bot] avatar Feb 06 '24 16:02 dosubot[bot]