
FileNotFound for s3 file with / in the key

nick-amplify opened this issue 1 year ago • 11 comments

Hello! I am trying to read a CSV stored in a remote S3 bucket with a double slash in the key, like this: 's3://mybucket/path/7/update//nicktorba.part_00000'

When I run this code, I get a FileNotFoundError (as far as I can tell, this is the code pandas read_csv uses under the hood to read the file):

file_obj = fsspec.open(
    filepath, mode="rb", **(storage_options or {})
).open()

However, this code runs successfully:

import boto3
import pandas as pd

s3_client = boto3.client(
    's3',
    aws_access_key_id=storage_options["key"],
    aws_secret_access_key=storage_options["secret"],
)

s3_object = s3_client.get_object(
    Bucket="mybucket",
    Key="path/7/update//nicktorba.part_00000"
)

df = pd.read_csv(s3_object['Body'], nrows=5)

Is there any way I can update the arguments to fsspec.open in the first code snippet so that it reads the file successfully? I'm positive the file exists and my access is set up correctly, because the second snippet works.

Thank you!

(side note: I am unfortunately not the one in charge of the file naming, so removing the slash isn't an option at the moment)

nick-amplify avatar Jul 10 '23 21:07 nick-amplify

You suspect that it is the double slash "//" that is the problem? Do other paths in the same bucket work?

The first thing I would do is turn on s3fs logging to see exactly what calls are being made. One way:

fsspec.utils.setup_logging(logger_name="s3fs")
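
For example, a minimal sketch putting that together (hedged: the bucket and key are the example values from this thread, and storage_options is assumed to hold the credentials as in the snippets above):

import fsspec
import fsspec.utils

# Send s3fs debug messages to the console so every S3 call is visible
fsspec.utils.setup_logging(logger_name="s3fs")

# Retry the failing open; the log output shows which S3 operations run
with fsspec.open(
    "s3://mybucket/path/7/update//nicktorba.part_00000",
    mode="rb",
    **storage_options,
) as f:
    f.read(100)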

martindurant avatar Jul 11 '23 13:07 martindurant

@martindurant Here are the logs:

2023-07-11 09:22:35,439 - s3fs - DEBUG - connect -- Setting up s3fs instance
2023-07-11 09:22:35,439 - s3fs - DEBUG - Setting up s3fs instance
2023-07-11 09:22:35,509 - s3fs - DEBUG - _lsdir -- Get directory listing page for mybucket/path/7/update
2023-07-11 09:22:35,509 - s3fs - DEBUG - Get directory listing page for mybucket/path/7/update
2023-07-11 09:22:36,422 - s3fs - DEBUG - _lsdir -- Get directory listing page for mybucket/path/7/update//nicktorba.part_00000
2023-07-11 09:22:36,422 - s3fs - DEBUG - Get directory listing page for mybucket/path/7/update//nicktorba.part_00000

Below that, it throws the same FileNotFoundError.

Other paths in the same bucket work as expected.

nick-amplify avatar Jul 11 '23 13:07 nick-amplify

Hm, s3fs should not be calling LIST upon open, but HEAD. It can be the case that you have permissions for one but not the other, and for listing, the "/" character is indeed special. What version of s3fs are you using?
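
To illustrate the difference, here is a hedged sketch using boto3 directly (the bucket and key are the example values from this thread):

import boto3

client = boto3.client("s3")

# HEAD: metadata lookup for one exact key, so the doubled slash is
# matched literally and the object is found
client.head_object(Bucket="mybucket", Key="path/7/update//nicktorba.part_00000")

# LIST: with Delimiter="/", the "/" acts as a hierarchy separator, so the
# "//" produces an empty "directory" level reported as a common prefix
resp = client.list_objects_v2(
    Bucket="mybucket", Prefix="path/7/update/", Delimiter="/"
)
print([p["Prefix"] for p in resp.get("CommonPrefixes", [])])  # 'path/7/update//'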

martindurant avatar Jul 11 '23 13:07 martindurant

@martindurant current s3fs.__version__ is '0.4.0'

I just updated to 2023.6.0 and it seems to hit the same problem.

Also, I get a FileNotFoundError when directly calling head on s3fs:

s3 = s3fs.S3FileSystem(
    key=storage_options["key"],
    secret=storage_options["secret"]
)
s3.head(filepath)

I'm not sure if that is the operation you meant when you said it should be called instead of list.

nick-amplify avatar Jul 11 '23 13:07 nick-amplify

No, that's a different HEAD :| (s3fs's head() reads the first bytes of a file; the HTTP HEAD request is made internally during the metadata lookup.)
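
For reference, a hedged sketch of calls that do go through that metadata lookup, reusing the s3 instance and filepath from the snippet above:

# info() and exists() resolve object metadata (a HEAD request under the
# hood), unlike head(), which reads file contents
print(s3.info(filepath))
print(s3.exists(filepath))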

The following test fails, so there is indeed a problem, likely in fsspec:

def test_multi_slash(s3):
    fn = "test/path//with/slash"
    s3.pipe(fn, b"data")
    files = s3.find("test/path", detail=False)
    assert fn in files
    with fsspec.open(f"s3://{fn}") as f:  # <- fails here
        assert f.read() == b"data"

martindurant avatar Jul 11 '23 13:07 martindurant

Switching fsspec.open for s3.open makes the test pass, so this is definitely in fsspec; it must be converting "//" to "/". This finding at least gives you a workaround.
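
One way to see where the "//" is lost (a diagnostic sketch, not from the thread): url_to_fs applies the same URL handling as fsspec.open, so comparing its output path against the key shows whether the collapse happens in fsspec's parsing.

from fsspec.core import url_to_fs

# Parse the URL the way fsspec.open would, without touching S3; if the
# printed path has a single slash where the key has two, fsspec collapsed it
fs, path = url_to_fs("s3://mybucket/path/7/update//nicktorba.part_00000")
print(path)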

martindurant avatar Jul 11 '23 13:07 martindurant

> Switching fsspec.open for s3.open makes the test pass, so this is definitely in fsspec; it must be converting "//" to "/". This finding at least gives you a workaround.

@martindurant What is the s3 object in that test?

Also, unfortunately, I'm hitting this error from pandas.read_csv, so I can't easily replace that code. I opened an issue on pandas as well since I wasn't sure which one would be better: https://github.com/pandas-dev/pandas/issues/54070

nick-amplify avatar Jul 11 '23 14:07 nick-amplify

s3 is the S3FileSystem instance. I understand that this should work directly with pandas.read_csv, but you could for the moment do:

import fsspec
import pandas as pd

s3 = fsspec.filesystem("s3", **storage_options)
with s3.open(path) as f:  # calling the filesystem directly avoids the URL parsing in fsspec.open
    df = pd.read_csv(f)

martindurant avatar Jul 11 '23 14:07 martindurant

> s3 is the S3FileSystem instance. I understand that this should work directly with pandas.read_csv, but you could for the moment do:
>
> s3 = fsspec.filesystem("s3", **storage_options)
> with s3.open(path) as f:
>     df = pd.read_csv(f)

What version of fsspec are you running? I have 2022.10.0 and that code fails with a FileNotFoundError as well.

nick-amplify avatar Jul 11 '23 14:07 nick-amplify

2023.6.0

martindurant avatar Jul 11 '23 14:07 martindurant

Correction: my test was wrong; things really do work with the test server. The following read of a public file succeeds:

In [19]: with fsspec.open('s3://mymdtemp/path//with/slash', anon=True) as f:
    ...:     print(f.read())
    ...:
b'hello'

martindurant avatar Jul 11 '23 14:07 martindurant