s3fs
Pagination problem when listing directories (1000 file limit)
I am prototyping an S3-compatible storage service called Open Storage Network (OSN).
I have encountered a problem with how s3fs is listing directories which appears to be related to pagination. Basically, s3fs thinks there are only 1000 objects in the directory and refuses to even try to read objects that don't show up in this initial list.
import boto3
import s3fs
assert s3fs.__version__ == '0.4.0'
# read-only credentials to bucket, okay to share publicly
access_key = "EL456I5ZRYB44RB6J7Q4"
secret_key = "QydNAjMWBTOLRjHiA36uMvhBvI4WeTxWYNJ5oaiP"
endpoint_url = "https://ncsa.osn.xsede.org"
# create boto client
s3 = boto3.client('s3',
                  aws_access_key_id=access_key,
                  aws_secret_access_key=secret_key,
                  endpoint_url=endpoint_url)
# verify credentials
assert s3.list_buckets()['Buckets'][0]['Name'] == 'Pangeo'
# list the bucket using the recommended boto pagination technique
# https://boto3.amazonaws.com/v1/documentation/api/latest/guide/paginators.html#filtering-results
paginator = s3.get_paginator('list_objects')
operation_parameters = {'Bucket': 'Pangeo',
                        'Prefix': 'cm26_control_temp.zarray'}
page_iterator = paginator.paginate(**operation_parameters)
# the directory should have 2402 objects in it
for page in page_iterator:
    print(len(page['Contents']))
# > 1000
# > 1000
# > 402
# Correctly finds all 2402 objects
print(page['Contents'][-1]['Key'])
# > 'cm26_control_temp.zarray/99.9.0.0'
# now try with s3fs
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
                       client_kwargs={'endpoint_url': endpoint_url})
listing = fs.listdir('Pangeo/cm26_control_temp.zarray')
print(len(listing))
# > 1000
# try to read a file that did not make it into the list
with fs.open('Pangeo/cm26_control_temp.zarray/99.9.0.0') as f:
    pass
# > FileNotFoundError: Pangeo/cm26_control_temp.zarray/99.9.0.0
This feels very much like a bug in s3fs. (A somewhat similar issue was noted in https://github.com/dask/s3fs/issues/253#issuecomment-557516952, including the 1000 file limit.) In fact, I would identify two distinct bugs:
- The directory listing is wrong
- s3fs is incorrectly raising a FileNotFoundError when I try to open an existing object (likely related to caching)
For the first issue, one possible hint is that the AWS CLI makes the same mistake:
aws s3 --profile osn-rw ls --recursive s3://Pangeo/cm26_control_temp.zarray/ | wc -l
# > 1000
So perhaps there is something in the metadata of the OSN service that is tricking the paginators in some circumstances.
This issue is rather important to Pangeo, as we are keen to get some accurate benchmarks on this new storage service. Help would be sincerely appreciated.
According to the boto docs, list_objects_v2 (the method s3fs uses) and the list_objects variant both say that they list up to 1000 objects only, although the paginators docs suggest that the latter returns 1000 "at a time" (which I thought was the point). Both methods take a MaxKeys parameter which has no given default.
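For reference, here is a minimal sketch (not the s3fs code path) of how list_objects_v2 is normally paginated by hand against a compliant endpoint; it reuses the bucket, prefix, and read-only credentials from the top of this issue, and the continuation token is what carries the listing past the 1000-key cap.
import boto3

# Read-only OSN credentials and endpoint from the top of this issue
access_key = "EL456I5ZRYB44RB6J7Q4"
secret_key = "QydNAjMWBTOLRjHiA36uMvhBvI4WeTxWYNJ5oaiP"
endpoint_url = "https://ncsa.osn.xsede.org"
s3 = boto3.client('s3',
                  aws_access_key_id=access_key,
                  aws_secret_access_key=secret_key,
                  endpoint_url=endpoint_url)

keys = []
kwargs = {'Bucket': 'Pangeo', 'Prefix': 'cm26_control_temp.zarray'}
while True:
    resp = s3.list_objects_v2(**kwargs)
    keys.extend(obj['Key'] for obj in resp.get('Contents', []))
    # Each response is capped at MaxKeys (1000 by default); a compliant server
    # sets IsTruncated and NextContinuationToken so the caller can fetch more.
    if not resp.get('IsTruncated'):
        break
    kwargs['ContinuationToken'] = resp['NextContinuationToken']
print(len(keys))
# > expected 2402 on a compliant endpoint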
Thanks for your reply. But I don't understand what to conclude from it. Do you think this is something that needs to be fixed in s3fs or not? The bottom line is that, using boto paginators, I am able to correctly list the objects, but with s3fs I am not. I'd be happy to make a PR if you can recommend a course of action.
I would try swapping list_objects_v2 to list_objects (your boto example above uses the latter)
Your hunch was correct. If I do paginator = s3.get_paginator('list_objects_v2'), it only gets the first 1000 results.
So this is somehow a problem with the API service?
I have no idea! I don't know why there are two versions in the first place. If the structure of what is returned by list_objects is the same, it should be simple to change the code (a rough sketch follows below).
@jacobtomlinson, since you looked recently at the botocore API, do you have extra information?
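A rough sketch of what that swap could look like, paginating with list_objects driven by Marker/NextMarker rather than a continuation token; this is only an illustration, not the s3fs implementation, and it assumes the boto client s3 and bucket from the snippets above.
# Sketch only: list_objects pagination via Marker / NextMarker,
# assuming the boto client `s3` created earlier in this thread.
keys = []
kwargs = {'Bucket': 'Pangeo', 'Prefix': 'cm26_control_temp.zarray'}
while True:
    resp = s3.list_objects(**kwargs)
    contents = resp.get('Contents', [])
    keys.extend(obj['Key'] for obj in contents)
    if not resp.get('IsTruncated') or not contents:
        break
    # NextMarker is only returned when a Delimiter is given; otherwise the
    # last key of the current page is the marker for the next request.
    kwargs['Marker'] = resp.get('NextMarker', contents[-1]['Key'])
print(len(keys))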
Ok, so I think it's a problem with the API service in OSN. Compare regular s3:
import boto3
from botocore import UNSIGNED
from botocore.client import Config
s3pub = boto3.client('s3', config=Config(signature_version=UNSIGNED))
resp = s3pub.list_objects_v2(Bucket='mur-sst', Prefix='zarr/analysed_sst')
print(list(resp.keys()))
gives
['ResponseMetadata',
'IsTruncated',
'Contents',
'Name',
'Prefix',
'MaxKeys',
'EncodingType',
'KeyCount',
'NextContinuationToken']
Now for OSN:
access_key = "EL456I5ZRYB44RB6J7Q4"
secret_key = "QydNAjMWBTOLRjHiA36uMvhBvI4WeTxWYNJ5oaiP"
endpoint_url = "https://ncsa.osn.xsede.org"
s3 = boto3.client('s3',
                  aws_access_key_id=access_key,
                  aws_secret_access_key=secret_key,
                  endpoint_url=endpoint_url)
resp = s3.list_objects_v2(Bucket='Pangeo', Prefix='cm26_control_temp.zarray',
                          ContinuationToken='string')
print(list(resp.keys()))
gives
['ResponseMetadata',
'IsTruncated',
'Contents',
'Name',
'Prefix',
'MaxKeys',
'EncodingType']
OSN does not return KeyCount and NextContinuationToken. In particular, the absence of NextContinuationToken makes it impossible to paginate the results.
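Continuing from the resp above, a quick (purely illustrative) check makes the problem explicit: the response claims to be truncated yet provides no token to continue with.
# A truncated ListObjectsV2 response from a compliant server must include
# NextContinuationToken; the OSN response above does not.
if resp.get('IsTruncated') and 'NextContinuationToken' not in resp:
    print("listing truncated at", len(resp.get('Contents', [])),
          "keys, but no continuation token was returned")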
Hmm, it is a shame that OSN doesn't return those keys; they will certainly be needed for pagination. Is this an issue you can raise with them?
@rabernat This issue got brought to my attention today (I'm one of the folks working on OSN). Ceph is being used as the backing store for the project and this seems to be an outstanding issue with the Rados Gateway implementation. I'm still looking to see if there are updates newer than 11 months ago, but the most recent information I've found so far is here
Thanks @jdmaloney for your reply! While perhaps we could manage to work around this in s3fs (say, by creating an option to use list_objects rather than list_objects_v2), my strong preference would be for an upstream fix in Ceph. But the timeline you referenced above is not encouraging. 😬
There is, however, a separate issue that I raised above that has nothing to do with OSN:
s3fs is incorrectly raising a FileNotFoundError when I try to open an existing object (likely related to caching)
Since we are using consolidated metadata for the zarr store, a directory listing should never be necessary. All the keys we need are known a priori from the metadata. @martindurant -- is there a way to bypass the automatic listing / caching that s3fs is performing? The objects are there: s3fs just needs to let me read them, rather than believing its (incorrect) cache of the directory listing.
I thought that was indeed the model - if the file is not already in the cache, a HEAD request is made. That might only be on master. Should be compared with what happens in gcsfs too.
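For what it's worth, here is a sketch of the kind of workaround being discussed, assuming the fsspec-style invalidate_cache() method is available on this s3fs version; whether the subsequent open() succeeds still depends on the library falling back to a HEAD request for keys missing from the (truncated) listing.
import s3fs

# Credentials and endpoint as at the top of the issue
access_key = "EL456I5ZRYB44RB6J7Q4"
secret_key = "QydNAjMWBTOLRjHiA36uMvhBvI4WeTxWYNJ5oaiP"
endpoint_url = "https://ncsa.osn.xsede.org"
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key,
                       client_kwargs={'endpoint_url': endpoint_url})

# Drop any cached (truncated) directory listing before opening the object
fs.invalidate_cache('Pangeo/cm26_control_temp.zarray')
with fs.open('Pangeo/cm26_control_temp.zarray/99.9.0.0') as f:
    data = f.read()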
Works:
# download file with boto client
s3.download_file('Pangeo', 'cm26_control_temp.zarray/99.9.0.0', '/dev/null')
Fails:
# download file with s3fs
fs.download('Pangeo/cm26_control_temp.zarray/99.9.0.0', '/dev/null')
# > FileNotFoundError: Pangeo/cm26_control_temp.zarray/99.9.0.0
s3fs version is '0.4.0'.
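As a further sanity check (plain boto3, reusing the s3 client from above, not s3fs), a HEAD request on the exact key should confirm the object exists, consistent with the successful download just shown:
# head_object raises a 404 ClientError if the key is missing, so a clean
# return means the object really is there despite s3fs's FileNotFoundError.
meta = s3.head_object(Bucket='Pangeo', Key='cm26_control_temp.zarray/99.9.0.0')
print(meta['ContentLength'])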
I too face the same problem, as my directories have more than 3000 files. Is there any workaround?
@smishra, are you using the latest s3fs release or master? Perhaps we need a release.
I installed it yesterday (pip install s3fs) on my CentOS image. The version it shows: 0.4.2.
Name: s3fs
Version: 0.4.2
Summary: Convenient Filesystem interface over S3
Home-page: http://github.com/dask/s3fs/
Would you be willing to try with master?
Let me try. Thanks
I tried again after recreating my environment with version 0.4.2, and it seems to work in the Python REPL. I will integrate it into my PySpark job and see if it works. Looks like I have to invalidate the cache, though.
Ceph merged ListObjectsV2 support over a year ago in this PR. The Ceph tracker issue linked above was worked around by the reporter by using ListObjects, once they noticed they had made a mistake:
Then tried to use the listobjects() function but I've made a mistake that I've used the Marker instead of the given NextMarker
If for whatever reason the Ceph cluster cannot be upgraded to a release that supports ListObjectsV2, then using ListObjects would be the workaround. If there are problems with ListObjects, then a Ceph tracker issue should be filed.
@mmgaggle Sorry we didn't update this thread, the cluster was updated back in February and we confirmed with @rabernat that everything worked and was resolved. The cluster was one dot release behind where that patch got merged in, just our luck :)
No worries, I had some colleagues bump into this same issue and there was confusion about what was the right thing to do. My comment was just as much about making sure other folks who stumble across this know what their options are. Glad to hear the cluster you were talking to got updated, and that you're in the clear! :)