
glob is inefficient: it iterates a directory that was already scanned

Open LeoQuote opened this issue 4 years ago • 11 comments

I tried the glob method and found it is too slow when there are millions of files in the directory.

It turns out that the glob method first calls the list_objects_v2 API to get all objects (every single one, folders and files alike), checks each object to see whether it is a folder, and then scans the folders.

The algorithm is correct on a traditional filesystem but inefficient on S3: a single list_objects_v2 request already returns every object, so iterating subfolders is unnecessary.

Is it possible to fix this in s3path, or can it only be fixed in pathlib?

LeoQuote avatar Jul 07 '21 11:07 LeoQuote

Hmmm, I checked the list_objects_v2 API again and found that it only produces folders when scanning a folder for the first time. This is a long-standing problem...

LeoQuote avatar Jul 07 '21 12:07 LeoQuote

I found that if I set Delimiter to '', all contents of the folder are returned. Maybe we can use this to make it more efficient.

https://github.com/liormizr/s3path/blob/a20403d49450987d0dc3955df1c4db2d2968bb7e/s3path.py#L115-L117

LeoQuote avatar Jul 07 '21 12:07 LeoQuote

I replaced the glob method with

    def glob(self, pattern):
        # requires: from fnmatch import fnmatch
        bucket_name = self.bucket
        resource, _ = self._accessor.configuration_map.get_configuration(self)
        if not bucket_name:
            for bucket in resource.buckets.filter(Prefix=str(self)):
                yield S3Path(bucket)
            return
        bucket = resource.Bucket(bucket_name)

        kwargs = {
            'Bucket': bucket_name,
            'Prefix': self._accessor.generate_prefix(self),
            'Delimiter': ''}  # empty delimiter: one flat listing, no recursion
        continuation_token = None
        while True:
            if continuation_token:
                kwargs['ContinuationToken'] = continuation_token
            response = bucket.meta.client.list_objects_v2(**kwargs)
            # 'Contents' is absent when the listing is empty
            for file in response.get('Contents', []):
                file_path = S3Path(f"/{bucket_name}/{file['Key']}")
                if fnmatch(str(file_path.relative_to(self)), pattern):
                    yield file_path
            if not response.get('IsTruncated'):
                break
            continuation_token = response.get('NextContinuationToken')

and it worked! But only for files, not for folders.
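The folders can be recovered from the flat listing itself. A minimal sketch (independent of s3path internals; `paths_with_folders` is a hypothetical helper): derive every intermediate '/'-separated prefix of each key and report it as a folder, the way pathlib would:

```python
def paths_with_folders(keys):
    """Yield every key plus each intermediate 'folder' prefix, each once.

    S3 has no real directories, but every '/'-separated prefix of a key
    can be synthesised as a folder path (ending in '/'), mimicking
    pathlib's behaviour of returning both files and directories.
    """
    seen = set()
    for key in keys:
        parts = key.split("/")
        # Emit each ancestor prefix before the key itself.
        for depth in range(1, len(parts)):
            folder = "/".join(parts[:depth]) + "/"
            if folder not in seen:
                seen.add(folder)
                yield folder
        if key not in seen:
            seen.add(key)
            yield key
```

For example, `list(paths_with_folders(["a/b/c.html", "a/d.txt"]))` yields `["a/", "a/b/", "a/b/c.html", "a/d.txt"]` — the synthetic folders appear alongside the file keys without any extra API calls.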

LeoQuote avatar Jul 07 '21 12:07 LeoQuote

Hi @LeoQuote, Sorry for the delay ...

Indeed, the glob method isn't optimized at this time. Right now the S3Path glob method uses the default pathlib algorithm, so we have room to optimize.

Now, about your suggestion: the pathlib glob is able to find folders. For example:

for path in Path.home().glob('*/'):
    print(path)

We need to come up with an implementation that keeps the pathlib behaviour (finding both folders and files...).

Do you want to keep working on it, or should I take it from here?

liormizr avatar Jul 13 '21 08:07 liormizr

Hmmm, I don't know. For my part, I just need all files like "*.html" in all subdirectories. If I use the regular pathlib style and walk through all the directories, it is expensive.

I think we can add an S3-only method, something like s3path.s3glob, to indicate that it does not behave exactly like pathlib.

You can take it from here.

LeoQuote avatar Jul 14 '21 06:07 LeoQuote

OK I'll work on something and update with progress

liormizr avatar Jul 14 '21 06:07 liormizr

Does it suck up an entire file list before globbing?

This is from pathlib.py in the python stdlib:

def _select_from(self, parent_path, is_dir, exists, scandir):
    try:
        with scandir(parent_path) as scandir_it:
            entries = list(scandir_it)
        for entry in entries:

four43 avatar Jul 15 '21 12:07 four43

@four43 yes, you are right. One of the optimizations that I want to do is to remove this list creation in the S3 implementation.
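A toy illustration of why that list creation matters (`counting_scandir` below is hypothetical, not the stdlib function): materialising the scan with `list()` consumes every entry up front, while lazy iteration consumes only what is needed — which is exactly the difference between paging an entire S3 prefix and stopping early.

```python
def counting_scandir(n):
    """Fake scandir (hypothetical): yields n entries and records how
    many were actually consumed."""
    consumed = [0]
    def entries():
        for i in range(n):
            consumed[0] += 1
            yield f"entry-{i}"
    return entries(), consumed

# Eager, like pathlib's _select_from: list() pulls every entry up front.
it, consumed = counting_scandir(100_000)
_ = list(it)[0]
eager_cost = consumed[0]  # all 100,000 entries consumed

# Lazy: pull entries one at a time and stop as soon as possible.
it, consumed = counting_scandir(100_000)
_ = next(it)
lazy_cost = consumed[0]   # a single entry consumed
```

The eager path touches all 100,000 entries just to reach the first one; the lazy path touches one.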

liormizr avatar Jul 15 '21 13:07 liormizr

Oh man, python why!? That's a mega bummer!

four43 avatar Jul 15 '21 13:07 four43

If you want to contribute to stdlib... ;-)

https://docs.python.org/3/library/os.html#os.scandir.close Maybe in a filesystem context they need to clean up? Not sure...

liormizr avatar Jul 15 '21 13:07 liormizr

Yeah, they're using a context manager, which is fine, so it's closing things properly. I'm guessing it's an optimization to open the scan once, rip everything into memory quickly, then release it? That also seems like it would be bad for remotely mounted file systems like NFS and friends.

Someone was trying to make stuff faster here: https://www.python.org/dev/peps/pep-0471/

four43 avatar Jul 15 '21 13:07 four43

Released in Version 0.4.0

liormizr avatar Jan 05 '23 11:01 liormizr

Thanks for @liormizr's work!

LeoQuote avatar Jan 05 '23 11:01 LeoQuote