
first run of Parquet schema read much slower than BlobClient?

Open astrowonk opened this issue 1 year ago • 9 comments

On the first run, this took 40+ seconds:

# slow
import pyarrow.parquet as pq

with abfs.open('example-container/test.parquet') as f:
    schema = pq.read_schema(f)

This, which I'm pretty sure downloads the entire blob, takes under half a second:

# fast
import io

with io.BytesIO() as byte_stream:
    myBlobServiceClient.get_blob_client(
        'example-container', 'test.parquet'
    ).download_blob().readinto(byte_stream)
    schema = pq.read_schema(byte_stream)
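For reference, both snippets above assume roughly this setup (a sketch; the account name and credential are placeholders, not from my real environment):

import adlfs
from azure.storage.blob import BlobServiceClient

credential = "<account-key-or-token>"  # placeholder

# fsspec-style filesystem used in the "slow" snippet
abfs = adlfs.AzureBlobFileSystem(account_name="myaccount", credential=credential)

# raw SDK client used in the "fast" snippet
myBlobServiceClient = BlobServiceClient(
    account_url="https://myaccount.blob.core.windows.net",
    credential=credential,
)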

But then if I run the abfs line again (even with a different file), it's fast: 101 ms.

# now it's fast
with abfs.open('example-container/test2.parquet') as f:
    schema = pq.read_schema(f)

Am I doing something wrong? Is there some odd initialization or scan that's making the initial use of abfs so slow?

astrowonk avatar Sep 26 '22 13:09 astrowonk

You will experience this delay only if your container contains a lot of blobs. The reason is unnecessary caching of blob details for the whole container, done in https://github.com/fsspec/adlfs/blob/main/adlfs/spec.py#L751-L759

This should probably be fixed, as it affects both the .open() and .info() methods when the container is big. One can retrieve the file contents with .cat(), which is not affected since it doesn't call .info(), but that's not ideal.
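A sketch of that workaround, reusing the abfs instance and file names from the snippets above (the downside: .cat() downloads the whole blob, so you lose the benefit of a footer-only schema read):

import io
import pyarrow.parquet as pq

# .cat() fetches the blob bytes without calling .info(), so it skips
# the container-wide listing -- at the cost of downloading everything.
data = abfs.cat('example-container/test.parquet')
schema = pq.read_schema(io.BytesIO(data))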

zjevik avatar Oct 03 '22 18:10 zjevik

This is related to #248, because the info method is called for .open() and would be fixed by PR #321. Though if one is specifying a path, I'm not sure why ls is ever needed: try the specified blob path and, if it's not there, raise FileNotFoundError. I don't quite understand the need to scan the entire container when open() has to specify the full path.

astrowonk avatar Oct 03 '22 19:10 astrowonk

Well, I cloned and installed from the repo and am now running adlfs-2022.10.0+4.gf15c37a. My original example of:

with abfs.open('example-container/test2.parquet') as f:
    schema = pq.read_schema(f)

still takes 40 seconds on the first run, while simply reading in the entire blob with get_blob_client takes 1-2 seconds.

abfs.info('example-container/test2.parquet')

similarly takes 40 seconds on the first run (then it's fast after that).

The code in #360 certainly seems like it would improve info performance, but I'm specifying the full path and it's still slow on the first run. It feels like it's scanning the entire container.
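A minimal timing sketch of what I'm measuring (same placeholder path as above; the numbers in the comments are what I observe):

import time

start = time.perf_counter()
abfs.info('example-container/test2.parquet')  # first call: ~40 s for me
print(f"first info(): {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
abfs.info('example-container/test2.parquet')  # subsequent calls: fast
print(f"second info(): {time.perf_counter() - start:.2f}s")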

astrowonk avatar Oct 18 '22 15:10 astrowonk

@astrowonk -- Can you share a reproducer? I've tried this with a single file from a partitioned parquet dataset and I'm getting <0.5 seconds for the open operation and <0.001 seconds for the info() operation.

hayesgb avatar Oct 21 '22 01:10 hayesgb

> @astrowonk -- Can you share a reproducer? I've tried this with a single file from a partitioned parquet dataset and I'm getting <0.5 seconds for the open operation and <0.001 seconds for the info() operation.

I'll try; I need a public (?) container with lots of files. This is happening with a container that probably has thousands of blobs in it. How many other blobs are in your test container?

astrowonk avatar Oct 21 '22 01:10 astrowonk

Hi @hayesgb. It's not really possible to make a simple test, because you need a lot of files in the container to see the performance problem. However, if you can get DEBUG-level log statements to print (I had no luck, so I had to switch line 920 to a print), you'll see that it's calling ls on the entire container for any .info. The original fix in #321 would have re-ordered the calls so that this line:

out = await self._ls(path, invalidate_cache=invalidate_cache, **kwargs)

would run first. But the actual merged code in #360 still does this first:

        out = await self._ls(
            self._parent(fullpath), invalidate_cache=invalidate_cache, **kwargs
        )

self._parent(fullpath) is the parent container, so it is now calling ls on the entire container, which will be extremely slow if there are lots of files. The original re-ordering of #321 is preferred, where the first _ls call is on the requested path without wrapping it in self._parent.
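Roughly, the preferred ordering would look like this (a paraphrase of the idea, not the exact #321 diff):

# Paraphrase: try the exact blob path first; only fall back to listing
# the parent (the whole container, for top-level blobs) if that fails.
try:
    out = await self._ls(path, invalidate_cache=invalidate_cache, **kwargs)
except FileNotFoundError:
    out = await self._ls(
        self._parent(fullpath), invalidate_cache=invalidate_cache, **kwargs
    )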

Honestly, if I'm calling info on a specific path, I don't quite understand why we'd ever want to ls the entire container in response, even if the full path was missing. I'd just raise a FileNotFoundError.
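Something like this sketch of the alternative, asking the SDK for the one blob's properties directly (container/blob splitting is simplified, and I'm assuming self.service_client is the async BlobServiceClient that adlfs already holds internally):

from azure.core.exceptions import ResourceNotFoundError

async def _info_direct(self, path):
    # Look up the single blob instead of listing anything.
    container, _, blob = path.partition("/")
    try:
        props = await self.service_client.get_blob_client(
            container, blob
        ).get_blob_properties()
    except ResourceNotFoundError:
        raise FileNotFoundError(path)
    return {"name": path, "size": props.size, "type": "file"}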

astrowonk avatar Oct 28 '22 14:10 astrowonk

Adding more details here, @hayesgb. The ordering from https://github.com/fsspec/adlfs/pull/321 is preferred over https://github.com/fsspec/adlfs/pull/360, as @astrowonk mentioned, but there's another issue even with the corrected ordering.

It occurs when you call AzureBlobFileSystem.info() for cached blobs:

    def info(self, path, refresh=False, **kwargs):
        try:
            fetch_from_azure = (path and self._ls_from_cache(path) is None) or refresh
        except Exception:
            fetch_from_azure = True
        if fetch_from_azure:
            return sync(
                self.loop,
                self._info,
                path,
                refresh,
                version_id=kwargs.get("version_id"),
            )
        return super().info(path)

This will call AbstractFileSystem.info(), which contains the doomed ls call on a (possibly) large container, so we're doing ls on the whole container for every single blob again 😞 https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L614
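Simplified, that fallback path looks roughly like this (a paraphrase of the linked fsspec code, not an exact copy):

# Paraphrase of AbstractFileSystem.info(): it builds its answer by
# listing the *parent* of the path -- and for a top-level blob, that
# parent is the entire container.
def info(self, path, **kwargs):
    out = self.ls(self._parent(path), detail=True, **kwargs)
    out = [o for o in out if o["name"].rstrip("/") == path]
    if out:
        return out[0]
    ...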

zjevik avatar Oct 28 '22 18:10 zjevik