First run of Parquet schema read much slower than BlobClient?
The first run of this took 40+ seconds:

```python
# slow
with abfs.open('example-container/test.parquet') as f:
    schema = pq.read_schema(f)
```
This, which I'm fairly sure downloads the entire blob, takes under half a second:
```python
# fast
import io

with io.BytesIO() as byte_stream:
    myBlobServiceClient.get_blob_client(
        'example-container',
        'test.parquet').download_blob().readinto(byte_stream)
    schema = pq.read_schema(byte_stream)
```
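As an aside on why the schema read should be cheap at all: a Parquet file ends with a little-endian 4-byte footer length followed by the magic bytes `PAR1`, so `pq.read_schema` only needs a few small reads near the end of the file, not the whole blob. A stdlib-only sketch of locating the footer (the byte string below is a hand-built stand-in, not a real Parquet file):

```python
import struct

def footer_length(tail: bytes) -> int:
    """Given the last 8 bytes of a Parquet file, return the footer size.

    Layout of the file tail: ...footer | 4-byte LE footer length | b"PAR1".
    """
    if tail[-4:] != b"PAR1":
        raise ValueError("not a Parquet file (missing magic bytes)")
    return struct.unpack("<I", tail[-8:-4])[0]

# Fake tail: pretend the footer is 123 bytes long.
fake_tail = struct.pack("<I", 123) + b"PAR1"
print(footer_length(fake_tail))  # -> 123
```

So a file-like object over a blob only has to serve a couple of ranged reads for `read_schema`; the 40-second first run can't be explained by data transfer alone.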
But then if I run the abfs line again (even with a different file), it's fast: 101 ms.
```python
# now it's fast
with abfs.open('example-container/test2.parquet') as f:
    schema = pq.read_schema(f)
```
Am I doing something wrong? Is there some odd initialization or scan that makes the initial use of abfs so slow?
You will experience this delay only if your container contains a lot of blobs. The cause is unnecessary caching of blob details for the whole container, implemented in https://github.com/fsspec/adlfs/blob/main/adlfs/spec.py#L751-L759

This should probably be fixed, as it affects both the `.open()` and `.info()` methods when the container is big. One can retrieve the file contents with `.cat()`, which is not affected since it doesn't call `.info()`, but that's not ideal.
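To make the cost concrete, here is a stdlib-only toy model of the behaviour described above; the class and method names are illustrative, not adlfs internals:

```python
# Toy model: `info` pays a full-container listing, while `cat` fetches the
# blob directly. Illustrative only, not actual adlfs code.

class ToyFS:
    def __init__(self, blobs):
        self.blobs = blobs      # name -> contents
        self.list_calls = 0     # number of full-container listings

    def ls_container(self):
        # Stands in for listing (and caching) every blob's details.
        self.list_calls += 1
        return list(self.blobs)

    def info(self, path):
        # Mirrors the problematic pattern: list the whole container
        # just to describe one blob.
        names = self.ls_container()
        if path not in names:
            raise FileNotFoundError(path)
        return {"name": path, "size": len(self.blobs[path])}

    def cat(self, path):
        # Fetches the blob directly; no listing involved.
        return self.blobs[path]

fs = ToyFS({f"blob{i}": b"x" for i in range(10_000)})
fs.cat("blob42")
print(fs.list_calls)  # -> 0: cat never lists the container
fs.info("blob42")
print(fs.list_calls)  # -> 1: info listed all 10,000 blobs
```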
This is related to #248, because the info method is called by `.open()`, and it would be fixed by PR #321. Though if one is specifying a path, I'm not sure why `ls` is ever needed: try the specified blob path, and if it's not there, raise FileNotFound. I don't quite understand the need to scan the entire container when `open()` has to specify the full path.
Well, I cloned and installed from the repo, and I am now running adlfs-2022.10.0+4.gf15c37a. My original example of:
```python
with abfs.open('example-container/test2.parquet') as f:
    schema = pq.read_schema(f)
```
still takes 40 seconds on the first run, while simply reading in the entire blob with `get_blob_client` takes 1-2 seconds.
```python
abfs.info('example-container/test2.parquet')
```
This similarly takes 40 seconds on the first run (then it is fast after that).
The code in #360 certainly seems like it would improve `info` performance, but I'm specifying the full path and it's still slow on the first run. It feels like it's scanning the entire container.
@astrowonk -- Can you share a reproducer? I've tried this with a single file from a partitioned Parquet dataset, and I'm getting < 0.5 seconds for the `open` operation and < 0.001 seconds for the `info()` operation.
I'll try; I need a public (?) container with lots of files. This is happening with a container that probably holds thousands of blobs. How many other blobs are in your test container?
Hi @hayesgb. It's not really possible to make a simple test, because you need a lot of files in the container to see the performance problem. However, if you can get the logger to print DEBUG-level statements (I have had no luck, so I had to switch line 920 to a print), you'll see that it's calling `ls` on the entire container for any `.info`. The original fix in #321 would have re-ordered the calls such that this line:
```python
out = await self._ls(path, invalidate_cache=invalidate_cache, **kwargs)
```
would run first. But the actual merged code in #360 still does this first:
```python
out = await self._ls(
    self._parent(fullpath), invalidate_cache=invalidate_cache, **kwargs
)
```
`self._parent(fullpath)` is the parent container, so it is still calling `ls` on the entire container, which will be extremely slow if there are lots of files. The original re-ordering of #321 is preferred, where the first `_ls` call is to the requested path without wrapping it in `self._parent`.
Honestly, if I'm calling `info` on a specific path, I don't quite understand why we'd ever want to `ls` the entire container in response. Even if the full path was missing, I'd just raise a path-not-found error.
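The path-first ordering being argued for here can be sketched with a toy model (not the actual adlfs code; `head_blob` stands in for a cheap single-blob metadata request):

```python
class Container:
    """Toy stand-in for a blob container; not adlfs code."""
    def __init__(self, blobs):
        self.blobs = blobs
        self.full_listings = 0

    def head_blob(self, path):
        # Cheap single-blob metadata request.
        if path in self.blobs:
            return {"name": path, "size": len(self.blobs[path])}
        return None

    def list_all(self):
        # Expensive full-container listing; should be a last resort.
        self.full_listings += 1
        return list(self.blobs)

def info_path_first(container, path):
    # Preferred ordering: ask for the exact path, raise if absent.
    out = container.head_blob(path)
    if out is None:
        raise FileNotFoundError(path)
    return out

c = Container({"test2.parquet": b"\x00" * 64})
print(info_path_first(c, "test2.parquet"))
print(c.full_listings)  # -> 0: no container scan needed
```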
Adding more details here, @hayesgb. The ordering from https://github.com/fsspec/adlfs/pull/321 is preferred over https://github.com/fsspec/adlfs/pull/360, as @astrowonk mentioned, but there's another issue even with the corrected ordering.
It's when you call `AzureBlobFileSystem.info()` for cached blobs:
```python
def info(self, path, refresh=False, **kwargs):
    try:
        fetch_from_azure = (path and self._ls_from_cache(path) is None) or refresh
    except Exception:
        fetch_from_azure = True
    if fetch_from_azure:
        return sync(
            self.loop,
            self._info,
            path,
            refresh,
            version_id=kwargs.get("version_id"),
        )
    return super().info(path)
```
This will call `AbstractFileSystem.info()`, which contains the doomed `ls` call on a (possibly) large container, and we're doing `ls` on the collection for every single blob again 😞
https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L614
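The fallback can be modelled in a few lines: the generic fsspec `info()` answers by listing the parent of the path, and for a blob at the container root that parent is the whole container, so every such `info()` call pays a full listing. A stdlib-only toy (illustrative names, not the real fsspec classes):

```python
class ToyAbstractFS:
    """Models the generic fallback: info() answers via ls(parent)."""
    def __init__(self, blobs):
        self.blobs = blobs
        self.ls_calls = 0

    def _parent(self, path):
        return path.rsplit("/", 1)[0] if "/" in path else ""

    def ls(self, path):
        self.ls_calls += 1
        # For a container root, this enumerates every blob.
        return [{"name": n, "size": len(v)} for n, v in self.blobs.items()]

    def info(self, path):
        # Describe one blob by scanning the listing of its parent.
        for entry in self.ls(self._parent(path)):
            if entry["name"] == path:
                return entry
        raise FileNotFoundError(path)

fs = ToyAbstractFS({f"blob{i}": b"x" for i in range(5_000)})
for i in range(3):
    fs.info(f"blob{i}")
print(fs.ls_calls)  # -> 3: one full listing per info() call
```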