                        Slow writes of Zarr files using S3Map
I'm noticing an order of magnitude difference when writing a Zarr file to an S3 bucket using xarray.to_zarr(store=S3Map) (~10s) versus s3.put(local_store, remote_store) (~1s).
The data is public; there is a simple demonstration here: https://gist.github.com/scottyhq/effa642f00112971e2350d921a0aed9d
I notice that the s3fs logs from the Xarray path contain many head_object and get_object calls, compared to only put_object calls when using s3.put with a local copy.
I realize this may be considered an Xarray or Zarr issue. However, I'm interested in interpreting this s3fs log, and in whether there are non-default settings for s3fs.S3FileSystem() or s3fs.S3Map() that are most appropriate for writing.
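Roughly, the two write paths being compared look like this (a minimal sketch only; the dataset is a stand-in for the one in the gist, and the local path and second remote key are placeholders):

```python
import numpy as np
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem()
ds = xr.Dataset({"z": (("y", "x"), np.zeros((10, 10)))})  # stand-in dataset

# Slow path (~10s): write straight to S3 through a mapper
remote = s3fs.S3Map("nasa-cryo-scratch/scottyhq/zarr_from_xarray.zarr", s3=fs)
ds.to_zarr(store=remote, mode="w")

# Fast path (~1s): write locally, then upload the whole tree in one put
ds.to_zarr("local_store.zarr", mode="w")
fs.put("local_store.zarr", "nasa-cryo-scratch/scottyhq/zarr_from_local.zarr", recursive=True)
```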
Truncated log:
22:47:12.167 DEBUG Get directory listing page for nasa-cryo-scratch/scottyhq/zarr_from_xarray.zarr
22:47:12.215 DEBUG Get directory listing page for nasa-cryo-scratch/scottyhq/zarr_from_xarray.zarr/.zgroup
22:47:12.229 DEBUG CALL: list_objects_v2 - () - {'MaxKeys': 1, 'Bucket': 'nasa-cryo-scratch'}
22:47:12.243 DEBUG CALL: put_object - () - {'Bucket': 'nasa-cryo-scratch', 'Key': 'scottyhq/zarr_from_xarray.zarr/.zgroup'}
22:47:12.272 DEBUG CALL: head_object - ({},) - {'Bucket': 'nasa-cryo-scratch', 'Key': 'scottyhq/zarr_from_xarray.zarr/.zarray'}
22:47:12.293 DEBUG Client error (maybe retryable): An error occurred (404) when calling the HeadObject operation: Not Found
22:47:12.293 DEBUG CALL: list_objects_v2 - ({},) - {'Bucket': 'nasa-cryo-scratch', 'Prefix': 'scottyhq/zarr_from_xarray.zarr/.zarray/', 'Delimiter': '/', 'MaxKeys': 1}
22:47:12.330 DEBUG CALL: get_object - () - {'Bucket': 'nasa-cryo-scratch', 'Key': 'scottyhq/zarr_from_xarray.zarr/.zgroup'}
22:47:12.352 DEBUG CALL: head_object - ({},) - {'Bucket': 'nasa-cryo-scratch', 'Key': 'scottyhq/zarr_from_xarray.zarr/longitude/.zarray'}
22:47:12.364 DEBUG Client error (maybe retryable): An error occurred (404) when calling the HeadObject operation: Not Found
22:47:12.364 DEBUG CALL: list_objects_v2 - ({},) - {'Bucket': 'nasa-cryo-scratch', 'Prefix': 'scottyhq/zarr_from_xarray.zarr/longitude/.zarray/', 'Delimiter': '/', 'MaxKeys': 1}
Versions:
fsspec: 2023.10.0
xarray: 2023.10.1
zarr  : 2.16.1
s3fs  : 2023.10.0
At a glance, it seems the code is checking:
- is this path a directory?
- is this a group?
- is this an array?
Clearly, when the intent is to overwrite, these are unnecessary, and some seem to query the path just written. The only thing you might want to do is rm(recursive), if you know you are writing a whole dataset from scratch.
s3fs probably cannot do anything about this; while you can prospectively pre-list a directory tree, the listings cache will be regularly invalidated because there are write operations in there too. Perhaps something clever could be done to update an already-cached listing.
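Just to illustrate the rm(recursive) suggestion above, a sketch only (bucket/prefix taken from the log; ds is whatever dataset is being written):

```python
import s3fs

fs = s3fs.S3FileSystem()
target = "nasa-cryo-scratch/scottyhq/zarr_from_xarray.zarr"

# Sketch: clear any previous copy of the target up front when writing a
# whole dataset from scratch, rather than probing each key individually.
if fs.exists(target):
    fs.rm(target, recursive=True)
ds.to_zarr(s3fs.S3Map(target, s3=fs), mode="w")
```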
I am not immediately sure whether xarray or zarr is making the calls. It would be interesting to see what zarr alone does. I tried the following:
In [19]: c = zarr.open_array(m, shape=(10, 10), chunks=(5, 5), mode="w", dtype="int64")
2023-11-06 21:13:32,243 - s3fs - DEBUG - _lsdir -- Get directory listing page for mymdtemp/zarr
2023-11-06 21:13:32,538 - s3fs - DEBUG - _lsdir -- Get directory listing page for mymdtemp/zarr/.zarray
2023-11-06 21:13:32,600 - s3fs - DEBUG - _call_s3 -- CALL: put_object - () - {'Bucket': 'mymdtemp', 'Key': 'zarr/.zarray'}
2023-11-06 21:13:32,720 - s3fs - DEBUG - _call_s3 -- CALL: get_object - () - {'Bucket': 'mymdtemp', 'Key': 'zarr/.zarray'}
and probably we see the whole behaviour here.
It's also worth noting that zarr processes each variable/array sequentially, whereas these tiny metadata files could all be sent concurrently, as put() does. zarr has no operations that are concurrent across variables, only for getting/putting multiple data chunks (even though s3fs supports this).
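As a rough illustration only (this is not what zarr does today; the keys and bytes below are placeholders), the small metadata documents could be collected and pushed in one concurrent batch:

```python
import s3fs

fs = s3fs.S3FileSystem()

# Collect the tiny metadata documents first, then upload them in a single
# concurrent batch instead of one sequential put per array.
metadata = {
    "my-bucket/example.zarr/.zgroup": b'{"zarr_format": 2}',
    "my-bucket/example.zarr/longitude/.zarray": b"...",  # placeholder JSON bytes
    "my-bucket/example.zarr/latitude/.zarray": b"...",   # placeholder JSON bytes
}
fs.pipe(metadata)  # s3fs issues these put_object calls concurrently
```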
Maybe fsspec could provide a way to combine "caching" and "transactions" so that the effect of local-write->put() is conveniently achieved.
> Clearly, when the intent is to overwrite, these are unnecessary, and some seem to query the path just written. The only thing you might want to do is rm(recursive)
Thanks for the detailed response @martindurant. The same requests are happening from Xarray even with an empty bucket to begin with.
> probably we see the whole behaviour here.
Yes! Your minimal example captures the same log outputs as xr.to_zarr(), so I think we can focus on the interaction of Zarr and S3FS. I'm not sure why 'list', 'get', 'head' are necessary for the mode='w' case? Shouldn't there just be 'puts'? Would this require a change in s3fs or Zarr?
> zarr is processing each variable/array sequentially, whereas these tiny metadata files can all be sent concurrently, as put() does... Maybe fsspec could provide a way to combine "caching" and "transactions" so that the effect of local-write->put() is conveniently achieved.
Interesting, thanks for clarifying! A streamlined local-write->async puts pattern could indeed be very useful. Is it possible to do this directly from the in-memory dataset with current fsspec functionality? I'm not sure how to utilize "transactions" to achieve this, but would be happy to test something out.
I tried last night to write some code around simplecache to be general for any backend, but I don't think it's doing the right thing - partly because zarr still wants to list and read from the target, where the new files don't yet exist.
It might be better to write better Transaction class(es) that know what to do with filesystems that have bulk put/commit operations (cc https://github.com/fsspec/filesystem_spec/issues/1416 ), but they need to be read/write.
It's also worth pointing out that zarr already supports separating the metadata mapper from the data mapper, so one could write a specialised deferred-write thing that wraps FSStore.
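A sketch of that idea using the store/chunk_store split (placeholder bucket name; this is not an existing helper, just an illustration): keep the metadata in a plain dict, send chunks straight to S3, and push the metadata in one batch at the end.

```python
import s3fs
import zarr

fs = s3fs.S3FileSystem()
remote = s3fs.S3Map("my-bucket/example.zarr", s3=fs)  # placeholder bucket

# Keep .zgroup/.zarray/.zattrs documents in memory while writing; data
# chunks go straight to S3 via chunk_store.
meta = {}
root = zarr.open_group(store=meta, chunk_store=remote, mode="w")
arr = root.create_dataset("z", shape=(10, 10), chunks=(5, 5), dtype="int64")
arr[:] = 42

# Upload all the small metadata documents in one concurrent batch
fs.pipe({f"my-bucket/example.zarr/{key}": val for key, val in meta.items()})
```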
I think https://github.com/fsspec/filesystem_spec/pull/1434 goes some way toward resolving the problem: it caches writes during a transaction, putting them to the remote in one concurrent operation and rereading from local in the meantime. It does not address directory listing.