
Xarray Serialisation Issues reading NetCDF from AzureBlobFile

Open alex-rakowski opened this issue 1 year ago • 3 comments

I'm trying to read a NetCDF file with xarray and running into serialisation issues.

The AzureBlobFile object contains a SimpleQueue, which is non-trivial to serialise. I suspect that fsspec should be handling the serialisation differently.
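
To illustrate why a SimpleQueue attribute blocks serialisation, here is a minimal sketch using only the standard library. `FileLike` and its `_queue` attribute are hypothetical stand-ins, not adlfs's actual AzureBlobFile, and the `__getstate__`/`__setstate__` pair is just the general Python pattern for dropping an unpicklable attribute and recreating it on load, not a confirmed fix for adlfs or fsspec.

```python
import pickle
import queue


class FileLike:
    """Hypothetical stand-in for a file object that owns a SimpleQueue."""

    def __init__(self, path):
        self.path = path
        self._queue = queue.SimpleQueue()  # this attribute cannot be pickled

    def __getstate__(self):
        # Copy the instance state and leave the queue out of it.
        state = self.__dict__.copy()
        state.pop("_queue", None)
        return state

    def __setstate__(self, state):
        # Restore the plain attributes and create a fresh, empty queue.
        self.__dict__.update(state)
        self._queue = queue.SimpleQueue()


def queue_is_picklable():
    """Return True if a bare SimpleQueue can be pickled, False otherwise."""
    try:
        pickle.dumps(queue.SimpleQueue())
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print("SimpleQueue picklable:", queue_is_picklable())
    restored = pickle.loads(pickle.dumps(FileLike("container/file.nc")))
    print("restored path:", restored.path)
```

Any items still sitting in the queue are lost with this approach, so it only suits cases where the queue holds transient state that can be rebuilt on the receiving side.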

Simple Reproducer:

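import fsspec
import xarray as xr
from distributed.protocol import serialize, ToPickle

# Credentials are redacted; substitute real values for '***'.
storage_options = {'connection_string': '***', 'account_key': '***'}
fs = fsspec.filesystem('abfs', **storage_options)
url = "<CONTAINER_NAME>"
files = fs.ls(url)
# Open the first file in the container lazily, with dask chunks.
ds = xr.open_dataset(
    fs.open(files[0], 'rb'),
    chunks={'x': 2000, 'y': 2000},
    engine='h5netcdf',
)
# Serialising the dask graph of the first variable raises the TypeError.
serialize(ToPickle(list(ds.variables.values())[0]._data.dask))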

alex-rakowski, Jun 13 '24 09:06

Can you post the full traceback? What object has a reference to the queue?

TomAugspurger, Jun 13 '24 12:06

2024-06-13 12:48:57,917 - distributed.protocol.pickle - ERROR - Failed to serialize <ToPickle: HighLevelGraph with 2 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x31490b130>
 0. original-open_dataset-FSC-2bd87bcfc4ee55630c36125387cfd518
 1. open_dataset-FSC-2bd87bcfc4ee55630c36125387cfd518
>.
Traceback (most recent call last):
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 63, in dumps
    result = pickle.dumps(x, **dump_kwargs)
TypeError: cannot pickle 'weakref.ReferenceType' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 68, in dumps
    pickler.dump(x)
TypeError: cannot pickle 'weakref.ReferenceType' object

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 81, in dumps
    result = cloudpickle.dumps(x, **dump_kwargs)
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1479, in dumps
    cp.dump(obj)
  File "/Users/arakowski/miniconda3/envs/pytorch-coiled/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object

The object named in the error is 'weakref.ReferenceType' here, but it sometimes shows up as SimpleQueue instead when doing something more realistic with the dataset than the simple reproducer does.

alex-rakowski, Jun 13 '24 12:06

Thanks. We'll need to figure out which attributes of which objects aren't picklable. Some of these (like things from azure.storage.blob or azure.identity) might need to be pushed upstream. Others might need to be fixed here. Any research you can do here would be helpful.
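
One way to start that research is to walk an object and try to pickle each piece separately. The helper below is a rough sketch, not part of adlfs, fsspec, or distributed: `find_unpicklable` is a hypothetical name, and it only follows dicts, lists, tuples, sets, and instance `__dict__` attributes, so it will miss state held in `__slots__` or closures. The `holder` object is a made-up stand-in for something like an open file or a graph layer.

```python
import pickle
import queue
import weakref
from types import SimpleNamespace


def find_unpicklable(obj, path="obj", seen=None, max_depth=5):
    """Return (path, type name, error) for the deepest values that fail to pickle."""
    seen = set() if seen is None else seen
    if id(obj) in seen or max_depth < 0:
        return []
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return []  # this value pickles fine, nothing to report
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"

    # Gather child values we know how to descend into.
    if isinstance(obj, dict):
        children = [(f"{path}[{k!r}]", v) for k, v in obj.items()]
    elif isinstance(obj, (list, tuple, set, frozenset)):
        children = [(f"{path}[{i}]", v) for i, v in enumerate(obj)]
    else:
        attrs = getattr(obj, "__dict__", {})
        children = [(f"{path}.{k}", v) for k, v in attrs.items()]

    found = []
    for child_path, child in children:
        found.extend(find_unpicklable(child, child_path, seen, max_depth - 1))
    # If no child explains the failure, blame this object itself.
    return found or [(path, type(obj).__name__, error)]


# Made-up object holding the two kinds of values seen in this issue.
holder = SimpleNamespace(
    name="blob",
    inner={"q": queue.SimpleQueue()},
    ref=weakref.ref(find_unpicklable),
)
for item_path, type_name, error in find_unpicklable(holder, "holder"):
    print(item_path, type_name, error)
```

Running it on the object passed to `ToPickle` in the reproducer should, under the limitations above, narrow down which attribute path holds the weakref or the SimpleQueue.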

TomAugspurger, Jun 19 '24 20:06