asdf
`asdf` does not appear able to open cloud-hosted files using `fsspec` file handles
Summary
Using `asdf` v2.12.0, it does not appear possible for `asdf` to work with file handles that have been opened using `fsspec`. (See the reproducible example and traceback appended below.)
Background - what is fsspec?!
Fsspec is a Python package which provides a file-like interface to remote file systems (e.g., web servers, Amazon S3, Google Cloud Storage, etc.). For example, given the HTTP or S3 URL of a file, `fsspec` enables that file to be accessed efficiently using standard random-access file operations:
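import fsspec

url = "https://example.com/file.fits"  # placeholder; substitute the HTTP or S3 URL of a real file
with fsspec.open(url).open() as fh:
    fh.seek(80)    # jump to byte offset 80
    fh.read(10)    # read the next 10 bytes only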
Behind the scenes, `fh.read(10)` will tend to trigger a buffered HTTP Range Request (or a protocol equivalent). As a result, fsspec enables small sections of large remote files to be accessed efficiently without downloading the entire file.
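To make the Range Request idea concrete, here is a minimal standard-library sketch of the request that corresponds to `fh.seek(80)` followed by `fh.read(10)`. It only builds the request and does not send it. The helper name `build_range_request` and the `example.com` URL are made up for illustration, and this is not how fsspec is implemented internally: because fsspec buffers reads, the byte range it actually requests may be larger than the 10 bytes asked for.

```python
import urllib.request


def build_range_request(url: str, offset: int, length: int) -> urllib.request.Request:
    """Build (but do not send) an HTTP request for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # The end position in an HTTP Range header is inclusive.
    last_byte = offset + length - 1
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


# Unbuffered equivalent of fh.seek(80); fh.read(10)
request = build_range_request("https://example.com/file.fits", offset=80, length=10)
print(request.get_header("Range"))  # bytes=80-89
```

A server that supports range requests would answer this with status 206 (Partial Content) and only the requested bytes, which is what makes random access to large remote files cheap.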
Because fsspec is rapidly becoming a standard way for Python libraries to interact with remote data (e.g., fsspec powers `dask` and `pandas`), I recently opened AstroPy PR https://github.com/astropy/astropy/pull/13238 to demonstrate how AstroPy could potentially leverage fsspec to work with cloud-hosted FITS files in an efficient way. This led me to evaluate whether `asdf` could work with `fsspec` as well.
Issue encountered
`asdf` does not appear to work with `fsspec` file handles out of the box at this time. For example, the following snippet raises an `AttributeError`:
import asdf
import fsspec
fh = fsspec.open("example.asdf").open()
af = asdf.open(fh)
data = af['mydata1'][100]
Traceback:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [1], in <cell line: 7>()
5 fh = fsspec.open("example.asdf").open()
6 af = asdf.open(fh)
----> 7 data = af['mydata1'][100]
File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/tags/core/ndarray.py:559, in _make_operation.<locals>.__operation__(self, *args)
558 def __operation__(self, *args):
--> 559 return getattr(self._make_array(), name)(*args)
File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/tags/core/ndarray.py:263, in NDArrayType._make_array(self)
260 else:
261 dtype = self._dtype
--> 263 self._array = np.ndarray(shape, dtype, block.data, self._offset, self._strides, self._order)
264 self._array = self._apply_mask(self._array, self._mask)
265 if block.readonly:
File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/block.py:1224, in Block.data(self)
1222 if not self._memmapped:
1223 self._fd.seek(self.data_offset)
-> 1224 self._data = self._read_data(self._fd, self._size, self._data_size)
1225 finally:
1226 self._fd.seek(curpos)
File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/block.py:1110, in Block._read_data(self, fd, used_size, data_size)
1106 """
1107 Read the block data from a file.
1108 """
1109 if not self.input_compression:
-> 1110 return fd.read_into_array(used_size)
1111 else:
1112 return mcompression.decompress(fd, used_size, data_size, self.input_compression)
File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/generic_io.py:770, in MemoryIO.read_into_array(self, size)
769 def read_into_array(self, size):
--> 770 buf = self._fd.getvalue()
771 offset = self._fd.tell()
772 result = np.frombuffer(buf, np.uint8, size, offset)
File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/fsspec/implementations/local.py:353, in LocalFileOpener.__getattr__(self, item)
352 def __getattr__(self, item):
--> 353 return getattr(self.f, item)
AttributeError: '_io.BufferedReader' object has no attribute 'getvalue'
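Reading the last frames of the traceback, the handle ended up in `MemoryIO.read_into_array`, which calls `getvalue()`. That method exists on `io.BytesIO` but not on the `_io.BufferedReader` that the local fsspec opener wraps. As a stopgap only, one could copy the whole file into an `io.BytesIO` before handing it to `asdf.open`. The sketch below shows just the copy step using the standard library; the helper name `load_into_memory` is hypothetical, and passing its result to `asdf.open` is an assumption I have not verified here. It also reads the entire file, so it gives up the partial-read benefit that motivates using fsspec in the first place.

```python
import io
import os
import tempfile


def load_into_memory(fh) -> io.BytesIO:
    """Copy everything from the current position of `fh` into an in-memory buffer."""
    return io.BytesIO(fh.read())


# Demonstrate on a small temporary file standing in for example.asdf.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    path = tmp.name

try:
    with open(path, "rb") as raw:  # an _io.BufferedReader, as in the traceback
        has_getvalue_before = hasattr(raw, "getvalue")
        buffered = load_into_memory(raw)
finally:
    os.remove(path)

has_getvalue_after = hasattr(buffered, "getvalue")
print(has_getvalue_before, has_getvalue_after)
```

With an fsspec handle, the same helper would be called as `load_into_memory(fh)` on the object returned by `fsspec.open(...).open()`.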
Note: I created `example.asdf` as follows:
import asdf
import numpy as np
ARRAY_SIZE = 100_000_000
tree = {
    'name': 'testdata',
    'mydata1': np.arange(ARRAY_SIZE),
    'mydata2': np.arange(ARRAY_SIZE),
    'mydata3': np.arange(ARRAY_SIZE),
}
af = asdf.AsdfFile(tree)
af.write_to('example.asdf')
I have made no attempt yet to investigate what changes might be needed to make `asdf` compatible with `fsspec`. I'm opening this issue first to find out if that would be a good idea!
This is a really interesting idea.
If you want to play with how to make this work I suggest looking in this module: https://github.com/asdf-format/asdf/blob/5bfe1731a258ebcb773bc8a70662a7daed03ebf9/asdf/generic_io.py#L1-L1072
It is where most of the "file" handling is abstracted for asdf. In any case, we are very happy to work with you further to make this a reality.
Tagging @eslavich for his thoughts.
this would be a great addition to ASDF 👍
@barentsen, we recently merged #1226, which might fix your issues with using `fsspec`.
Would you mind pulling the development version of ASDF, checking whether it fixes your issues with `fsspec` (including more general use), and reporting whether it works or whether you encounter further compatibility issues?