
`asdf` does not appear able to open cloud-hosted files using `fsspec` file handles

Open barentsen opened this issue 2 years ago • 3 comments

Summary

Using asdf v2.12.0, it does not appear possible for asdf to work with file handles which have been opened using fsspec. (See reproducible example and traceback appended below.)

Background - what is fsspec?!

Fsspec is a Python package which provides a file-like interface to remote file systems (e.g., web servers, Amazon S3, Google Cloud Storage, etc.). For example, given the HTTP or S3 URL of a file, fsspec enables that file to be accessed in an efficient way using standard random access file operations:

import fsspec
with fsspec.open(url).open() as fh:
    fh.seek(80)
    fh.read(10)

Behind the scenes, fh.read(10) will tend to trigger a buffered HTTP Range Request (or a protocol equivalent). As a result, fsspec enables small sections of large remote files to be accessed efficiently without downloading the entire file.
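To make the mapping from `seek`/`read` to a byte range concrete, here is a toy stand-in written with plain Python only. It is not fsspec's implementation: `ToyRangeReader` and its `requests` attribute are hypothetical names invented for this sketch, the "remote" file is just an in-memory `bytes` object, and there is no buffering (a real client would typically fetch a larger block than the 10 bytes asked for).

```python
class ToyRangeReader:
    """Toy file-like object that serves reads from byte ranges of a 'remote' blob."""

    def __init__(self, remote_bytes):
        self._remote = remote_bytes  # stands in for the object on the server
        self._pos = 0                # current file position
        self.requests = []           # Range header values we would have sent

    def seek(self, offset):
        # Seeking costs nothing: it only moves the local position.
        self._pos = offset
        return self._pos

    def read(self, size):
        start = self._pos
        end = min(start + size, len(self._remote))  # exclusive end
        if end <= start:
            return b""
        # HTTP Range headers use an inclusive last-byte position.
        self.requests.append(f"bytes={start}-{end - 1}")
        self._pos = end
        return self._remote[start:end]


remote = bytes(range(256)) * 4  # 1024 bytes pretending to live on a server
fh = ToyRangeReader(remote)
fh.seek(80)
chunk = fh.read(10)
print(len(chunk), fh.requests)
```

Only the ten requested bytes are touched here; the rest of the "remote" object is never read, which is the property that makes this access pattern attractive for large files.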

Because fsspec is rapidly becoming a standard way for Python libraries to interact with remote data (e.g., fsspec powers dask and pandas), I recently opened AstroPy PR https://github.com/astropy/astropy/pull/13238 to demonstrate how AstroPy could potentially leverage fsspec to work with cloud-hosted FITS files in an efficient way. This led me to evaluate whether asdf could work with fsspec as well.

Issue encountered

asdf does not appear to work with fsspec file handles out of the box at this time. For example, the following snippet raises an AttributeError:

import asdf
import fsspec

fh = fsspec.open("example.asdf").open()  
af = asdf.open(fh)
data = af['mydata1'][100]

Traceback:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [1], in <cell line: 7>()
      5 fh = fsspec.open("example.asdf").open()
      6 af = asdf.open(fh)
----> 7 data = af['mydata1'][100]

File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/tags/core/ndarray.py:559, in _make_operation.<locals>.__operation__(self, *args)
    558 def __operation__(self, *args):
--> 559     return getattr(self._make_array(), name)(*args)

File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/tags/core/ndarray.py:263, in NDArrayType._make_array(self)
    260 else:
    261     dtype = self._dtype
--> 263 self._array = np.ndarray(shape, dtype, block.data, self._offset, self._strides, self._order)
    264 self._array = self._apply_mask(self._array, self._mask)
    265 if block.readonly:

File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/block.py:1224, in Block.data(self)
   1222     if not self._memmapped:
   1223         self._fd.seek(self.data_offset)
-> 1224         self._data = self._read_data(self._fd, self._size, self._data_size)
   1225 finally:
   1226     self._fd.seek(curpos)

File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/block.py:1110, in Block._read_data(self, fd, used_size, data_size)
   1106 """
   1107 Read the block data from a file.
   1108 """
   1109 if not self.input_compression:
-> 1110     return fd.read_into_array(used_size)
   1111 else:
   1112     return mcompression.decompress(fd, used_size, data_size, self.input_compression)

File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/asdf/generic_io.py:770, in MemoryIO.read_into_array(self, size)
    769 def read_into_array(self, size):
--> 770     buf = self._fd.getvalue()
    771     offset = self._fd.tell()
    772     result = np.frombuffer(buf, np.uint8, size, offset)

File ~/.pyenv/versions/asdf/lib/python3.9/site-packages/fsspec/implementations/local.py:353, in LocalFileOpener.__getattr__(self, item)
    352 def __getattr__(self, item):
--> 353     return getattr(self.f, item)

AttributeError: '_io.BufferedReader' object has no attribute 'getvalue'
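Reading the last two frames: asdf's `MemoryIO.read_into_array` calls `getvalue()` on the handle, and that method exists on `io.BytesIO` but not on the `io.BufferedReader` that fsspec's `LocalFileOpener` delegates to. The stdlib-only sketch below illustrates just that difference; the temporary file is a stand-in and not a valid ASDF file.

```python
import io
import os
import tempfile

# Write a small stand-in file (NOT a real ASDF file, just some bytes).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"#ASDF 1.0.0\n" + b"\x00" * 32)
    path = tmp.name

try:
    with open(path, "rb") as raw:
        # open(..., "rb") returns an io.BufferedReader, like the object in the traceback.
        raw_type = type(raw).__name__
        has_getvalue_raw = hasattr(raw, "getvalue")
        # Copying the bytes into io.BytesIO yields an object that does have getvalue().
        buffered = io.BytesIO(raw.read())
finally:
    os.remove(path)

has_getvalue_buf = hasattr(buffered, "getvalue")
print(raw_type, has_getvalue_raw)
print(type(buffered).__name__, has_getvalue_buf)
```

Passing such an `io.BytesIO` copy to `asdf.open` might serve as a stopgap, but that is an assumption not verified against asdf v2.12.0, and it reads the whole file into memory, which gives up the partial-read benefit that motivates using fsspec in the first place.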

Note: I created example.asdf as follows:

import asdf
import numpy as np

ARRAY_SIZE = 100_000_000

tree = {
    'name': 'testdata',
    'mydata1': np.arange(ARRAY_SIZE),
    'mydata2': np.arange(ARRAY_SIZE),
    'mydata3': np.arange(ARRAY_SIZE)
}

af = asdf.AsdfFile(tree)
af.write_to('example.asdf')

I have made no attempt yet to investigate what changes might be needed to make asdf compatible with fsspec. I'm opening this issue first to find out if that would be a good idea!

barentsen avatar Jun 08 '22 03:06 barentsen

This is a really interesting idea.

If you want to play with how to make this work I suggest looking in this module: https://github.com/asdf-format/asdf/blob/5bfe1731a258ebcb773bc8a70662a7daed03ebf9/asdf/generic_io.py#L1-L1072

It is where most of the "file" handling is abstracted for asdf. In any case, we are very happy to work with you further to make this a reality.

WilliamJamieson avatar Jun 08 '22 18:06 WilliamJamieson

Tagging @eslavich for his thoughts.

WilliamJamieson avatar Jun 08 '22 18:06 WilliamJamieson

this would be a great addition to ASDF 👍

CagtayFabry avatar Jul 13 '22 12:07 CagtayFabry

@barentsen, we recently merged #1226, which might fix your issues with using fsspec.

Would you mind installing the development version of ASDF and checking whether it fixes your issues with fsspec, including in more general use? Please report back whether it works or whether you encounter further compatibility issues.

WilliamJamieson avatar Nov 01 '22 16:11 WilliamJamieson