# Error on `.read()` for 2D DataArrays

## Summary
When trying to read a 2D DataArray from a tiled catalog, a `ValueError` is raised because the length of the coordinate labels does not match the length of the corresponding dimension of the data. This appears to be because `tiled.client.xarray._WideTableFetcher` receives the 2D data as a flattened/raveled `pandas.DataFrame`.
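As a minimal standalone illustration of the kind of mismatch involved (plain xarray, no tiled; the raveled length 16384 is hypothetical, mimicking the error below), xarray refuses to build a variable whose coordinate length disagrees with its dimension length:

```python
import numpy as np
import xarray as xr

# A 128x128 variable paired with a raveled, 16384-long 'pixx' coordinate,
# mimicking the mismatch described above.
err = None
try:
    xr.Dataset(
        {'X': (('pixx', 'pixy'), np.zeros((128, 128)))},
        coords={'pixx': np.arange(16384)},
    )
except ValueError as e:
    err = e

print(err)  # conflicting sizes for dimension 'pixx': ...
```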
## Details

- Tiled server: 0.1.0a104
- Tiled client: 0.1.0a104
- xarray: 2023.7.0
## Reproduce Basic Error
(edit: apparently we can't attach notebooks to issues)
- Write an `xarray.Dataset` containing 2D DataArrays into a tiled catalog:

  ```python
  import numpy as np
  import xarray as xr
  from tiled.client import from_uri
  from tiled.client.xarray import write_xarray_dataset

  c = from_uri('http://localhost:8000', api_key='.', structure_clients='dask')

  ds = xr.Dataset()
  ds['X'] = xr.DataArray(np.random.random((128, 128)), dims=['pixx', 'pixy'],
                         coords={'pixx': np.arange(128), 'pixy': np.arange(128)})
  write_xarray_dataset(c, ds)
  ```
- Attempt to read the Dataset/DataArrays:

  ```python
  key = list(c.keys())[-1]  # grab last key (assumes keys are stored in order... not sure if this is true)

  c[key].read()
  # Raises ValueError: conflicting sizes for dimension 'pixx': length 16384 on 'pixx' and length 128 on {'pixx': 'X', 'pixy': 'X'}

  c[key]['X']
  # Returns <DaskArrayClient shape=(128, 128) chunks=((128,), (128,)) dtype=float64 dims=('pixx', 'pixy')>

  c[key].read('X')
  # Returns an xarray.Dataset without coordinate labels

  c[key].read(optimize_wide_table=False)
  # Returns an xarray.Dataset with the correct data variables and coordinate labels
  ```
## optimize_wide_table

We noticed that when calling the underlying `tiled.client.xarray._build_arrays` method, the coordinate dimensions were returned incorrectly when `optimize_wide_table=True`:
```python
data_vars, coords = c[key]._build_arrays(['pixx'], optimize_wide_table=True)
np.array(coords['pixx'][1]).shape
# Returns (16384,)
```
This result is somewhat opaque to diagnose because `coords['pixx'][1]` is a Dask delayed object that reports the correct `shape=(128,)` before being computed, but becomes `shape=(16384,)` after casting to `np.ndarray`.
Conversely, when `optimize_wide_table=False`, we get the correct coordinate shape:

```python
data_vars, coords = c[key]._build_arrays(['pixx'], optimize_wide_table=False)
np.array(coords['pixx'][1]).shape
# Returns (128,)
```
`coords['pixx'][1]` in this case is still a `dask.delayed` object, but its reported shape and the final computed shape match. The other difference is in the Dask graph field, but I'm unsure of the consequence of this.
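For reference, a minimal standalone sketch (not using tiled; the `fetch` function is hypothetical) of how a Dask array's declared shape metadata can disagree with the shape of what the computation actually produces, which is the behavior observed above:

```python
import dask
import dask.array as da
import numpy as np

@dask.delayed
def fetch():
    # Pretend the server handed back 16384 raveled values
    # instead of a 128-long coordinate.
    return np.random.random(16384)

# Dask trusts the declared shape until the result is materialized.
arr = da.from_delayed(fetch(), shape=(128,), dtype=float)
print(arr.shape)            # (128,) -- the declared metadata
print(np.array(arr).shape)  # (16384,) -- the computed truth
```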
## _WideTableFetcher

Digging a little deeper, we identified that the problem likely lies in the `tiled.client.xarray._WideTableFetcher` class, which returns a raveled/flattened version of the 2D data in the DataArray:
```python
from tiled.client.xarray import _WideTableFetcher

w = _WideTableFetcher(c[key].context.http_client.get, c[key].item['links']['full'])
w.register('pixx', c[key]['pixx'], c[key]['pixx'].structure())
w._fetch_variables('pixx')
# Returns a flat pandas.DataFrame with shape (16384, 2) and a raveled MultiIndex for pixx and pixy
```
I'm happy to help debug this further, but I got lost trying to figure out what's happening on the server during the GET request in `tiled.client.xarray._WideTableFetcher._fetch_variables`.
## Update
I think the problem is 'simpler' than I initially thought. After taking a look through `tiled.serialization.xarray`, I think the issue is in the use of `xarray.Dataset.to_dataframe`. From the docstring of the `to_dataframe` method: "...The (returned) DataFrame is indexed by the Cartesian product of this dataset's indices." This means that the dataset `ds` above is serialized as `df = ds.to_dataframe()`, a long DataFrame with 16384 rows and a (`pixx`, `pixy`) MultiIndex.
Simply undoing the DataFrame conversion with `df.to_xarray()` returns a Dataset (or a DataArray if `df` happens to be a `pandas.Series`) with the proper dimensions and coordinate labels. The question is whether, from a performance or conceptual perspective, this conversion to an xarray object is acceptable.
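To make the flattening and its inverse concrete, here is a small self-contained round trip (plain xarray/pandas, no tiled):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
ds['X'] = xr.DataArray(np.random.random((128, 128)), dims=['pixx', 'pixy'],
                       coords={'pixx': np.arange(128), 'pixy': np.arange(128)})

df = ds.to_dataframe()  # Cartesian product: 16384 rows, MultiIndex (pixx, pixy)
print(df.shape)         # (16384, 1)

ds2 = df.to_xarray()    # undo the flattening
print(ds2['X'].shape)   # (128, 128)
```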
Alternatively, you can also manually extract your coords, dims, and data like so:

```python
dim_lengths = [len(level) for level in df.index.levels]
data = df['X'].values.reshape(*dim_lengths)
coords = {level.name: level.values for level in df.index.levels}
dims = [level.name for level in df.index.levels]
xr.DataArray(data, dims=dims, coords=coords)
```
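Note that the `reshape` in this manual route assumes the DataFrame rows are in C order over the MultiIndex levels, which is what `to_dataframe` produces by default (`dim_order` follows the dataset's dimension order). A quick self-contained check of that assumption (plain xarray/pandas, no tiled):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
ds['X'] = xr.DataArray(np.random.random((128, 128)), dims=['pixx', 'pixy'],
                       coords={'pixx': np.arange(128), 'pixy': np.arange(128)})
df = ds.to_dataframe()

# Manual reconstruction, relying on C-order rows over the MultiIndex levels.
dim_lengths = [len(level) for level in df.index.levels]
data = df['X'].values.reshape(*dim_lengths)
coords = {level.name: level.values for level in df.index.levels}
dims = [level.name for level in df.index.levels]
rebuilt = xr.DataArray(data, dims=dims, coords=coords)

print(rebuilt.equals(ds['X']))  # True
```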
---

Hey @martinb, I remember Peter mentioning this but I completely missed this GH issue when it landed. I've got the tab open now and will follow up or delegate someone to follow up.