
Error on .read() for 2D DataArrays

Open martintb opened this issue 1 year ago • 2 comments

Summary

When trying to read a 2D DataArray from a tiled catalog, a ValueError is raised because the length of the coordinate labels does not match the length of the corresponding dimension of the data. This appears to happen because tiled.client.xarray._WideTableFetcher receives the 2D data as a flattened/raveled pandas.DataFrame.

Details

Tiled server: 0.1.0a104
Tiled client: 0.1.0a104
xarray: 2023.7.0

Reproduce Basic Error

(edit: apparently we can't attach notebooks to issues)

  1. Write an xarray.Dataset containing 2D DataArrays into a tiled catalog:
import numpy as np
import xarray as xr
from tiled.client import from_uri
from tiled.client.xarray import write_xarray_dataset

c = from_uri('http://localhost:8000',api_key='.',structure_clients='dask')

ds = xr.Dataset() 
ds['X'] = xr.DataArray(np.random.random((128,128)),dims=['pixx','pixy'],coords={'pixx':np.arange(128),'pixy':np.arange(128)})

write_xarray_dataset(c,ds)
  2. Attempt to read the Dataset/DataArrays:
key = list(c.keys())[-1] # grab last key (assumes that keys are stored in order...not sure if this is true) 

c[key].read() 
# Throws ValueError: conflicting sizes for dimension 'pixx': length 16384 on 'pixx' and length 128 on {'pixx': 'X', 'pixy': 'X'} 

c[key]['X']
# Returns <DaskArrayClient shape=(128, 128) chunks=((128,), (128,)) dtype=float64 dims=('pixx', 'pixy')>

c[key].read('X')
# Returns xarray.Dataset without coordinate labels

c[key].read(optimize_wide_table=False)
# Returns xarray.Dataset with correct data variables and coordinate labels

optimize_wide_table

We noticed that the underlying tiled.client.xarray._build_arrays method returns coordinates with the wrong length when called with optimize_wide_table=True:

data_vars,coords = c[key]._build_arrays(['pixx'],optimize_wide_table=True)
np.array(coords['pixx'][1]).shape
# returns (16384,)

This is hard to spot because coords['pixx'][1] is a Dask delayed object that reports the correct shape=(128,) before being computed, but becomes shape=(16384,) after casting to np.ndarray:

image

Conversely, when optimize_wide_table=False, we get the correct coordinate shape:

data_vars,coords = c[key]._build_arrays(['pixx'],optimize_wide_table=False)
np.array(coords['pixx'][1]).shape
# returns (128,)

In this case coords['pixx'][1] is still a dask.delayed object, but its reported shape and its final computed shape match. The only other difference is in the Dask graph field, and I'm unsure what the consequence of that is.

image

_WideTableFetcher

Digging a little deeper, the problem likely lies in the tiled.client.xarray._WideTableFetcher class, which returns a raveled/flattened version of the 2D data in the DataArray:

from tiled.client.xarray import _WideTableFetcher
w = _WideTableFetcher(c[key].context.http_client.get,c[key].item['links']['full'])
w.register('pixx',c[key]['pixx'],c[key]['pixx'].structure())
w._fetch_variables('pixx')
# Returns flat pandas.DataFrame with shape (16384,2) and a raveled MultiIndex for pixx and pixy 
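
For reference, the same ValueError can be triggered without a tiled server. This is only a minimal sketch, assuming that the flattened index values end up being used as the 'pixx' coordinate; it does not go through the tiled code path, and the small grid size n is my own stand-in for the 128 x 128 case.

```python
import numpy as np
import xarray as xr

n = 4  # small stand-in for the 128 x 128 array in the report
data = np.random.random((n, n))

# Correct coordinate: one label per position along 'pixx'.
good = np.arange(n)

# Flattened coordinate: one label per (pixx, pixy) pair, which is what a
# Cartesian-product index looks like if it is read back without reshaping.
flat = np.repeat(good, n)

# Building the Dataset with the correct coordinate works.
ok = xr.Dataset({"X": (("pixx", "pixy"), data)}, coords={"pixx": good})

# Building it with the flattened coordinate raises a ValueError about
# conflicting sizes for dimension 'pixx'.
try:
    xr.Dataset({"X": (("pixx", "pixy"), data)}, coords={"pixx": flat})
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

The coordinate has n * n entries while the data only has n along 'pixx', which matches the 16384 vs 128 mismatch in the traceback above.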

I'm happy to help debug this further but I got lost trying to figure out what's happening on the server during the get request in tiled.client.xarray._WideTableFetcher._fetch_variables.

martintb, Aug 25 '23 02:08

Update

I think the problem is 'simpler' than I initially thought. After taking a look through tiled.serialization.xarray, I think the issue is in the use of xarray.Dataset.to_dataframe. From the docstring of the to_dataframe method: "...The (returned) DataFrame is indexed by the Cartesian product of this dataset's indices."

This means that this dataset ds

image

becomes this df = ds.to_dataframe()

image

Simply undoing the DataFrame conversion with:

df.to_xarray()

image

returns a Dataset (or DataArray if df happens to be a pandas.Series) with the proper dimensions and coordinate labels. The question is whether, from a performance or conceptual perspective, this conversion to an xarray object is acceptable.
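
A small self-contained sketch of the round trip, using a 3 x 4 array instead of 128 x 128 so the row count is easy to check by hand. The variable names are just for illustration and nothing here touches tiled itself.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
ds["X"] = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=["pixx", "pixy"],
    coords={"pixx": np.arange(3), "pixy": np.arange(4)},
)

# to_dataframe() yields one row per (pixx, pixy) pair, indexed by a
# MultiIndex that is the Cartesian product of the two coordinates.
df = ds.to_dataframe()
n_rows = len(df)
index_names = list(df.index.names)

# to_xarray() unstacks the MultiIndex back into separate dimensions.
restored = df.to_xarray()
same_values = np.array_equal(restored["X"].values, ds["X"].values)

print(n_rows, index_names, restored["X"].dims, same_values)
```

The 2D shape is recovered only because the MultiIndex keeps both level names; once the index is treated as a flat column, that information is gone.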

Alternatively, you can also manually extract your coords, dims, and data like so:

dim_lengths = [len(level) for level in df.index.levels]
data = df['X'].values.reshape(*dim_lengths)

coords = {level.name:level.values for level in df.index.levels}
dims = [level.name for level in df.index.levels]
xr.DataArray(data,dims=dims,coords=coords)

image
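
One caveat with the manual reshape: df.index.levels holds the sorted unique labels of each level, so the reshape is only right if the rows form a complete Cartesian product and are in that same order. Below is a hedged sketch of a more defensive version using only pandas and numpy. The helper name frame_column_to_grid is hypothetical and not part of tiled or xarray.

```python
import numpy as np
import pandas as pd


def frame_column_to_grid(df, column):
    """Hypothetical helper: reshape one column of a MultiIndexed frame to N-D."""
    levels = df.index.levels
    shape = tuple(len(level) for level in levels)
    # A unique index with exactly prod(shape) rows covers every label combination.
    if len(df) != int(np.prod(shape)) or not df.index.is_unique:
        raise ValueError("index is not a full Cartesian product of its levels")
    # Sorting puts the rows in the row-major order that reshape assumes.
    ordered = df.sort_index()
    data = ordered[column].to_numpy().reshape(shape)
    coords = {level.name: level.to_numpy() for level in levels}
    return data, coords


index = pd.MultiIndex.from_product(
    [np.arange(3), np.arange(4)], names=["pixx", "pixy"]
)
df = pd.DataFrame({"X": np.arange(12.0)}, index=index)

# Shuffle the rows to show that the helper does not depend on row order.
shuffled = df.sample(frac=1.0, random_state=0)
data, coords = frame_column_to_grid(shuffled, "X")
print(data.shape, list(coords))
```

The returned data and coords can then be passed to xr.DataArray together with the level names as dims, as in the snippet above.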

martintb, Aug 25 '23 19:08

Hey @martintb, I remember Peter mentioning this, but I completely missed this GH issue when it landed. I've got the tab open now and will either follow up myself or delegate someone to follow up.

danielballan, Oct 11 '23 22:10