xarray Dataset.from

Currently, to create a Dataset from an existing numpy recarray (not a DataArray, which is broken with recarrays anyway due to #1434), I couldn't find an easier way than:

df = xr.Dataset.from_dataframe(pd.DataFrame(my_recarray).set_index('foo'))

(which is kind of dumb since it allocates the memory twice)

It would definitely be nice to be able to do just this (perhaps with extra arguments to set the index on the fly, etc.):

df = xr.Dataset.from_records(my_recarray, ...)

(Apologies if I'm missing something obvious.)

Mar 20 '19 00:03 aldanor

Turning a record array into a dict of arrays is pretty straightforward, e.g., arrays = {name: my_recarray[name] for name in my_recarray.dtype.names}

You could then pass this into xr.Dataset, but you'll also have to set dimension names.
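A minimal sketch of that approach, for illustration only. The helper name dataset_from_records, its index argument, and the sample field names are hypothetical and not part of xarray. It assumes the index field is one-dimensional and should become the dimension coordinate, roughly mirroring set_index('foo') in the pandas route above:

```python
import numpy as np
import xarray as xr


def dataset_from_records(records, index):
    """Hypothetical helper (not an xarray API): build a Dataset from a
    1-D record array, using the field named `index` as the dimension."""
    names = records.dtype.names
    # Every non-index field becomes a data variable along the index dimension.
    data_vars = {
        name: (index, records[name]) for name in names if name != index
    }
    # A 1-D coordinate named after the dimension becomes its index coordinate.
    return xr.Dataset(data_vars, coords={index: records[index]})


# Made-up sample data, only to exercise the helper.
my_recarray = np.rec.fromrecords(
    [(1, 0.5, 10), (2, 1.5, 20), (3, 2.5, 30)],
    names=["foo", "bar", "baz"],
)

ds = dataset_from_records(my_recarray, index="foo")
print(ds)
```

This goes straight from the record array's fields to xarray variables, without building an intermediate DataFrame.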

Mar 20 '19 01:03 shoyer

I guess I expected it to “just work” since it’s part of numpy core functionality. (In the same way, you can just pass a recarray to the pandas DataFrame constructor and it infers the rest, without you having to create a dict of columns manually. There’s only one way to do it, so it can be done automatically.)

Mar 20 '19 02:03 aldanor

We could potentially pick dimension names automatically, but it's not an entirely obvious thing to do, since passing a dict of numpy arrays into the xarray.Dataset constructor isn't supported (but I guess we could support that).

Mar 20 '19 05:03 shoyer

For any future travelers who come across this: a slight twist on the patterns above is xr.Dataset({name: (("dim",), my_recarray[name]) for name in my_recarray.dtype.names}) — the data will be loaded on a single dimension, and then .set_index(dim=['a','b']) can be used to set the appropriate indexes vs variables.
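Spelled out as a runnable sketch, with made-up sample data (the field names a, b, and value are assumptions for illustration; "dim" is just a placeholder dimension name):

```python
import numpy as np
import xarray as xr

# Made-up record array with two key fields and one value field.
my_recarray = np.rec.fromrecords(
    [(1, 10, 0.5), (1, 20, 1.5), (2, 10, 2.5)],
    names=["a", "b", "value"],
)

# Load every field as a variable along a single placeholder dimension.
ds = xr.Dataset(
    {name: (("dim",), my_recarray[name]) for name in my_recarray.dtype.names}
)

# Promote "a" and "b" to a multi-index on that dimension;
# "value" stays behind as the only data variable.
indexed = ds.set_index(dim=["a", "b"])
print(indexed)
```

Passing a list of two names to set_index builds a MultiIndex over the dimension; passing a single name instead would give a plain index.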

May 13 '21 20:05 max-sixty

Dataset.from_records()?