xarray-simlab icon indicating copy to clipboard operation
xarray-simlab copied to clipboard

Sparse arrays

Open jvail opened this issue 3 years ago • 13 comments

:+1: for the new release.

I was wondering if it would be laborious to support sparse (https://github.com/pydata/sparse) arrays. I have an issue with this line

https://github.com/benbovy/xarray-simlab/blob/master/xsimlab/stores.py#L193 dtype = getattr(value, "dtype", np.asarray(value).dtype)

I guess in case of sparse it would just be: dtype = getattr(value, "dtype", value.dtype)

But maybe after that line there are other issues as well. I just wanted to ask asap if you think sparse could be easily supported with a few tweaks. Because if not then I might just have to skip the idea of using sparse arrays at all.

Thank you, Jan

jvail avatar Jan 27 '21 05:01 jvail

Sorry for the late reply @jvail !

Yes it would be nice if xarray-simlab could support sparse arrays (I haven't tried yet). Xarray-simlab uses zarr to store the model outputs and uses xarray.open_zarr to load back the zarr store as a xarray Dataset. So the question is whether zarr and xarray.open_zarr would easily support that. There seems to be interest in that, but I don't think it could be supported with a few tweaks: https://github.com/zarr-developers/zarr-python/issues/152, https://github.com/pydata/sparse/issues/222, https://github.com/zarr-developers/zarr-python/issues/424.

benbovy avatar Feb 08 '21 08:02 benbovy

Sorry for the late reply @jvail !

No need to be sorry!

Alright - Thank you. I'll try to read the zarr related stuff and try a dev install of simlab to see how far I can get.

jvail avatar Feb 08 '21 10:02 jvail

Hi @benbovy ,

here is a test for some minimal sparse support: https://github.com/jvail/xarray-simlab/commit/107d2e0a2bebff4706043d9708f4fba3e6f2b53d#diff-932f643c875a75e89dbf205d2764a1ac25b260f5c12503543242f833e8359a62

It turns out it is impossible to use sparse with intent='inout' because zarr wont accept it and if you auto densify you are back at square one because you would have to "re-sparse" after reading the input.

It works well internally - meaning inside a process without having the variable either in input_vars or output_vars - without any changes in the source code.

If the sparse variable is a model output (in output_vars) you would have to call todense() on it (see my commit). Not sure if you want that. But at least it would provide a bit of support.

Another way to handle sparse in outputs could be to just drop all sparse when writing to zarr and issue a warning.

jvail avatar Feb 13 '21 04:02 jvail

Hi @jvail,

Thanks for your example.

Calling todense() before writing to zarr (and issue a warning) seems reasonable to me. It's better than no support at all.

Hopefully, better support for sparse arrays (and maybe other array backends) will be fixed (in zarr and xarray) upstream in the future.

In the meantime, maybe we could implement some ad-hoc conversion in xarray-simlab, i.e., for sparse arrays having coords and data saved as two zarr arrays, and then reconstruct the sparse COO array after loading the zarr dataset and before returning the xarray Dataset from Dataset.xsimlab.run().

benbovy avatar Feb 13 '21 13:02 benbovy

In the meantime, maybe we could implement some ad-hoc conversion in xarray-simlab, i.e., for sparse arrays having coords and data saved as two zarr arrays, and then reconstruct the sparse COO array after loading the zarr dataset and before returning the xarray Dataset from Dataset.xsimlab.run().

Thank you Benoit. That sounds like a pretty good idea. Should work with 'inout' as well - I'll give it a try.

jvail avatar Feb 13 '21 14:02 jvail

Salut @benbovy,

if you find time please take a look at this commit/branch: https://github.com/jvail/xarray-simlab/commit/cfa37a69796985beca19ffadcef70d8005ccb1d7

Instead of adding two zarr items - as you suggested - I tried to use the DOK format which nicely translates into a structured numpy array. That in turn is digestible by zarr.

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

There is a little notebook as well.

jvail avatar Mar 02 '21 14:03 jvail

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

Without having looked at your branch yet, my first idea would be to fix that in Xarray :-) !

benbovy avatar Mar 02 '21 14:03 benbovy

Unfortunately I could not get it working with the "infamous" mask_and_scale: True setting. Maybe you have an idea?

Without having looked at your branch yet, my first idea would be to fix that in Xarray :-) !

hmpf - that is beyond my reach. But maybe we can can get around it by injecting some encoding/decoding options - I hope.

jvail avatar Mar 03 '21 03:03 jvail

I just had a look at your https://github.com/jvail/xarray-simlab/commit/cfa37a69796985beca19ffadcef70d8005ccb1d7 branch. That's nice! I think that such support for sparse arrays would be a nice addition in xarray-simlab before proper support for sparse is available in zarr (and zarr->xarray).

A comment about your implementation: the approach that you use won't scale well if later we want to support other special cases. The VariableCoder approach used in Xarray would nicely fit here, even for one special case. Unfortunately it is not part (yet?) of the public API, but we could reuse the same approach here.

benbovy avatar Mar 08 '21 14:03 benbovy

A comment about your implementation: the approach that you use won't scale well if later we want to support other special cases. The VariableCoder approach used in Xarray would nicely fit here, even for one special case. Unfortunately it is not part (yet?) of the public API, but we could reuse the same approach here.

Yes, that's true. I'll try to propose something more general. But it seems you are d'accord with this approach - generally speaking? I'd be interested in moving this forward quickly so it might have a chance to be part of the next release.

Do you have an idea how to fix the mask_and_scale: True issue? Could it be that now xarray does complain about structured arrays coming from zarr? If so, I might be back at square one :\

jvail avatar Mar 09 '21 07:03 jvail

Yes generally I'm d'accord !

Do you have an idea how to fix the mask_and_scale: True issue? Could it be that now xarray does complain about structured arrays coming from zarr? If so, I might be back at square one :\

I'm afraid there's currently no workaround other than mask_and_scale: False. I don't think Xarray's CFMaskCoder supports fill values defined for structured dtypes, but I guess it wouldn't be hard to support it.

benbovy avatar Mar 09 '21 08:03 benbovy

I'm afraid there's currently no workaround other than mask_and_scale: False.

Apparently setting the zarr fill_value to None solves it.

jvail avatar Mar 09 '21 11:03 jvail

It turned out that we do not desperately need sparse arrays right now. With our current model/data we stay below a "critical" size (wherever that is) with our indices. But if they grow much larger in future and therefore arrays get sparser and sparser support for sparse would be very welcome. Also I'd like to wait until the dust has settled a bit around all the refactoring and new features in open PRs and a new version has been released.

jvail avatar Apr 05 '21 06:04 jvail