
A Dataless Cube

Open bjlittle opened this issue 3 years ago • 10 comments

✨ Feature Request

I think it's healthy to challenge established norms...

I want the ability to create a dataless cube. By this I mean the ability to create a hyper-space defined only by metadata, i.e., with no data payload.

Once data is added to the cube, then the dimensionality is established and locked down, as we traditionally know and accept.

Motivation

Such hyper-spaces could be used in various ways e.g.,

  • as a factory to manufacture fully formed cubes for test data
  • as the target hyper-space in a regridding or interpolation transformation

I'm sure there are more concrete use cases... Please do share any you know of on this issue 🙏

There are many situations where a cube's insistence on having data is simply an inconvenience. Given the natural progression of model resolutions, it seems "just wrong" to abuse dask to create lazy data that will never be used. It reeks of something not being quite right to me.
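For context, the dask workaround being complained about looks roughly like this (the shape is hypothetical; `da.zeros` defers allocation, but it still fabricates a payload that will never be used):

```python
import numpy as np
import dask.array as da

shape = (1000, 2000)  # hypothetical target-grid shape

# Eager workaround: a real zeros "cargo", immediately discarded downstream.
eager = np.zeros(shape, dtype=np.float64)  # ~16 MB allocated for nothing

# Lazy workaround: dask defers allocation until compute() is called -- but
# this still pretends there is data, which is the "abuse" described above.
lazy = da.zeros(shape, dtype=np.float64)
```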

Let's do something about that 😉

Please upvote this issue if you'd like to see this happen 👍

Steps

  • [ ] Complete work described in https://github.com/SciTools/iris/issues/4447#issuecomment-2374361681
  • [ ] Write up remaining 'visions' for a dataless Cube in separate issues. Encourage the users that have upvoted this issue to vote on one/more of the new issues if that outstanding work is important to them
  • [ ] Close this issue

bjlittle avatar Dec 02 '21 15:12 bjlittle

@bjlittle supermegahypercubes! That is, a cube that describes how huge numbers of incoming datasets would tile together to make an n-dimensional hyperstructure - think, for example, of representing an entire model run in a single object. This would ideally be represented as a metadata-only cube, with individual data payloads very much fetched on demand only, given the vast quantities of data such an object would represent.

We've considered this idea from a variety of different perspectives in the Informatics Lab, and we think it has legs. We've also given the idea a bunch of different names, but supermegahypercubes is the best, most whimsical and original name we came up with for the concept 🙂

DPeterK avatar Dec 02 '21 17:12 DPeterK

@bjlittle are you including here the idea that possibly only some of the data might be "filled", with the rest left unspecified? That might be closer to an idea previously suggested, which I think was maybe called a "hypercube", probably in the Informatics Lab? IIRC it was certainly raised before, but we never managed to get around to seriously considering it. ( @DPeterK I can't find an issue link for this -- maybe you can help? )

P.S. as a name, for that idea at least, I think "hypothicube" is neater (though for language purists that should probably be "hypothecube" 😉 )

pp-mo avatar Dec 19 '21 10:12 pp-mo

@bjlittle - re your concrete use-cases: If useful to see some (~pedestrian, non hyp[er|o]cube-y) code-in-wild examples of target hyperspace for interpolation/regridding, I've got a couple here (sorry, only viewable internally@MO). Almost certainly not optimal, but guessing poss still useful to see non-expert usage!

  • Adding a np.zeros .data cargo when defining target cube for a model->model regrid, cargo immediately getting discarded when interpolation/regridding applied. Used here. In this case just a very small cargo, so not especially wasteful to create/discard, but can see that in other cases would be!
  • Alternatively, a similar ~pointless NaNing of data here, for some obs->model comparisons, where the target cube (to eventually accept some interpolated observations) was based off another (model data). Subsequent wrangling of metadata to remove entries irrelevant in the obs data case.

edmundhenley-mo avatar Mar 17 '22 16:03 edmundhenley-mo

@pp-mo - dunno re issue, but wonder if you're recalling the part-filled example in Jacob's hypotheticube article? Or poss another Informatics Lab article? (Here's @DPeterK 's one on supermegahypercubes.)

edmundhenley-mo avatar Mar 17 '22 16:03 edmundhenley-mo

I feed streams of cubes through Machine Learning software (TensorFlow - TF). This requires throwing away the metadata and operating only on the data arrays, and then laboriously reconstructing metadata around the output data. It would be great to be able to cut a cube into data and metadata components, process them separately and recombine them later.
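A minimal sketch of that split/recombine round trip, using a hypothetical `MiniCube` stand-in (not an Iris class; in real Iris code the recombine step could be roughly `cube.copy(data=processed_array)`):

```python
import numpy as np

class MiniCube:
    """Hypothetical stand-in for an Iris cube: metadata plus a data payload."""
    def __init__(self, data, metadata):
        self.data = data
        self.metadata = metadata

def split(cube):
    # Separate the bare array (e.g. for TensorFlow) from the metadata.
    return cube.data, cube.metadata

def recombine(data, metadata):
    # Rebuild a cube around the processed output array.
    return MiniCube(data, metadata)

cube = MiniCube(np.arange(6.0).reshape(2, 3), {"name": "air_temperature"})
data, meta = split(cube)
result = recombine(data * 2.0, meta)  # stand-in for the "ML" processing step
```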

philip-brohan avatar Oct 06 '22 09:10 philip-brohan

In Dragon Taming :tm: discussion today, I suggested that we should AFAP "contain" code changes within the DataManager class, i.e. no or minimal change should be required in Cube code.

Just as a hint for implementation, it is also very simple to make a lazy array which has no data, so can participate normally in any lazy operations, but can't be fetched. You just need an object which supports : shape, dtype, ndim and __getitem__, and you wrap it with dask.array.from_array : I've written code like this a few times, now !

Here's a simple working example.

import dask.array as da
import numpy as np

class FakeArray:
    def __init__(self, shape, dtype):
        if not isinstance(dtype, np.dtype):
            dtype = np.dtype(dtype)
        self.dtype = dtype
        self.shape = shape
        self.ndim = len(shape)  # Dask requires ndim as well as shape, for some reason

    def __getitem__(self, keys):
        raise ValueError("FakeArray cannot be read.")

def lazy_fake(shape, dtype=np.float64):
    """A functional lazy array with known shape and dtype, but no actual data."""
    arr = FakeArray(shape, dtype)
    # Note: must pass 'meta' to from_array, to prevent it making a test data access
    meta = np.zeros((), dtype=arr.dtype)
    return da.from_array(arr, meta=meta)
>>> my_fake = lazy_fake((3, 4), 'i2')
>>> print('fake = ', my_fake)
fake =  dask.array<array, shape=(3, 4), dtype=int16, chunksize=(3, 4), chunktype=numpy.ndarray>
>>> print('fake.meta = ', repr(my_fake._meta))
fake.meta =  array([], shape=(0, 0), dtype=int16)
>>> print('fake[0] = ', my_fake[0])
fake[0] =  dask.array<getitem, shape=(4,), dtype=int16, chunksize=(4,), chunktype=numpy.ndarray>
>>> print(my_fake.compute())
Traceback (most recent call last):
  File "/home/h05/itpp/Support/periods/period_20240710_ugridsprintx1/dev/fake_arrays.py", line 29, in <module>
    print(my_fake.compute())
          ^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/iris3/lib/python3.11/site-packages/dask/base.py", line 342, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/persistent/newconda-envs/iris3/lib/python3.11/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/h05/itpp/Support/periods/period_20240710_ugridsprintx1/dev/fake_arrays.py", line 14, in __getitem__
    raise ValueError("FakeArray cannot be read.")
ValueError: FakeArray cannot be read.
>>> 

pp-mo avatar Jul 01 '24 11:07 pp-mo

To clarify my (mis)understanding of what you mean @pp-mo - the DataManager class is in user-space code? i.e. user-written and maintained, not part of iris?

edmundhenley-mo avatar Jul 01 '24 11:07 edmundhenley-mo

> To clarify my (mis)understanding of what you mean @pp-mo - the DataManager class is in user-space code? i.e. user-written and maintained, not part of iris?

Ah no, not that actually. The DataManager is absolutely a part of Iris. It encapsulates the different types of array content that we can have in a cube.data or coord.points/bounds, and gives them a common API. For now, that basically means a real or a lazy array.

So I was just hoping that, since we already have this class encapsulating the possible array types, it would be neat if we could support "dataless" purely by extending what a DataManager can do, rather than by making a bunch of changes elsewhere, e.g. in the Cube class.
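Purely as an illustration of that containment idea (this is a toy sketch, not Iris's actual `DataManager`), a manager that also admits a dataless state might look like:

```python
import numpy as np

class ToyDataManager:
    """Toy sketch: holds a real array, OR just shape/dtype with no payload."""
    def __init__(self, data=None, shape=None, dtype=None):
        if data is not None:
            data = np.asarray(data)
            shape, dtype = data.shape, data.dtype
        elif shape is None or dtype is None:
            raise ValueError("a dataless manager needs explicit shape and dtype")
        self._data = data
        self.shape = shape
        self.dtype = np.dtype(dtype)

    @property
    def is_dataless(self):
        return self._data is None

    @property
    def data(self):
        if self._data is None:
            raise ValueError("this manager holds no data")
        return self._data

# A dataless manager: shape and dtype only, no payload.
mgr = ToyDataManager(shape=(3, 4), dtype="f4")
# A conventional manager, wrapping a real array.
real = ToyDataManager(data=np.ones((2, 2)))
```

A Cube holding such a manager could report `shape` and `dtype` as usual, with only actual data access failing, keeping the change contained.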

pp-mo avatar Jul 01 '24 15:07 pp-mo

P.S. A further clarification (hopefully): my previous code example also suggests that it might be possible to implement dataless content as "just a special lazy array".
It's not yet clear if it can be quite that simple, though.
And even if it can, we might still want to distinguish "dataless" content in a more definite way.

pp-mo avatar Jul 01 '24 15:07 pp-mo