cubed for awkward arrays
Greetings!
I would like to try integrating awkward array into cubed, and there's an interesting clash, since awkward array already handles array implementation backends (it adds metadata operations on top to deal with ragged and highly structured data).
This would be towards developing a proposal for integration of awkward with cubed, as well as for my own understanding of your task graph api.
I've started hacking around a bit, and having two backend systems immediately makes things a bit clunky. I would appreciate any tips or pointers on how best to continue. I'll find a way forward for sure, but anything to help dive in faster is certainly appreciated!
You'll probably find this xarray issue interesting, because the same restrictions that apply to xarray there will also apply to cubed (i.e. it needs to know the shape and dtype). Essentially, Awkward arrays are not arrays in the sense that the Python array API standard means.
However given that dask-awkward exists, it's conceivable that there is a totally different avenue of making cubed-awkward as another example of generalizing cubed beyond arrays (see #740) - though that would be far more work, if it's even possible.
Hi @lgray, thanks for opening this issue!
You can set the backend array API that Cubed uses for array operations in each task by setting CUBED_BACKEND_ARRAY_API_MODULE. E.g.
CUBED_BACKEND_ARRAY_API_MODULE=awkward pytest ...
Cubed has pretty strict expectations about array shapes - in particular that computed arrays have known shapes before the computation is run, which might be a challenge for awkward array!
Also, this repo by @keewis might be worth a look https://github.com/keewis/nested-duck-arrays
Proceeding towards something like cubed-awkward may be an end goal here. First I just want to try getting some very basic HEP data into your system (if I can manage it) and play around with your task graphs, which is my primary interest.
We learned a lot from building dask-awkward, and cubed reflects a lot of learning from dask as well. If we can control/reduce how much pure metadata re-arrangement gets encoded into the compute graph I think we can make a very much improved version of dask-awkward (especially exploiting the nice separation of concerns in the compute graph here).
I guess - really - I just want to see how a few things feel first. :-)
in particular that computed arrays have known shapes before the computation is run, which might be a challenge for awkward array!
The issue I linked is about how awkward doesn't even have integer tuple shapes as the computation is being run, nor can it associate a single dtype with one array. That completely breaks the array API. However a subset of awkward's functionality can be made to fit into the array API model - I believe that idea of mine led to the development of the ragged package. But even then the shape of a ragged.array can still be
>>> a.shape
(4, None, None)
which I don't think is of much use to cubed... (I guess cubed technically cannot wrap things quite as general as the full array API standard?)
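To make the problem concrete, here is a minimal sketch in plain Python of why a shape like `(4, None, None)` gets in the way of planning ahead of time. The helper names `is_fully_known` and `nbytes_for_shape` are hypothetical and are not part of cubed, awkward, or ragged; the sketch only assumes that a planner wants to compute an array's size in bytes from its shape and item size before running anything.

```python
from typing import Optional, Sequence


def is_fully_known(shape: Sequence[Optional[int]]) -> bool:
    """Return True only if every dimension is a concrete integer."""
    return all(isinstance(dim, int) for dim in shape)


def nbytes_for_shape(shape: Sequence[Optional[int]], itemsize: int) -> Optional[int]:
    """Size in bytes implied by a shape, or None if any dimension is unknown.

    Hypothetical helper: it stands in for the kind of up-front size
    calculation that needs a tuple of integers to work at all.
    """
    if not is_fully_known(shape):
        return None
    nbytes = itemsize
    for dim in shape:
        nbytes *= dim
    return nbytes


regular_shape = (4, 3, 2)
ragged_shape = (4, None, None)  # the shape shown above

regular_nbytes = nbytes_for_shape(regular_shape, itemsize=8)
ragged_nbytes = nbytes_for_shape(ragged_shape, itemsize=8)
```

With the regular shape the size is simply the product of the dimensions times the item size; with the ragged shape there is nothing to multiply, so any size estimate has to come from somewhere else.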
So I think I agree that that just leaves the cubed-awkward idea. 🙂 Would love to hear what you learn @lgray !
Yeah, my one initial thought for your memory bound guarantees is that for unknown lengths we can calculate confidence intervals on the memory usage, i.e. set a fuzzy memory bound at 5 sigma on the mean of memory usage over all sampled columns. This is very fast to compute and maintains nearly all of the spirit of what you are trying to achieve here.
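A rough sketch of what such a fuzzy bound could look like, using only the Python standard library. This is one possible reading of the idea, not an existing cubed or awkward API: the function name `fuzzy_memory_bound` and the sample numbers are made up for illustration, and "5 sigma" is interpreted here as the mean plus five sample standard deviations of the sampled sizes (using the standard error of the mean instead would give a much tighter, less conservative bound).

```python
import math
import statistics
from typing import Sequence


def fuzzy_memory_bound(sampled_nbytes: Sequence[float], n_sigma: float = 5.0) -> int:
    """Estimate a memory bound as mean + n_sigma * sample standard deviation.

    `sampled_nbytes` holds measured sizes in bytes for a sample of chunks
    (or columns). Hypothetical helper, shown only to illustrate the idea.
    """
    if len(sampled_nbytes) < 2:
        raise ValueError("need at least two samples to estimate a spread")
    mean = statistics.fmean(sampled_nbytes)
    sigma = statistics.stdev(sampled_nbytes)  # sample standard deviation
    return math.ceil(mean + n_sigma * sigma)


# Illustrative, made-up sample of per-chunk sizes in bytes.
samples = [100, 110, 90, 105, 95]
bound = fuzzy_memory_bound(samples)

# A planner could then compare the estimate against its memory budget.
allowed_mem = 200
fits = bound <= allowed_mem
```

Unlike a bound derived from a known shape, this is a statistical estimate, so a chunk far out in the tail of the size distribution could still exceed it.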
A "fuzzy memory bound" is also related to the issue I just raised: https://github.com/cubed-dev/cubed/issues/749