Cubed for non-array data?
@dcherian suggested this to me at SciPy, as did @applio. The idea is that although right now Cubed is scoped only to array computations, most of its design isn't really specific to arrays at all.
Which layers are array-specific?
Working through the layers in this diagram from the top down:
- Array API - This part obviously is array-specific, and would need to be adapted somehow. You could either present an entire alternative dataframe API (there is a standard for this too in Python), or maybe have an adapter layer that translates dataframe operations into array operations before executing that array plan.
- Core Ops - The core operations in Cubed today (e.g. reduction) are intended for arrays, but often also make sense for 1D data (i.e. table columns). Dataframes might need a couple of other core ops to be implemented too, such as join.
- Primitive Ops - Cubed's central insight is to express all array operations in terms of only two primitive ops: blockwise and rechunk, both of which can be performed with bounded memory usage. But neither blockwise nor rechunk seems to be fundamentally array-specific: the first is ultimately just a map over a set of chunks/partitions, and the second is just a repartitioning of the data into a different pattern of chunks, both of which arguably also make sense in a tabular context (see the first sketch after this list).
- Runtime - All of Cubed's Executors just execute sets of embarrassingly parallel pure functions. The runtime layer is therefore clearly not specific to arrays - the function executors that get called (e.g. Lithops) are already generally unaware that they are even being applied to stages of an array workload (see the second sketch after this list).
- Storage - Currently Cubed serializes intermediate data to Zarr, but as has been pointed out in #727, we don't actually need to serialize to an array format. In fact we don't need to serialize to any existing format at all - we just need to save some kind of binary blob that Cubed can read back when resuming a half-complete computation later. The choice to save state as arrays is coincidental; what we are really doing is saving the entire state of the half-complete computation, in any format that Cubed can resume from. As Cubed is the only program that will ever need to read this state back, it is free to save the state in any format it likes. The only real requirement of the storage layer is that it is a key-value store that can hold arbitrary binary blobs, works across a flexible variety of storage media, and scales. Zarr is a convenient pre-existing choice that fulfills those requirements, but we don't have to use Zarr - we don't have to use an array format at all. I think a minimal key-value store abstraction that fits this description is obstore. (@applio could it be useful to make an obstore-like Python store that wraps Dragon's distributed dict? cc @kylebarron.) If we're free to serialize anything, we can obviously choose to serialize tabular data; the third sketch after this list shows the kind of minimal key-value interface I mean. (As an aside, this storage generalization might be useful for other things too, e.g. performance, or serializing the state of other intermediate things that aren't arrays. @dcherian also pointed out that one of the strengths of Dask is its ability to communicate arbitrary information between workers via pickling. Flox apparently pickles some dataclasses during some operations - for Cubed the equivalent would be pickling the dataclasses and then writing them to a generic key-value storage layer.)
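To make the primitive-ops point concrete, here is a minimal sketch of what the two primitives might look like over partitioned tables. Everything here is hypothetical: `blockwise` and `repartition` are illustrative names rather than Cubed's actual API, a "partitioned table" is just a list of pandas DataFrames, and the naive `repartition` concatenates in memory purely to show the semantics (a bounded-memory implementation would stage through intermediate storage, as Cubed's rechunk does).

```python
# Hypothetical sketch, not Cubed's actual API. A "partitioned table" is
# modelled as a list of pandas DataFrames standing in for on-disk chunks.
from typing import Callable

import numpy as np
import pandas as pd


def blockwise(
    func: Callable[[pd.DataFrame], pd.DataFrame],
    partitions: list[pd.DataFrame],
) -> list[pd.DataFrame]:
    """Tabular analogue of blockwise: apply a pure function to each
    partition independently (embarrassingly parallel, memory bounded by
    the size of one partition)."""
    return [func(part) for part in partitions]


def repartition(
    partitions: list[pd.DataFrame], npartitions: int
) -> list[pd.DataFrame]:
    """Tabular analogue of rechunk: same rows, different partitioning.

    NOTE: concatenating in memory here is only to show the semantics; a
    bounded-memory version would stream through intermediate storage."""
    combined = pd.concat(partitions, ignore_index=True)
    return list(np.array_split(combined, npartitions))


# Toy usage: double a column blockwise, then rebalance into 3 partitions.
parts = [pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3, 4, 5]})]
doubled = blockwise(lambda df: df.assign(x=df["x"] * 2), parts)
rebalanced = repartition(doubled, npartitions=3)
```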
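And here is the runtime layer reduced to its essence: mapping pure task functions over inputs. `ThreadPoolExecutor` is just a stand-in for Lithops or any other function executor, and the task body is made up for illustration - the point is that nothing at this layer knows whether the tasks touch arrays or tables.

```python
# Sketch only: ThreadPoolExecutor stands in for e.g. Lithops. The executor
# just maps pure functions over task inputs; it is data-model-agnostic.
from concurrent.futures import ThreadPoolExecutor


def task(partition_id: int) -> str:
    # In Cubed a task would read a chunk from storage, apply a blockwise
    # function, and write the result back to intermediate storage.
    return f"processed partition {partition_id}"


with ThreadPoolExecutor() as pool:
    results = list(pool.map(task, range(8)))

print(results)
```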
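Finally, a minimal sketch of the key-value storage abstraction described in the Storage bullet, plus the "pickle a dataclass and write it to the store" pattern from the aside. The `BlobStore` protocol and `InMemoryStore` are hypothetical, not an existing Cubed or obstore interface; a real implementation might wrap object storage (e.g. via obstore) or Dragon's distributed dict.

```python
# Hypothetical minimal key-value storage interface: arbitrary binary blobs
# under string keys. Not an existing Cubed or obstore API.
import pickle
from dataclasses import dataclass
from typing import Protocol


class BlobStore(Protocol):
    """Anything that can hold arbitrary binary blobs under string keys."""

    def put(self, key: str, value: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Toy implementation backed by a dict; a real one could wrap object
    storage or a distributed dict."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]


# The "pickle a dataclass and write it to the store" pattern from the aside:
@dataclass
class PartialResult:
    count: int
    total: float


store: BlobStore = InMemoryStore()
store.put("intermediate/partial-0", pickle.dumps(PartialResult(count=10, total=3.5)))
restored = pickle.loads(store.get("intermediate/partial-0"))
```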
Would this work?
I think the main question here is whether or not the same trick of expressing all array operations in terms of only the two primitive ops can also be done for tabular data, or whether there is some important tabular operation that somehow breaks this model.
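To make the question concrete, here is a sketch of how one non-trivial tabular operation, a groupby-sum, could plausibly decompose into just those two primitives: a blockwise partial aggregation, a key-based repartition (the shuffle, playing the role of rechunk), and a blockwise combine. The hash-routing code is purely illustrative; whether every important tabular operation (arbitrary joins, sorts, window functions, ...) decomposes this cleanly with bounded memory is exactly the open question.

```python
# Hypothetical decomposition of a groupby-sum into the two primitives:
# blockwise (partial aggregation) -> repartition by key -> blockwise (combine).
# Partitions are plain pandas DataFrames; none of this is Cubed's real API.
import pandas as pd

partitions = [
    pd.DataFrame({"key": ["a", "b", "a"], "x": [1, 2, 3]}),
    pd.DataFrame({"key": ["b", "c", "a"], "x": [4, 5, 6]}),
]

# 1. blockwise: partial aggregation within each partition (bounded memory).
partials = [p.groupby("key", as_index=False)["x"].sum() for p in partitions]

# 2. repartition: route rows by hash of the key so all partials for a given
#    key land in the same partition (the tabular analogue of rechunk).
nparts = 2
shuffled = [
    pd.concat([p[p["key"].map(hash) % nparts == i] for p in partials])
    for i in range(nparts)
]

# 3. blockwise: combine the partials within each partition.
result = [p.groupby("key", as_index=False)["x"].sum() for p in shuffled]
print(pd.concat(result, ignore_index=True))
```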
Would this be useful?
The tabular data processing space is far more crowded than the array processing space. One of the main selling points of Cubed - bounded memory usage to ensure reliability when completing larger-than-memory workloads - is already offered by DuckDB (see #492). However DuckDB cannot scale across multiple machines, so there is possibly still an unfulfilled niche here. The extreme flexibility and simplicity of Cubed's executors is also still appealing.
Note also that if we extended Cubed to non-array workloads its scope would become more similar to that of Dask, making it another example of convergent evolution (see #570).
Bonus: if this would work, are there any other data/computation models besides array/tabular/embarrassingly-parallel that Cubed's concepts could also be adapted for?
Thanks for opening this @TomNicholas. I agree with your analysis, and I'd also suggest building it in a separate project and sharing the parts needed (storage, runtime, maybe primitives, etc). It's a lot of work 😄
> You could either present an entire alternative dataframe API (there is a standard for this too in Python)
Looks like the Dataframe API standard is not being actively developed: https://github.com/data-apis/dataframe-api
> @dcherian also pointed out that one of the strengths of Dask is its ability to communicate arbitrary information between workers via pickling.
We have a limited form of this to support Icechunk (to merge sessions), but I wonder if this needs generalising to support other cases? This seems separate to (and a lot easier than) full dataframe support.
> The tabular data processing space is far more crowded than the array processing space.
Absolutely! It would be nice if there were more in the array processing space too.
> However DuckDB cannot scale across multiple machines, so there is possibly still an unfulfilled niche here.
MotherDuck does this IIUC.
> Looks like the Dataframe API standard is not being actively developed: data-apis/dataframe-api
I think the Arrow PyCapsule Interface more or less superseded the DataFrame API because it's much easier to implement for any Arrow-based library, and most modern DataFrame libraries are Arrow-based.