
Chunk dependency tracing: Re-compute only necessary chunks in Cubed plan

Status: Open · TomNicholas opened this issue 10 months ago · 8 comments

Concept

Icechunk solves the problem of handling incremental updates to Zarr stores: users can transparently track changes to datasets at the chunk level. Real-world data pipelines often compute some aggregate result from one or more entire input datasets, but currently, if even one chunk in a store changes, obtaining the new result likely means recomputing over the entire dataset. This is potentially massively wasteful when only part of the new result actually depends on the chunks that changed since the last version of the input dataset.
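As a toy illustration of chunk-level change tracking (this is not Icechunk's actual API, just a hedged sketch): model each dataset version as a manifest mapping chunk keys to content hashes, and diff two manifests to find which chunks changed.

```python
# Hypothetical sketch of chunk-level change tracking, NOT Icechunk's real API.
# Each version of a store is modelled as a manifest: chunk key -> content hash.
v1 = {"data/0.0": "h1", "data/0.1": "h2", "data/1.0": "h3", "data/1.1": "h4"}
v2 = {"data/0.0": "h1", "data/0.1": "h9", "data/1.0": "h3", "data/1.1": "h4"}

def changed_chunks(old, new):
    """Return the chunk keys whose content differs (or is new) between versions."""
    return {k for k in new if old.get(k) != new[k]}

print(changed_chunks(v1, v2))  # -> {'data/0.1'}
```

In this picture, only `data/0.1` changed between the two versions, so only results depending on that chunk would need recomputing.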

Can we use Cubed to automatically re-compute only the output chunks that actually depend on the updated input chunks?

This would be an extremely powerful optimization - in the extreme case, the differences in the re-computed result might depend on only 1 or 2 updated chunks of the original dataset, so only 1 or 2 output chunks need to be re-computed instead of the entire result.

Cubed's plan potentially contains enough information to trace backwards from the desired result all the way to exactly which input chunks it depends on.
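A minimal sketch of what such backward tracing could look like. The plan is modelled here as a toy mapping from each `(array, chunk)` node to the upstream chunks it reads - this is illustrative only, not Cubed's actual plan data structure:

```python
# Hedged sketch: backward dependency tracing over a toy chunk-level plan.
# Real Cubed plans are DAGs of array operations; here each (array, chunk)
# node simply lists the upstream chunks it reads. Nodes with no entry in
# the plan are source (input) chunks.
from collections import deque

# Toy plan: input "a" (4 chunks) -> blockwise "b" -> full reduction "c".
plan = {
    ("b", 0): [("a", 0)],
    ("b", 1): [("a", 1)],
    ("b", 2): [("a", 2)],
    ("b", 3): [("a", 3)],
    ("c", 0): [("b", 0), ("b", 1), ("b", 2), ("b", 3)],
}

def input_chunks_for(node, plan):
    """Walk the plan backwards from `node`, collecting source chunks."""
    seen, stack, inputs = set(), deque([node]), set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        parents = plan.get(n)
        if parents is None:  # no upstream edges: this is an input chunk
            inputs.add(n)
        else:
            stack.extend(parents)
    return inputs

def dirty_outputs(changed, plan, outputs):
    """Output chunks whose input-chunk footprint overlaps the changed set."""
    return {o for o in outputs if input_chunks_for(o, plan) & changed}
```

For the blockwise array `b`, each output chunk depends on exactly one input chunk, so a single changed input chunk dirties a single chunk of `b`; for the full reduction `c`, every output chunk depends on every input chunk, so any change forces its recomputation. A real implementation would also need to account for rechunking and other operations whose chunk footprints are not one-to-one.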

cc @rabernat (whose idea this was) @tomwhite @sharkinsspatial

TomNicholas · Dec 13 '24