Store intermediate data in a directory named after the compute ID
We should probably make Cubed store its intermediate data in a directory named
{CONTEXT_ID}/{compute_id}, but that's a bit more work.
Originally posted by @TomNicholas in https://github.com/cubed-dev/cubed-benchmarks/pull/10#discussion_r1513284448
Saving temporary data from individual executions into different directories would be useful for benchmarking. This requires the ops to know both the CONTEXT_ID and the compute_id.
Currently the CONTEXT_ID is a global variable and hence always available, but the compute_id is generated when the plan is executed and passed to the executor. The ops functions never see it, so what's the best way to pass the compute_id down so that it's available inside the ops.py functions? Adding extra arguments to e.g. blockwise seems wrong, but a global variable that gets rewritten every time a new execution starts also seems bad...
I think it's worse than that - the intermediate directory paths are created (not on the filesystem, just as strings) as the array functions are called, but before the computation is run - and therefore before the compute_id is created.
I think the easier way to solve the original problem in https://github.com/cubed-dev/cubed-benchmarks/pull/10 would be to just get the intermediate array paths from the DAG.
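Assuming each DAG node that materializes an intermediate array records its target store path (the node/attribute names below are hypothetical, not Cubed's actual plan structure), the benchmark code could collect the paths with a simple walk rather than reconstructing them:

```python
# Sketch: collect intermediate array paths from plan DAG nodes.
# `dag` is assumed to be an iterable of (name, attrs) pairs, as e.g.
# networkx's graph.nodes(data=True) would yield.
def intermediate_paths(dag):
    return [attrs["target"] for _, attrs in dag
            if attrs.get("target") is not None]
```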
Thinking about this more, it would be possible to change lazy_zarr_array to just take an array name ("array-001") not a store, and then turn it into the full path for the Zarr store only when it's created at the beginning of the computation. So it's possible, but still a fairly substantial change.
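The deferred-path idea might look something like this (a minimal sketch with hypothetical names - the real lazy_zarr_array has a different signature): the lazy array records only its short name at graph-construction time, and the full store path is formed once the compute_id exists.

```python
# Sketch: defer forming the full store path until execution begins,
# when the compute_id is finally known.
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LazyZarrArray:
    name: str                  # e.g. "array-001", known at plan time
    path: Optional[str] = None  # filled in when execution starts

    def create(self, context_id, compute_id):
        # Join the run-specific directory only now, at execution time.
        self.path = os.path.join(context_id, compute_id, self.name)
        return self.path
```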
> I think it's worse than that - the intermediate directory paths are created (not on the filesystem, just as strings) as the array functions are called, but before the computation is run - and therefore before the compute_id is created.
So should we perhaps instead create only the known part of the directory path (i.e. the "prefix") at plan-construction time, and then join on the compute_id to form the full path only once execution begins?
> I think the easier way to solve the original problem in https://github.com/cubed-dev/cubed-benchmarks/pull/10 would be to just get the intermediate array paths from the DAG.
So then the benchmark context managers would need to know about the plan object, right? Or could we add it to what's saved in history.plan?
Oh I didn't see your comment when I wrote mine - I think we're suggesting basically the same thing.
I agree this is probably overkill to get https://github.com/cubed-dev/cubed-benchmarks/pull/10 working, but I do think being able to distinguish different run directories might be useful in other contexts (e.g. perhaps an external tool whose job is to periodically purge temporary data from older runs).
> I agree this is probably overkill to get cubed-dev/cubed-benchmarks#10 working, but I do think being able to distinguish different run directories might be useful in other contexts (e.g. perhaps an external tool whose job is to periodically purge temporary data from older runs).
Yes, that would be useful.