Store intermediate data in a directory named after the compute ID
We should probably make Cubed store its intermediate data in a directory named
{CONTEXT_ID}/{compute_id}, but that's a bit more work.
Originally posted by @TomNicholas in https://github.com/cubed-dev/cubed-benchmarks/pull/10#discussion_r1513284448
Saving temporary data from individual executions into different directories would be useful for benchmarking. This requires the ops to know both the CONTEXT_ID and the compute_id.
Currently the CONTEXT_ID is a global variable and hence always available, but the compute_id is generated when the plan is executed and passed to the executor. The ops functions never see it, so what's the best way to pass the compute_id down so that it's available inside the ops.py functions? Adding extra arguments to e.g. blockwise seems wrong, but a global variable that gets rewritten every time a new execution starts also seems bad...
I think it's worse than that - the intermediate directory paths are created (not on the filesystem, just as strings) as the array functions are called, but before the computation is run - and therefore before the compute_id is created.
I think the easier way to solve the original problem in https://github.com/cubed-dev/cubed-benchmarks/pull/10 would be to just get the intermediate array paths from the DAG.
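Assuming each DAG node that materializes an intermediate array records its target store path (the node/attribute names below are hypothetical, not Cubed's actual plan structure), the benchmark code could collect the paths with a simple walk rather than reconstructing them:

```python
# Sketch: collect intermediate array paths from plan DAG nodes.
# `dag` is assumed to be an iterable of (name, attrs) pairs, as e.g.
# networkx's graph.nodes(data=True) would yield.
def intermediate_paths(dag):
    return [attrs["target"] for _, attrs in dag
            if attrs.get("target") is not None]
```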
Thinking about this more, it would be possible to change lazy_zarr_array to just take an array name ("array-001") not a store, and then turn it into the full path for the Zarr store only when it's created at the beginning of the computation. So it's possible, but still a fairly substantial change.
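The deferred-path idea might look something like this (a minimal sketch with hypothetical names - the real lazy_zarr_array has a different signature): the lazy array records only its short name at graph-construction time, and the full store path is formed once the compute_id exists.

```python
# Sketch: defer forming the full store path until execution begins,
# when the compute_id is finally known.
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class LazyZarrArray:
    name: str                  # e.g. "array-001", known at plan time
    path: Optional[str] = None  # filled in when execution starts

    def create(self, context_id, compute_id):
        # Join the run-specific directory only now, at execution time.
        self.path = os.path.join(context_id, compute_id, self.name)
        return self.path
```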
> I think it's worse than that - the intermediate directory paths are created (not on the filesystem, just as strings) as the array functions are called, but before the computation is run - and therefore before the compute_id is created.
So should we perhaps instead create only the known part of the directory path (i.e. the "prefix") at plan-construction time, and then join on the compute_id to form the full path only once execution begins?
> I think the easier way to solve the original problem in https://github.com/cubed-dev/cubed-benchmarks/pull/10 would be to just get the intermediate array paths from the DAG.
So then the benchmark context managers would need to know about the plan object, right? Or could we add it to what's saved in history.plan?
Oh I didn't see your comment when I wrote mine - I think we're suggesting basically the same thing.
I agree this is probably overkill to get https://github.com/cubed-dev/cubed-benchmarks/pull/10 working, but I do think being able to distinguish different run directories might be useful in other contexts (e.g. perhaps an external tool whose job is to periodically purge temporary data from older runs).
> I agree this is probably overkill to get cubed-dev/cubed-benchmarks#10 working, but I do think being able to distinguish different run directories might be useful in other contexts (e.g. perhaps an external tool whose job is to periodically purge temporary data from older runs).
Yes, that would be useful.