cubed
Naming tasks by passing xarray metadata
I was looking deeper into how to make https://github.com/pydata/xarray/issues/7813 work.
So it looks like the nodes are named when they are created by Plan._new. Questions:
- Would it make sense to name a task corresponding to array creation using the corresponding xarray DataArray.name? Other downstream tasks are composed from multiple arrays, so it's not clear how those should be named, but it seems reasonable to name initial nodes after the corresponding variable.
- Is it a bad idea to have the naming scheme for a cubed DAG change depending on whether or not the DAG was created by xarray wrapping cubed or just by cubed on its own?
- Why does from_array create 3 nodes at once? Intuitively I would have thought that creating a single array would correspond to a single node. I can see that the stack goes down through blockwise, but I don't understand why that means 3 nodes.
- If from_array had a name argument then I could pass DataArray.name to that from xarray. But the actual Plan._new call doesn't happen until way deeper - presumably we don't want to pass a name argument all the way down through map_direct, map_blocks, and blockwise?
- What's the distinction here between the name of the array and the name of the node? Is the name of the array really used for anything?
- Would it make sense to name a task corresponding to array creation using the corresponding xarray DataArray.name? Other downstream tasks are composed from multiple arrays, so it's not clear how those should be named, but it seems reasonable to name initial nodes after the corresponding variable.
I'm not sure how this works out with there being a 1:1 mapping from variables to cubed arrays, but the thing to do is to try it and see if it works.
- Is it a bad idea to have the naming scheme for a cubed DAG change depending on whether or not the DAG was created by xarray wrapping cubed or just by cubed on its own?
I don't think this is necessarily a problem. Also, rather than having xarray change the name of the cubed arrays, it might just add some metadata that we could choose to display more prominently on the visualization of the DAG.
- Why does from_array create 3 nodes at once? Intuitively I would have thought that creating a single array would correspond to a single node. I can see that the stack goes down through blockwise, but I don't understand why that means 3 nodes.
In the general case from_array could be any array-like object, so Cubed uses map_direct to read chunks from the array-like object. There is a fast path for existing Zarr arrays, and it sounds like that could be extended to Xarray-backed Zarr objects too (as long as we don't introduce a hard dependency on Xarray).
What is the array type that you are calling from_array on here?
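To illustrate what "reading chunks from an array-like object" means in the general case, here is a minimal sketch using only the standard library. It is not Cubed's implementation: Grid, chunk_slices and read_chunks are hypothetical names made up for this example, and the only thing assumed of the source is that it can be indexed with a tuple of slices.

```python
from itertools import product


class Grid:
    """Toy 2-D array-like backed by nested lists (hypothetical stand-in
    for a NumPy array or one of xarray's lazy indexing adapters)."""

    def __init__(self, rows):
        self.rows = rows
        self.shape = (len(rows), len(rows[0]))

    def __getitem__(self, key):
        row_slice, col_slice = key
        return [row[col_slice] for row in self.rows[row_slice]]


def chunk_slices(shape, chunks):
    """Yield (block_id, slices) for every chunk of an array of this shape."""
    # Number of blocks per axis, rounding up so a ragged last chunk is included.
    nblocks = [-(-s // c) for s, c in zip(shape, chunks)]
    for block_id in product(*(range(n) for n in nblocks)):
        slices = tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(block_id, chunks, shape)
        )
        yield block_id, slices


def read_chunks(array_like, chunks):
    """Read each chunk independently by slicing the source object."""
    return {
        block_id: array_like[slices]
        for block_id, slices in chunk_slices(array_like.shape, chunks)
    }


grid = Grid([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
blocks = read_chunks(grid, chunks=(2, 2))
```

Each block is read on its own, which is why the source only needs to support slicing and does not have to be a Zarr array.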
- If from_array had a name argument then I could pass DataArray.name to that from xarray. But the actual Plan._new call doesn't happen until way deeper - presumably we don't want to pass a name argument all the way down through map_direct, map_blocks, and blockwise?
We could probably pass the name through.
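As a rough sketch of what passing the name through could look like: the functions below are toy stand-ins that only mirror the call chain mentioned above; they are not Cubed's real signatures, and the auto-generated name format is an assumption made for illustration.

```python
import itertools
from typing import Optional

_ids = itertools.count(1)


def toy_new_node(name: Optional[str] = None) -> str:
    """Stand-in for the point where a plan node gets its name."""
    # Use the caller's name if given, otherwise generate one (assumed format).
    return name if name is not None else f"array-{next(_ids):03d}"


def toy_blockwise(source, name: Optional[str] = None) -> str:
    return toy_new_node(name=name)


def toy_map_blocks(source, name: Optional[str] = None) -> str:
    return toy_blockwise(source, name=name)


def toy_map_direct(source, name: Optional[str] = None) -> str:
    return toy_map_blocks(source, name=name)


def toy_from_array(source, name: Optional[str] = None) -> str:
    # An optional keyword is forwarded unchanged through every layer.
    return toy_map_direct(source, name=name)


named = toy_from_array([1, 2, 3], name="temperature")
unnamed = toy_from_array([1, 2, 3])
```

Because the argument is optional and defaults to None at every level, callers that don't care about names would not need to change.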
- What's the distinction here between the name of the array and the name of the node? Is the name of the array really used for anything?
A plan is a DAG where the nodes are named Zarr arrays. When the arrays are materialized to disk (not all of them are or need to be) they use that name in the filename.
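A toy model may help separate the two ideas: the node name, which identifies the array and ends up in the filename, and an optional display label such as DataArray.name, which would only affect the visualization. This is a sketch under assumptions, not Cubed's Plan class: ToyPlan, its methods, and the path layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlan:
    """Toy DAG: node name -> metadata, plus (source, target) edges."""

    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name, inputs=(), label=None):
        # `name` identifies the node; `label` is display-only metadata.
        self.nodes[name] = {"label": label}
        self.edges.extend((src, name) for src in inputs)

    def display_name(self, name):
        # Show the label next to the node name when one was attached.
        label = self.nodes[name]["label"]
        return f"{name} ({label})" if label else name

    def store_path(self, name, work_dir="/tmp/work"):
        # Assumed layout: the node name is reused in the on-disk filename.
        return f"{work_dir}/{name}.zarr"


plan = ToyPlan()
plan.add_node("array-001", label="temperature")
plan.add_node("array-002", inputs=["array-001"])
```

With this split, the name used on disk stays the same whether or not xarray is involved, and only the label shown in the graph changes.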
Thanks for the explanations.
In the general case from_array could be any array-like object, so Cubed uses map_direct to read chunks from the array-like object. There is a fast path for existing Zarr arrays, and it sounds like that could be extended to Xarray-backed Zarr objects too (as long as we don't introduce a hard dependency on Xarray).
What is the array type that you are calling from_array on here?
It depends on how you create the xarray object - see this notebook. Using .chunk calls .from_array on a NumPy array, whereas using open_dataset calls it on one of xarray's lazy indexing adapter classes. In both cases you get a graph with multiple nodes.
I also don't understand why calling .from_array on an array which only has one chunk still creates a graph with two initial dependency arrays - what are we mapping over here?
Closing since this was fixed in #192