Naming tasks by passing xarray metadata

Open TomNicholas opened this issue 2 years ago • 2 comments

I was looking deeper into how to make https://github.com/pydata/xarray/issues/7813 work.

It looks like the nodes are named when they are created by Plan._new. Questions:

  • Would it make sense to name a task corresponding to array creation using the corresponding xarray DataArray.name? Other downstream tasks are composed from multiple arrays, so it's not clear how those should be named, but it seems reasonable to name initial nodes after the corresponding variable.
  • Is it a bad idea to have the naming scheme for a cubed DAG change depending on whether the DAG was created by xarray wrapping cubed or by cubed on its own?
  • Why does from_array create 3 nodes at once? Intuitively I would have thought that creating a single array corresponds to a single node. I can see that the stack goes down through blockwise, but I don't understand why that means 3 nodes.
  • If from_array had a name argument then I could pass DataArray.name to that from xarray. But the actual Plan._new call doesn't happen until way deeper - presumably we don't want to pass a name argument all the way down through map_direct, map_blocks, and blockwise?
  • What's the distinction here between the name of the array and the name of the node? Is the name of the array really used for anything?

TomNicholas, May 06 '23 04:05

  • Would it make sense to name a task corresponding to array creation using the corresponding xarray DataArray.name? Other downstream tasks are composed from multiple arrays, so it's not clear how those should be named, but it seems reasonable to name initial nodes after the corresponding variable.

I'm not sure whether there is always a 1:1 mapping from variables to Cubed arrays, but the thing to do is to try it and see if it works.

  • Is it a bad idea to have the naming scheme for a cubed DAG change depending on whether the DAG was created by xarray wrapping cubed or by cubed on its own?

I don't think this is necessarily a problem. Also, rather than having xarray change the names of the Cubed arrays, it might just add some metadata that we could choose to display more prominently in the visualization of the DAG.
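To make the metadata idea concrete, here is a minimal sketch using a plain dict as a stand-in for the plan. Nothing here is Cubed's actual data structure or API: `new_node`, `label`, and the `display_name` key are all hypothetical names chosen for illustration. The point is only that the generated node name can stay stable while a wrapper attaches a friendlier label for display.

```python
from itertools import count

# Toy stand-in for a plan DAG: node name -> attribute dict.
# This is NOT Cubed's real data structure; all names are hypothetical.
_counter = count()


def new_node(dag, display_name=None):
    """Add a node with an auto-generated name and optional display metadata."""
    name = f"array-{next(_counter):03d}"
    dag[name] = {"display_name": display_name}
    return name


def label(dag, name):
    """Label for drawing the node: the generated name, plus metadata if present."""
    display = dag[name]["display_name"]
    return f"{name} ({display})" if display else name


dag = {}
a = new_node(dag, display_name="air_temperature")  # e.g. a DataArray.name
b = new_node(dag)                                  # created without a wrapper
```

With this shape, the naming scheme is identical whether or not a wrapper is involved; only the label shown in a visualization differs.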

  • Why does from_array create 3 nodes at once? Intuitively I would have thought that creating a single array corresponds to a single node. I can see that the stack goes down through blockwise, but I don't understand why that means 3 nodes.

In the general case the input to from_array could be any array-like object, so Cubed uses map_direct to read chunks from it. There is a fast path for existing Zarr arrays, and it sounds like that could be extended to Xarray-backed Zarr objects too (as long as we don't introduce a hard dependency on Xarray).

What is the array type that you are calling from_array on here?

  • If from_array had a name argument then I could pass DataArray.name to that from xarray. But the actual Plan._new call doesn't happen until way deeper - presumably we don't want to pass a name argument all the way down through map_direct, map_blocks, and blockwise?

We could probably pass the name through.
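As a rough sketch of what passing the name through could look like, the toy functions below forward an optional `name` keyword down a call chain to the point where the node is created. These `toy_*` functions are hypothetical stand-ins that only mirror the layering mentioned above; they are not Cubed's real signatures.

```python
# Hypothetical stand-ins for the call chain; not Cubed's actual functions.

def toy_plan_new(nodes, name=None):
    """Create a node, using the caller's name if given, else a generated one."""
    if name is None:
        name = f"array-{len(nodes):03d}"
    nodes.append(name)
    return name


def toy_blockwise(nodes, **kwargs):
    # Intermediate layers forward keyword arguments untouched.
    return toy_plan_new(nodes, **kwargs)


def toy_map_blocks(nodes, **kwargs):
    return toy_blockwise(nodes, **kwargs)


def toy_from_array(nodes, name=None):
    """Entry point: a wrapper such as xarray could pass DataArray.name here."""
    return toy_map_blocks(nodes, name=name)
```

Forwarding `**kwargs` keeps the intermediate layers unaware of the new argument, at the cost of making the accepted keywords less explicit.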

  • What's the distinction here between the name of the array and the name of the node? Is the name of the array really used for anything?

A plan is a DAG where the nodes are named Zarr arrays. When the arrays are materialized to disk (not all of them are or need to be) they use that name in the filename.
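The relationship between node names and storage can be illustrated with a small toy model. The `.zarr` suffix, the work directory layout, and the `materialize` flag are assumptions made for this sketch, not a description of how Cubed lays out files.

```python
from pathlib import PurePosixPath

# Toy plan: each node is a named array; only some are written to disk.
# Structure and field names are illustrative assumptions.
plan = {
    "array-001": {"deps": [], "materialize": True},
    "array-002": {"deps": ["array-001"], "materialize": False},
    "array-003": {"deps": ["array-002"], "materialize": True},
}


def storage_paths(plan, work_dir):
    """Map each materialized node name to a path derived from that name."""
    return {
        name: str(PurePosixPath(work_dir) / f"{name}.zarr")
        for name, node in plan.items()
        if node["materialize"]
    }


paths = storage_paths(plan, "/tmp/work")
```

Because the path is derived from the node name, any user-supplied name would need to be unique and filesystem-safe under a scheme like this.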

tomwhite, May 11 '23 10:05

Thanks for the explanations.

In the general case the input to from_array could be any array-like object, so Cubed uses map_direct to read chunks from it. There is a fast path for existing Zarr arrays, and it sounds like that could be extended to Xarray-backed Zarr objects too (as long as we don't introduce a hard dependency on Xarray).

What is the array type that you are calling from_array on here?

It depends on how you create the xarray object - see this notebook. Using .chunk calls .from_array on a NumPy array, whereas using open_dataset calls it on one of xarray's lazy indexing adapter classes. In both cases you get a graph with multiple nodes.

I also don't understand why calling .from_array on an array which only has one chunk still creates a graph with two initial dependency arrays - what are we mapping over here?

TomNicholas, May 24 '23 19:05

Closing since this was fixed in #192

tomwhite, Jul 31 '24 16:07