Is there a way to control the chunk sizes during an alignment or reindex operation?
Is your feature request related to a problem?
Since the 2024.08.02 release of Dask, the vindex method keeps the output chunk sizes consistent with the input. I think this can negatively affect operations like reindex, reindex_like, and align on the Xarray side: for example, if we align a DataArray that is a small subset of another one, the number of chunks can increase drastically (see the example below for a clearer picture). So my question is whether there is a way to control this behavior to prevent the generation of huge graphs that hurt performance.
import dask.array as da
import xarray as xr

n = 10000
m = 10000
big = xr.DataArray(
    da.zeros(shape=(n, m), chunks=(30, 6000)),
    dims=["a", "b"],
    coords={"a": list(range(n)), "b": list(range(m))},
)

n = 10
m = 10
small = xr.DataArray(
    da.zeros(shape=(n, m), chunks=(10, 10)),
    dims=["a", "b"],
    coords={"a": list(range(n)), "b": list(range(m))},
)

# This is going to generate 1,000,000 chunks
print(small.reindex_like(big).chunksizes)
print(xr.align(big, small, join="outer")[1].chunksizes)
Describe the solution you'd like
I propose that Xarray handle the alignment of chunks in a more "sophisticated" way (I am not sure whether this would be better on the Dask side, but Dask does not handle indexes the way Xarray does). One option is a heuristic that decides the "ideal chunks" of the output: for example, take the biggest chunk across all the input arrays as the target, and pad the small arrays with artificial data before reindexing (probably using the pad method would be ideal) so that they are at least the size of the "ideal chunk". This would guarantee that the operation generates far fewer chunks, which would make performance considerably better.
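The heuristic above could be sketched roughly as follows. `reindex_with_padding` is a hypothetical helper, not an existing Xarray API; the choice of "ideal chunk" (the biggest chunk of the target per dimension) and the label-union step are assumptions about how the padding could work:

```python
import dask.array as da
import xarray as xr


def reindex_with_padding(small, target, fill_value=float("nan")):
    """Hypothetical sketch: grow each dimension of `small` to at least
    one "ideal" chunk of `target` before the final reindex, so the
    output graph is built from a few large chunks instead of a huge
    number of tiny ones."""
    padded = small
    for dim in target.dims:
        ideal = max(target.chunksizes[dim])  # biggest chunk as the "ideal" size
        if padded.sizes[dim] < ideal:
            # Borrow labels from the target so the padded region carries
            # valid index values; the union keeps all of `small`'s labels.
            labels = target.indexes[dim][:ideal].union(padded.indexes[dim])
            padded = padded.reindex({dim: labels}, fill_value=fill_value)
            # Collapse the padded dimension into a single large chunk.
            padded = padded.chunk({dim: padded.sizes[dim]})
    return padded.reindex_like(target, fill_value=fill_value)
```

With the example above, the small array would first be grown to a single (30, 6000)-sized chunk, so the final reindex should produce on the order of hundreds of chunks rather than a million.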
Describe alternatives you've considered
No response
Additional context
No response