
BackendEntrypoint bug chunking data

Open timvgl opened this issue 1 year ago • 5 comments

What happened?

Hi together, I have written a class for the BackendEntrypoint. Since the data can be quite large, large parts of it use the dask.delayed function. As one of the last steps I redefine the chunks. The code loads single files and concatenates them into an xarray dataset, so it makes sense to chunk per file. However, this chunking is not forwarded to the main program where the class is used:

    dset = dset.chunk(dict(zip(dims, list(reversed([list(reversed(list(shape)))[i] if i < 4 else 1 for i in range(len(list(shape)))])))))
    return dset

https://github.com/timvgl/mumaxXR/blob/daa4345b482197b65a7bd88df1552d9fa48b8bf7/src/mumaxXR/OvfEngine.py#L356-L357

Between dset = ... and return dset, dset.chunksizes gives the expected values. However, in the main program the chunking is not available anymore and dset.chunksizes gives Frozen({}). Thanks for your help! Tim

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • [ ] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [ ] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [ ] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [ ] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [ ] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

xarray version: 2023.1.0 Linux

timvgl avatar Mar 06 '24 16:03 timvgl

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Mar 06 '24 16:03 welcome[bot]

Thanks for the issue, but we really need a more focused example to make progress here.

max-sixty avatar Mar 06 '24 19:03 max-sixty

I agree that debugging the somewhat complex piece of code you linked to is a bit much to ask for an issue (especially given that we're already somewhat behind in answering some issues).

However, glancing over the code, I would point you to the preferred chunksizes section in the docs, in particular this sentence:

Note that the backend is not directly involved in Dask chunking, because Xarray internally manages chunking.

In other words, your backend is supposed to use the lazy indexing classes mentioned in that document, and you will get the dask handling for free. If that doesn't work for you (for whatever reason), remember that having a function separate from xarray's backend machinery is totally fine, as well.
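A minimal sketch of the pattern described above, based on the lazy indexing classes from xarray's internals (the class name `LazyDemoArray` and the in-memory numpy array standing in for an on-disk source are illustrative, not part of any real backend):

```python
import numpy as np
import xarray as xr
from xarray.backends import BackendArray
from xarray.core import indexing


class LazyDemoArray(BackendArray):
    """Minimal lazy array: xarray calls __getitem__ only when data is
    actually needed, and applies any dask chunking itself, outside the
    backend."""

    def __init__(self, array):
        self.array = array  # stands in for an on-disk data source
        self.shape = array.shape
        self.dtype = array.dtype

    def __getitem__(self, key):
        # Translate xarray's explicit indexers into plain tuple indexing.
        return indexing.explicit_indexing_adapter(
            key, self.shape, indexing.IndexingSupport.BASIC, self._raw_indexing
        )

    def _raw_indexing(self, key):
        return self.array[key]


backend_array = LazyDemoArray(np.arange(12).reshape(3, 4))
lazy = indexing.LazilyIndexedArray(backend_array)
var = xr.Variable(("x", "y"), lazy)
ds = xr.Dataset({"v": var})
print(ds["v"].values.sum())  # data is only read here, on access
```

A backend's `open_dataset` would return variables wrapped this way and leave all `chunk()` calls to the caller (or to the `chunks=` argument of `xr.open_dataset`).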

keewis avatar Mar 06 '24 20:03 keewis

Of course, debugging that code is not what I wanted you to do - you guys have other things to do. Thanks for the hint with the lazy indexing classes, I am going to check that out. I wrote a shorter example to show that this is a more general problem. I am not sure whether chunking within the backend, using the chunking mechanism provided by xarray, is supposed to be supported.

import xarray as xr

class TestEngine(xr.backends.BackendEntrypoint):
    def open_dataset(
        self,
        filename_or_obj,
        *,
        drop_variables=None,
    ):
        dataset = xr.open_dataset(filename_or_obj)
        # chunk every dimension to size 1 inside the backend
        chunk_dict = {dim: 1 for dim in dataset.dims}
        dataset = dataset.chunk(chunk_dict)
        print(dataset.chunksizes)
        return dataset

# path_nc_file: any existing netCDF file
dataset = xr.open_dataset(path_nc_file, engine=TestEngine)
print(dataset.chunksizes)

In this case, chunking from xarray in the backend is not forwarded. This is a decision you just have to make. If this is not wanted, this issue can be closed from my side.

timvgl avatar Mar 06 '24 20:03 timvgl

Thanks for the minimal example. That reduces this issue to just "Should we allow backends to return chunked datasets?", which is much easier to discuss.

Back when we introduced the backends the current behavior was intentional, but that doesn't mean we can't change it (we'll first have to figure out whether that's a good idea, though).

keewis avatar Mar 06 '24 21:03 keewis

The code is loading single files and concatenates them into a xarray.

Why don't you use open_mfdataset instead? That's xarray's recommended way of opening multiple files.

There's a nice tutorial of how to set up lazy handling of backends here: https://tutorial.xarray.dev/advanced/backends/2.Backend_with_Lazy_Loading.html
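For reference, a hedged sketch of the open_mfdataset approach suggested above, wrapped in a convenience function so that arguments that are always the same do not have to be repeated by the caller (the function name `open_simulation`, the `t` concatenation dimension, and the scipy engine are assumptions for this demo; it also requires dask and scipy to be installed):

```python
import os
import tempfile

import numpy as np
import xarray as xr


def open_simulation(paths):
    # Hypothetical wrapper: fixes the open_mfdataset arguments that are
    # always the same for this use case.
    return xr.open_mfdataset(
        paths,
        engine="scipy",       # matches the netCDF3 files written below
        combine="nested",
        concat_dim="t",       # each file holds one time step
    )


# Create two tiny single-time-step files to demonstrate.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(2):
    ds = xr.Dataset({"v": (("t", "x"), np.full((1, 3), i))})
    path = os.path.join(tmpdir, f"step_{i}.nc")
    ds.to_netcdf(path, engine="scipy")
    paths.append(path)

combined = open_simulation(paths)
print(combined.sizes)  # "t" is the concatenated dimension
```

open_mfdataset returns dask-backed variables, so the per-file chunking concern from the original report is handled by xarray itself here.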

Illviljan avatar Mar 18 '24 17:03 Illviljan

Yeah, I have already updated the code to use the lazy loading structure. I did use the mf function in the past. It is just more convenient if the user doesn't have to define args that are always the same (like parallel=True, concat_dim='t', etc.). There are also some other reasons for using the original function instead of mf, but that would go too far here.

timvgl avatar Mar 18 '24 17:03 timvgl