xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Clarify difference between `.load()` and `.compute()`

Open jsignell opened this issue 1 year ago • 8 comments

What is your issue?

I just realized that the difference between .load() and .compute() is that .load() operates inplace and .compute() returns a new xarray object.I have 2 suggestions for how this could be clearer:

  1. Docs: the API docs for each method could reference the other.
  2. Code: this might be too big a change, but maybe .load() should not return anything. Consider this example from pandas:
    import pandas as pd
    
    df = pd.DataFrame({"air": []})
    df.rename({"air": "foo"}, axis=1, inplace=True)
    # returns None since df is renamed inplace
    
    this matches the behavior of inplace actions in Python itself like list.append or dict.update. This would be a major breaking change though, and it might be easier to just remove .load() entirely.

jsignell avatar Jul 27 '22 14:07 jsignell

Yeah, deprecating return values is not fun. We somewhat recently did that for Dataset.update and found that we could only mention this in the documentation / release notes, which is not great for visibility.

Interestingly, the docstring of Dataset.compute and Dataset.load compare with each other, while for the methods on DataArray and Variable that comparison is missing. None of them actually link to the other method, though.

Given that we have been removing inplace operations as much as possible, I'd vote to remove load entirely or make it an alias of compute.

Edit: up until I read this issue, I somehow assumed compute would only work with dask while load would also load our lazy array implementation into memory. Not sure how I got that impression, but maybe that's another argument to remove / align load?

keewis avatar Jul 27 '22 14:07 keewis

Thanks for engaging with this! I think removing load probably makes the most sense. Making load an alias for compute would just continue the confusion since there wouldn't be a clear right way to do things.

I somehow assumed compute would only work with dask while load would also load our lazy array implementation into memory. Not sure how I got that impression, but maybe that's another argument to remove / align load?

I'm not quite sure I follow. Are there lazy array implementations other than dask in xarray?

jsignell avatar Jul 27 '22 15:07 jsignell

Are there lazy array implementations other than dask in xarray?

No, sorry, I mistyped. What I was referring to is the lazy loading mechanism that will load the data into memory once you try to actually do something with it.

However, it might be possible to wrap other lazy array implementations if they implement either __array_ufunc__ and __array_function__, or __array_namespace__ (see #6807).

keewis avatar Jul 27 '22 15:07 keewis

👍🏾 for updating the docstring.

IME load is commonly used so I don't know that it's worth removing.

dcherian avatar Jul 27 '22 15:07 dcherian

I can imagine a scenario where load is de-emphasized in the narrative docs, and compute is used instead. It wouldn't have to be a formal deprecation, but it could be that compute is best practice.

jsignell avatar Jul 27 '22 17:07 jsignell

While there is overlap in the behavior here, I've always thought of these two methods as having distinct applications. .compute() is a Dask collection method. .load(), which predates Xarray's Dask integration, was originally meant to load data from our lazy loading backend arrays (e.g. a netCDF file).

Edit: up until I read this issue, I somehow assumed compute would only work with dask while load would also load our lazy array implementation into memory. Not sure how I got that impression, but maybe that's another argument to remove / align load?

This was my impression as well but now I understand that the primary difference is that load is inplace while compute is not.

jhamman avatar Jul 27 '22 18:07 jhamman

I should say that there might very well be other uses of load that I am not aware of. The inplace vs not is just the only difference that I noticed in the context of dask-backed xarray objects.

From this discussion I think for dask-backed objects we should recommend using compute. Maybe that is enough of a decision.

jsignell avatar Jul 27 '22 20:07 jsignell

note that the implementation of compute is currently:

def compute(self, **kwargs):
    new = self.copy(deep=False)
    return new.load(**kwargs)

which means that the inplace vs. "pure function" is literally the only difference (though I don't know whether doing a shallow copy has other implications).

keewis avatar Jul 27 '22 22:07 keewis