xarray
Implement __sizeof__ on objects?
Is your feature request related to a problem? Please describe.
Currently ds.nbytes returns the size of the data.
But sys.getsizeof(ds) returns a very small number.
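To make the gap concrete, here is a minimal stand-in (not xarray's actual implementation) for a container whose payload is large but whose Python object is tiny:

```python
import sys

class Container:
    """Minimal stand-in for a Dataset: a small object holding a large payload."""
    def __init__(self, data):
        self.data = data

    @property
    def nbytes(self):
        return len(self.data)

c = Container(b"\x00" * 1_000_000)
print(c.nbytes)          # 1000000: size of the underlying data
print(sys.getsizeof(c))  # tiny: only counts the container object itself
```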
Describe the solution you'd like
If we implement __sizeof__ on DataArrays & Datasets, this would work.
I think that would be something like ds.nbytes plus the size of the ds container, plus maybe attrs if those aren't already covered by .nbytes?
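As a rough sketch of that idea (using a stand-in class, since none of this is implemented in xarray): sys.getsizeof(obj) calls obj.__sizeof__() under the hood, so defining the dunder is enough to change the reported number:

```python
import sys

class SizedContainer:
    """Stand-in for a Dataset; assume .nbytes reports the data payload."""
    def __init__(self, nbytes):
        self.nbytes = nbytes

    def __sizeof__(self):
        # payload size plus the bare object overhead; attrs and index
        # structures would need to be added here if .nbytes omits them
        return self.nbytes + object.__sizeof__(self)

ds = SizedContainer(nbytes=8_000_000)
print(sys.getsizeof(ds))  # now roughly 8_000_000 instead of a few dozen bytes
```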
It seems like the consensus from https://bugs.python.org/issue15436 is that only C extension types should implement __sizeof__.
Mmm, for better or worse, Dask relies on sizeof to estimate the memory usage of objects at runtime. We could move that over to some new duck-typed interface, like using .nbytes when it's present, but not all objects will want to expose an nbytes attribute in their API.
IMO, the best path is for objects to implement __sizeof__, unless there's some downside I'm missing.
I don't love going against the guidance from Python core developers. My gut is that a Dask-specific protocol may be safer. That said, if Dask is the only library using sys.getsizeof() for some real purpose, then perhaps this is safe enough.
There's still some ambiguity to me about exactly what should be included in "size of" (e.g., do we include lazy values or not?) but we can probably figure that out. I suspect Xarray's implementation would need to be recursive in some way, to handle nested Dask or lazy arrays.
Should we close?
I would have voted for having __sizeof__ return our current .nbytes: while it's not perfect, sys.getsizeof already returns a number at the moment, and that number is completely wrong.
I wasn't sure whether the consensus at https://bugs.python.org/issue15436 was "non-C-extension-types shouldn't implement __sizeof__", vs. "C-extension-types should implement __sizeof__ but for other types it's not that helpful" ...and then maybe for dask it is helpful so it makes sense for us to do it.
But I don't have a strong view, happy to close if we've decided.
I'll reluctantly close, given the weight I put on @shoyer's view...
I just stumbled over this via https://github.com/dask/distributed/issues/5383 and wanted to add that there is also a way to tell dask what the size of an object is without overloading __sizeof__, e.g.
```python
from dask.sizeof import sizeof


class MyCustomDataClass:
    def __init__(self, nbytes):
        self.nbytes = nbytes


@sizeof.register(MyCustomDataClass)
def sizeof_my_custom_data_class(obj):
    # Do whatever you want here as long as it returns an integer
    return obj.nbytes
```
I stumbled over this in https://github.com/pydata/xarray/issues/9088 again
Given that...
- There appears to be some reluctance to implement a dunder method
- apparently all dask imports in xarray are lazy, so implementing the sizeof dispatch might be awkward
I assume this is something that should be implemented in dask?
Looks like we can add an entrypoint: https://docs.dask.org/en/stable/how-to/extend-sizeof.html#extend-sizeof? That seems like it should work with lazy imports?
xarray is important enough that we can put this into dask itself. We can handle lazy registration there as well, and we're already doing this for other important libraries. The data structures in xarray are a bit complex, which is why this stuff living in xarray would make sense, but if we needed to use entrypoints, that wouldn't be worth it.
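For illustration, lazy registration along these lines could look like the sketch below. `Dispatch` here is a toy stand-in for dask's dispatch mechanism (which likewise offers register and register_lazy), and the `Dataset` class is fake, so none of this is dask's or xarray's actual code:

```python
class Dispatch:
    """Toy stand-in for dask's sizeof dispatch."""
    def __init__(self):
        self._lookup = {}
        self._lazy = {}

    def register(self, cls):
        def wrapper(func):
            self._lookup[cls] = func
            return func
        return wrapper

    def register_lazy(self, toplevel_module):
        # defer real registration until an object from that module shows up
        def wrapper(func):
            self._lazy[toplevel_module] = func
            return func
        return wrapper

    def __call__(self, obj):
        toplevel = type(obj).__module__.split(".")[0]
        if toplevel in self._lazy:
            self._lazy.pop(toplevel)()  # run the deferred registration once
        return self._lookup[type(obj)](obj)

sizeof = Dispatch()

class Dataset:
    """Fake xarray.Dataset."""
    def __init__(self, nbytes):
        self.nbytes = nbytes

@sizeof.register_lazy(Dataset.__module__.split(".")[0])
def register_dataset_sizeof():
    # in the real thing, `import xarray` would happen here, keeping it lazy
    @sizeof.register(Dataset)
    def sizeof_dataset(ds):
        return ds.nbytes

print(sizeof(Dataset(8_000_000)))  # 8000000
```

The point of the lazy hook is that the registration function, and hence the xarray import, only runs the first time an object from that module is actually measured.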
Thanks @fjetter
Can you post the example you're using to debug this?
There's a conceptual complexity as pointed out above:
> There's still some ambiguity to me about exactly what should be included in "size of" (e.g., do we include lazy values or not?) but we can probably figure that out. I suspect Xarray's implementation would need to be recursive in some way, to handle nested Dask or lazy arrays.
What does dask use this size for? Does it want the size in memory when all buffers are loaded? Or just the in-memory of the Xarray object at present (regardless of whether buffers are loaded or not)? An analog would be a zarr array for example, which is tiny but could become a very large array when loaded in to memory.
https://github.com/dask/dask/pull/11166 uses .nbytes, so it's reporting the in-memory size after all lazy arrays are loaded. It's also inaccurate for DataArrays, sadly, but we can fix that once we know what you want.
> What does dask use this size for? Does it want the size in memory when all buffers are loaded? Or just the in-memory size of the Xarray object at present (regardless of whether buffers are loaded or not)?
Dask wants to know what is currently in memory. We're using this mostly for scheduler heuristics, e.g. when submitting tasks or writing intermediate results to disk. For most practical applications the difference will likely not matter much. Accuracy is also not incredibly important: the more accurate, the better, but as long as the rough order of magnitude is right, that's good enough for us.
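Given that answer, a "currently in memory" estimate would skip lazy variables. A hedged sketch, using fake variable objects (the `is_lazy` flag and this layout are assumptions for illustration, not xarray's internals):

```python
import sys

class FakeVariable:
    """Stand-in for an xarray variable; real code would inspect var.data."""
    def __init__(self, nbytes, is_lazy):
        self.nbytes = nbytes
        self.is_lazy = is_lazy

def in_memory_size(variables):
    total = 0
    for var in variables:
        if var.is_lazy:
            # a dask/zarr-backed variable occupies almost no RAM yet,
            # so count only the small wrapper object
            total += sys.getsizeof(var)
        else:
            total += var.nbytes
    return total

variables = [
    FakeVariable(8_000_000, is_lazy=False),      # loaded, numpy-backed data
    FakeVariable(80_000_000_000, is_lazy=True),  # huge on disk, not loaded
]
print(in_memory_size(variables))  # roughly 8_000_000, not 80 GB
```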
> https://github.com/dask/dask/pull/11166 uses .nbytes, so it's reporting the in-memory size after all lazy arrays are loaded. It's also inaccurate for DataArrays, sadly, but we can fix that once we know what you want.
Yeah, this doesn't sound like what dask wants. We'll likely have to do something more complex then. In an earlier version of that PR I was just estimating sizes of a couple of (internal) attributes but that felt wrong. I'll open another PR with more tests and ping you for feedback.