Support distribution of `xarray`
See https://docs.xarray.dev/en/stable/
If I understand correctly, an xarray object is made up of the actual data array (`np.ndarray`) and ~~1-D~~ coordinate arrays (dictionaries?) that map data dimensions and indices to meaningful physical quantities.
For example, if the xarray is a matrix of coordinates (date, temperature), users will be able to perform:

```python
mean_temp = xarray['2010_01_01', '2010_12_31'].mean()
```
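(For reference, in actual xarray syntax that selection would look roughly like the following, assuming `da` is a `DataArray` with a `date` coordinate:)

```python
# Label-based selection plus reduction, current xarray API:
mean_temp = da.sel(date=slice("2010-01-01", "2010-12-31")).mean()
```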
Feature functionality
Enable distribution of the xarray object: allow named dimensions and keep track of coordinate arrays, one of which will be distributed.
Example:

```python
ht_xarray = ht.array(xarray, split="date")
ht_mean_temp = ht_xarray['2010_01_01', '2010_12_31'].mean()  # memory-distr. operation
```
Check out PyTorch's named tensors functionality.
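(A minimal illustration of named tensors, which are a prototype feature in PyTorch:)

```python
import torch

t = torch.randn(365, 4, names=("date", "station"))  # label the dimensions
daily_mean = t.mean(dim="date")                      # reduce by name
```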
Additional context
Initiating collaboration with N. Koldunov (@koldunovn) at the Alfred Wegener Institute (Helmholtz Centre for Polar and Marine Research). Also interesting for @kleinert-f, @ben-bou. Tagging @bhagemeier for help with implementation.
Coordinates are not necessarily 1D arrays. For example, for curvilinear grids covering the Earth's surface one would have to describe the positions of grid points (e.g. their centers) as 2D arrays (one for `lon` and one for `lat`), and the coordinate arrays then will be 2D.
Got it, thanks @koldunovn, edited original post.
Implementation discussed during devs meeting Sept 28. Preliminary scheme:
- introduce a `dndarray.coordinates` attribute (default=`None`)
- `ht.array()` to accept `xarray` input, interpret string `split`, populate `dndarray.coordinates`, chunk data AND split coordinate
- every operation needs to check `if not coordinates:` or similar (a minimal sketch follows below); quantify and minimize the performance hit
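A minimal sketch of what that per-operation guard could look like; `_local_mean` and `_reduce_coords` are hypothetical placeholders, not existing Heat functions:

```python
# Sketch only: every operation guards on the (default=None) coordinates
# attribute, so plain DNDarrays pay just one cheap check per call.
def mean(dndarray, axis=None):
    result = _local_mean(dndarray, axis)  # existing DNDarray reduction path
    if not dndarray.coordinates:          # no coordinates: nothing more to do
        return result
    result.coordinates = _reduce_coords(dndarray.coordinates, axis)
    return result
```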
@Markus-Goetz it would be good if you chimed in
tagging @Mystic-Slice as they have shown interest in working on this :wave:
Interesting discussion over at array-api about single-node parallelism, but xarray is also mentioned in a distributed execution context.
@TomNicholas you might be interested in this effort as well.
Hi everyone! I have been thinking about this project for some time now and here are my initial thoughts.
DXarray:
Xarray uses `np.ndarray` for its data storage. And since numpy does not support computation on GPUs, xarray objects cannot be directly chunked and stored within the new `DXarray` object.
There are two solutions:
- Using CuPy arrays within xarray (see cupy-xarray).
  - As far as I know, this should be very simple and straightforward to use.
  - This sounds like the best way to get most, if not all, of xarray's functionality without having to re-implement it.
  - My only problem with CuPy is that it is strictly GPU-based. It will limit flexibility if we don't switch between numpy and cupy throughout the code. (But maybe cupy-xarray handles that internally and we don't have to worry about that?)
- Implement our own version of `xarray` using `torch.Tensor`s.
  - Except for a few functionalities (like array broadcasting, grouping and data alignment), everything else should be easy enough to implement by just translating the labels to the corresponding indices and redirecting to the existing `DNDarray` methods (see the sketch after the example below).
  - Example:

```python
>>> arr = xr.DataArray(
...     data=np.arange(48).reshape(4, 2, 6),
...     dims=("u", "v", "time"),
...     coords={
...         "u": [-3.2, 2.1, 5.3, 6.5],
...         "v": [-1, 2.6],
...         "time": pd.date_range("2009-01-05", periods=6, freq="M"),
...     },
... )

# Single select
>>> arr.sel(u=5.3, time="2009-04-30")
# array([27, 33])
# Translated to
>>> arr[2, :, 3]
# array([27, 33])

# Multi select
>>> arr.sel(u=[-3.2, 6.5], time=slice("2009-02-28", "2009-03-31"))
# array([[[ 1,  2],
#         [ 7,  8]],
#        [[37, 38],
#         [43, 44]]])
# Translated to
>>> arr[[0, 3], :, slice(1, 2 + 1)]
# array([[[ 1,  2],
#         [ 7,  8]],
#        [[37, 38],
#         [43, 44]]])
```

  - But this means we will have to implement any new feature/functionality that is released by Xarray in the future.
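For what it's worth, here is a minimal sketch of the label-to-index translation for a 1-D coordinate (illustrative only; `label_to_index` is a hypothetical helper, not existing Heat or xarray code):

```python
import numpy as np

def label_to_index(coord_values, label):
    # Map a single coordinate label to its positional index.
    matches = np.nonzero(np.asarray(coord_values) == label)[0]
    if matches.size == 0:
        raise KeyError(f"label {label!r} not found in coordinate")
    return int(matches[0])

label_to_index([-3.2, 2.1, 5.3, 6.5], 5.3)  # -> 2, as in arr[2, :, 3] above
```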
I would love to know what you all think. Any suggestion is highly appreciated!
Hi everyone, thanks for tagging me here!
I'm a bit unclear what you would like to achieve here - are you talking about:
(a) making `DNDArray` wrap xarray objects?
(b) wrapping `DNDArray` inside xarray objects? (related Q: does `DNDArray` obey the Python array API standard?)
(c) wrapping `DNDArray` inside xarray objects but also with an understanding of `chunks`? (similar to dask - this is what the issue you linked to is discussing @ClaudiaComito)
(d) just creating `DNDArray` objects from xarray objects?
Some miscellaneous comments:
> It will limit flexibility if we don't switch between numpy and cupy throughout the code.

This is the point of the `get_array_module` pattern - you should then not have to have separate code paths for working with GPU vs CPU data.
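For illustration, a minimal sketch of that pattern using CuPy's `cupy.get_array_module` (the `normalize` function is just an example):

```python
import cupy as cp

def normalize(x):
    # Returns the numpy module for CPU arrays and cupy for GPU arrays,
    # so a single code path serves both kinds of data.
    xp = cp.get_array_module(x)
    return (x - xp.mean(x)) / xp.std(x)
```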
> Implement our own version of xarray using `torch.Tensor`s

There have been some discussions of wrapping pytorch tensors in xarray. We also have an ongoing project to publicly expose `xarray.Variable`, a lower-level abstraction better suited for use as a building block of efforts to make named tensors like `torch.Tensor`. In a better-coordinated world this would have already happened, and pytorch would now have an optional dependency on xarray. :man_shrugging:
Hi @TomNicholas! We want to create a Distributed-Xarray. This should be able to be chunked and distributed to different processes. Most users would also prefer to use GPU-based computation. So, I guess we are trying to achieve option (a) or create a new distributed data structure of our own that mimics Xarray (sounds like a lot of work😅).
This should allow users to manipulate huge amounts of data while also being able to work with the more-intuitive Xarray API.
@ClaudiaComito should be able to shed more light on this. Thanks for taking the time to speak with us!
Would the experience of xarray with Dask help with creating the data structure you want? Also, there are implementations with GPU support: https://xarray.dev/blog/xarray-kvikio
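(For comparison, chunked execution with the existing xarray + Dask stack looks roughly like this; `data.nc` is a placeholder dataset with a `time` dimension:)

```python
import xarray as xr

ds = xr.open_dataset("data.nc", chunks={"time": 100})  # dask-backed arrays
monthly_mean = ds.groupby("time.month").mean()         # lazy graph only
result = monthly_mean.compute()                        # parallel execution
```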
Meanwhile I have implemented a basic idea of `DXarray` in the corresponding branch. In the near future I plan to go on with:

- `DXarray.to_xarray()`: convert a distributed `DXarray` to a `xarray.DataArray` (similar to `DNDarray.to_numpy()`); a rough sketch of such a conversion follows after this list
- `resplit` for `DXarray`
- `from_xarray`: convert a non-distributed `xarray.DataArray` to a distributed `DXarray`
- parallel I/O for `DXarray` (depending on how this can be done for an `xarray`-like structure)
- arithmetic and statistical operations that act on the value array only (such as mean, etc.)
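A minimal sketch of what `DXarray.to_xarray()` could look like, assuming hypothetical `values`, `dims`, `coords` and `name` attributes on `DXarray` (none of this is settled Heat API):

```python
import xarray as xr

def to_xarray(dx):
    # Gather the distributed value array to every process and rebuild
    # a plain, process-local xarray.DataArray from dims/coords/name.
    return xr.DataArray(
        data=dx.values.numpy(),  # DNDarray.numpy() gathers the global array
        dims=dx.dims,
        coords={dim: coord.numpy() for dim, coord in dx.coords.items()},
        name=dx.name,
    )
```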
Things that might get a bit more complicated:
- pandas DataFrames as coordinates
- masked value arrays
- operations that involve both values and coordinates and thus cannot be copied directly from the `DNDarray` routines applied to the value array
@ClaudiaComito in #1154 it seems you started wrapping heat objects inside xarray, which is awesome! I recently improved the documentation on wrapping numpy-like arrays with xarray objects (https://github.com/pydata/xarray/pull/7911 and https://github.com/pydata/xarray/pull/7951).
Those extra pages in the docs aren't released yet, but for now you can view them here (on wrapping numpy-like arrays) and here (on wrapping distributed numpy-like arrays).
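For illustration only, wrapping a numpy-like duck array in xarray would look roughly like this; whether `DNDarray` implements enough of the numpy interface for this to work is exactly what the wrapping approach has to establish:

```python
import xarray as xr
import heat as ht

# Hand xarray a duck array and let xarray dispatch to its methods;
# this requires DNDarray to satisfy enough of the numpy array API.
data = ht.arange(12, split=0).reshape((3, 4))
da = xr.DataArray(data, dims=("x", "y"), coords={"x": [10, 20, 30]})
```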
In the branch 1031-support-distribution-of-xarray there is now available:

- a class `DXarray` implementing something similar to `xarray` but using `DNDarray`s as underlying objects
- balance and resplit for `DXarray`
- printing and conversion to and from `xarray.DataArray`s

(still missing: unit tests...)
I will stop here until we have discussed in the team the "wrapping approach" proposed by @TomNicholas, because such an approach would be much easier to implement (if applicable to Heat).
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 60 days since being marked as stale.