icepyx
icepyx copied to clipboard
H5Coro testing and integration
Opening this to start discussion and development on the H5coro integration piece. Based on discussions with Rachel and Eric, we have several options for implementing h5coro into icepyx. However, it might require significant restructuring of the Read module in order to support the h5coro backend and (potentially) Xarray DataTrees.
:point_left: Launch a binder notebook on this branch for commit 245fcace5ece06332cd8b88e8db8626e6b612c0a
I will automatically update this comment whenever this PR is modified
:point_left: Launch a binder notebook on this branch for commit 83ae9a43953b3904f83165d32802a5419559bc66
:point_left: Launch a binder notebook on this branch for commit 237871a72f3d4a38acd4c595b047757e4fbc0533
:point_left: Launch a binder notebook on this branch for commit 49ab30cce40011fd3b4a6e7ca7f3b91d7010a966
:point_left: Launch a binder notebook on this branch for commit 664d5173fe6923c0522a20e6dbbc41b71ca01dd7
Exciting to see the progress, thanks for posting @zachghiaccio!
One question - does the Quest module have to change in order to use h5coro? I see several modified Quest files, which surprised me because I wouldn't have thought they'd need to change.
I am noticing that there is one commit on this branch from September 2023. Maybe that is the source of the Quest file modifications? (and perhaps also the merge conflicts?)
Yeah I'm a bit confused on those Quest changes as well, because I don't think there's anything in Quest right now that requires h5coro. The changes in said commit are already implemented into icepyx, so I must have forgotten to pull those changes.
Hey everyone! I played around with this a bit this afternoon. @zachghiaccio I removed everything from this commit that was related to quest, since it sounds like the h5coro work isn't related to the quest activitiy.
In terms of h5coro I'm starting to develop a short list of things that need to be done to allow for h5coro to be added to icepyx:
- specify
engine=h5coroin thexr.open_dataset()calls (@zachghiaccio did this one already) - h5coro doesn't want to read an fsspec object, it wants a string to an s3 filepath (without the
s3://at the beginning) - because of the previous point h5coro also needs the s3 access token to be specified using the
credentialsargument - implement
phony_dimskwarg in h5coro
The first 3 pieces above can be completed for the moment pretty easily (I made these changes in a fast way in the most recent commit). The bigger bump is point no. 4. Icepyx relies on the phony_dims: access kwarg on the xr.open_dataset() function, which isn't implemented in h5coro. I don't understand well what this kwarg does, but it is appears to be related to smushing together datasets of differing dimensions. If you try to read data in icepyx without that kwarg an error is returned:
ValueError: conflicting sizes for dimension 'crossing_time': length 1 on 'cycle_number' and length 15 on {'crossing_time': 'bounding_polygon_lat1'}
While I didn't get all the way to the end of implementing h5coro in icepyx, I did notice the reoccuring theme:
the amount that h5coro can speed up data reading in icepyx is limited by the number of times that icepyx splits up the reads
https://github.com/icesat2py/icepyx/discussions/676 goes into more depth on some profiling to find that reading a single beam of data creates three separate read requests (three groups). As a parallelization library h5coro would provide the most benefit if we could consolidate our reads into one and let h5coro do the work of splitting that up. All interesting points to consider in the broader conversation about what the Read module in icepyx should look like.
Sorry for the delayed response, but thanks for looking into all of this @rwegener2 ! I'm juggling two science team meetings today, but I'll finally have a bit of time this afternoon to check out your additions, and maybe look into the phony_dims kwarg, since I want to understand that better as well.