Improve parallelization of IO for readers
- xradar version: 0.1.0
Description
We should take a look at how we can speed up the xarray backends, and whether there are more levels of parallelization possible.
I wonder if the upstream xarray enhancements in https://github.com/pydata/xarray/pull/7437 might help with this, enabling us to plug into the IO directly and benefit from more parallelization here.
What I Did
I read the data with the following code:
import xarray as xr
import xradar as xd
import numpy as np


def fix_angle(ds):
    """
    Aligns the radar volumes
    """
    ds["time"] = ds.time.load()  # Convert time from dask to numpy
    start_ang = 0  # Set consistent start/end values
    stop_ang = 360
    # Find the median angle resolution
    angle_res = 0.5
    # Determine whether the radar is spinning clockwise or counterclockwise
    median_diff = ds.azimuth.diff("time").median()
    ascending = median_diff > 0
    direction = 1 if ascending else -1
    # first find exact duplicates and remove
    ds = xd.util.remove_duplicate_rays(ds)
    # second reindex according to retrieved parameters
    ds = xd.util.reindex_angle(
        ds, start_ang, stop_ang, angle_res, direction, method="nearest"
    )
    ds = ds.expand_dims("volume_time")  # Expand for volumes for concatenation
    ds["volume_time"] = [np.nanmin(ds.time.values)]
    return ds


# files: list of radar file paths to open
ds = xr.open_mfdataset(
    files,
    preprocess=fix_angle,
    engine="cfradial1",
    group="sweep_0",
    concat_dim="volume_time",
    combine="nested",
    chunks={"range": 250},
)
This resulted in the task graph shown below, where green is the open_dataset function.

The graph has quite a bit of whitespace and could use some optimization.
I've tried to reproduce this with some of our ODIM_H5 data, with a similar outcome.
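One experiment worth trying on top of this (not what produced the screenshots above) is to hand the open/preprocess step to a dask cluster and capture the task stream for comparison. A minimal sketch, assuming a local dask.distributed cluster and the same files list and fix_angle function from above:

from dask.distributed import Client, performance_report

client = Client()  # local cluster; open/preprocess calls run as tasks on workers

with performance_report(filename="open_mfdataset-report.html"):
    ds = xr.open_mfdataset(
        files,
        preprocess=fix_angle,
        engine="cfradial1",
        group="sweep_0",
        concat_dim="volume_time",
        combine="nested",
        chunks={"range": 250},
        parallel=True,  # wrap open_dataset/preprocess in dask.delayed
    )
    ds.compute()

With parallel=True, each open_dataset/preprocess pair runs as a dask.delayed task, so the file handling itself is distributed across workers instead of running serially on the client.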

@mgrover1 I've tried to track this down using some GAMIC source data from our BoXPol radar.
In the normal case I get the whitespace shown above in the task graph.
If I remove the additional lines in the GAMIC open_dataset function after the call to store_entrypoint.open_dataset:
https://github.com/openradar/xradar/blob/02b2d92248e913f624352ad4692934a25484723e/xradar/io/backends/gamic.py#L468
the call to open_mfdataset returns without triggering any dask operations.
Only when I call .load(), .compute(), or otherwise trigger a computation (e.g. plotting) are the files accessed and the data loaded and processed.
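To turn that observation into something testable, one could count the tasks that execute while opening. A rough sketch built on dask's local-scheduler callbacks (TaskCounter is a hypothetical helper, and the engine/group arguments and the DBZH moment name are placeholders rather than confirmed xradar API):

from dask.callbacks import Callback

class TaskCounter(Callback):
    """Count tasks executed by dask's local (threaded/sync) schedulers."""

    def __init__(self):
        self.count = 0

    def _posttask(self, key, result, dsk, state, worker_id):
        self.count += 1

counter = TaskCounter()
with counter:
    lazy = xr.open_mfdataset(
        files,
        engine="gamic",
        group="sweep_0",
        combine="nested",
        concat_dim="volume_time",
        chunks={"range": 250},
    )
print("tasks run during open:", counter.count)  # ideally 0 for a fully lazy backend

counter.count = 0
with counter:
    lazy["DBZH"].mean().compute()  # now the files should actually be read
print("tasks run during compute:", counter.count)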
Triggering the computation leads to the task graphs shown below:
One timestep, single moment of 15 (time: 12, azimuth: 360, range: 750):

All timesteps, single moment of 15 (time: 12, azimuth: 360, range: 750):

Computing the whole thing:

So, as a consequence, we might need to make sure that no immediate dask computations are triggered before we actually do something with the data. Would it make sense to create a test repo for that?
Yeah, let's create a test repo to try this out - this is promising! We can look into more testing and establish some benchmarks to dig in here.
Maybe xradar-benchmark?
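If we go that route, an asv-style suite could start small, e.g. timing a lazy open against a full load per backend. A sketch under assumed file paths and engine/group names, not a committed layout:

import glob
import xarray as xr

class IOSuite:
    """asv-style benchmarks for the xarray backends."""

    def setup(self):
        # placeholder: a directory of sample volumes shipped with the benchmark repo
        self.files = sorted(glob.glob("data/cfradial1/*.nc"))

    def time_open_mfdataset_lazy(self):
        # should stay cheap if the backend triggers no eager dask computation
        xr.open_mfdataset(
            self.files,
            engine="cfradial1",
            group="sweep_0",
            combine="nested",
            concat_dim="volume_time",
            chunks={"range": 250},
        )

    def time_open_mfdataset_load(self):
        # end-to-end: open the volumes and actually read the data
        xr.open_mfdataset(
            self.files,
            engine="cfradial1",
            group="sweep_0",
            combine="nested",
            concat_dim="volume_time",
            chunks={"range": 250},
        ).load()

Comparing the two timings per backend would show directly how much work currently leaks into the open step.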