tobac Minimally invasive workflow for xarray transition

this is a draft PR to document our collaborative activities in the xarray to iris transition. The idea for a step-by-step transition to xarray is laid out in #143 . This is also a test for full collaborative development, in contrast to the earlier strategy with a single feature prepared by one developer. Let's see ....

I suggest:

that the people that work on a certain module indicate what they currently do and also plan to do in a comment to this draft PR, maybe making a checkbox list of already migrated functions within that module

that everybody runs testing locally with the standard tests and the notebooks from the example directory using

tobac> python  -m pytest

tobac> jupyter nbconvert --to notebook --inplace --execute examples/Example_*/*.ipynb

May 07 '23 08:05 fsenf

Codecov Report

Patch coverage: 100.00% and project coverage change: +1.49 :tada:

Comparison is base (a0fd1ff) 47.47% compared to head (7e9b328) 48.96%.

Additional details and impacted files

@@              Coverage Diff              @@
##           RC_v1.5.0     #275      +/-   ##
=============================================
+ Coverage      47.47%   48.96%   +1.49%     
=============================================
  Files             14       14              
  Lines           2711     2851     +140     
=============================================
+ Hits            1287     1396     +109     
- Misses          1424     1455      +31

Flag	Coverage Δ
unittests	`48.96% <100.00%> (+1.49%)`	:arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
tobac/feature_detection.py	`83.16% <100.00%> (+0.16%)`	:arrow_up:
tobac/testing.py	`96.51% <100.00%> (+0.57%)`	:arrow_up:

... and 2 files with indirect coverage changes

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

May 07 '23 08:05 codecov[bot]

I started work on feature_detection.py.

[ ] feature_detection_multithreshold
[x] feature_detection_multithreshold_timestep
[x] feature_detection_threshold
[x] feature_position
[x] filter_min_distance
[x] remove_parents
[x] test_overlap

May 07 '23 08:05 fsenf

notes: def feature_detection_threshold is using numpy arrays as input type -> this is inconsistent with our interface strategy -> however, I keep it for the moment.

May 07 '23 09:05 fsenf

OK, tests are in place for function feature_detection_multithreshold.

However, they fail because xarray input is not convert into the correct iris.Cube format. .coords is empty:

        if "time" not in [coord.name() for coord in field_in.coords()]:
>           raise ValueError(
                "input to feature detection step must include a dimension named 'time'"
            )
E           ValueError: input to feature detection step must include a dimension named 'time'

Two options:

correct the respective wrapper in utils
proceed and remove iris internals in feature_detection_multithreshold such that .coords() is not used anymore

I favor option 2.

@freemansw1 , @w-k-jones , @JuliaKukulies : What do you think?

May 07 '23 10:05 fsenf

I think this is failing because testing.make_dataset_from_arr doesn't actually set any coordinate names. I'm not even sure why it works in the first place, given that it initially creates an xarray.DataArray and then converts it to iris, but we should probably explicitly set coordinate names.

Overall, I think we need to create a robust xarray/iris conversion method before we convert any of the internals so that we can ensure that output/behaviour is identical before and after the switch. I've starting making a list of elements we need to add on top of the existing xarray conversion methods, which includes:

Coordinate/dim/variable name uniformity. I think that xarray.to_iris uses the standard_name attribute of variables when converting to iris, which can be different to their name in xarray. This isn't symmetric in the other direction either!
Handling times, as it's a mess in iris. Proleptic gregorian calendar nonsense etc.

May 07 '23 14:05 w-k-jones

We are not setting coordinate names in testing.making_dataset_from_arrin the current version, but we are also not testing feature_detection_multithreshold.

In RC_v1.5.0, we are actually setting the coordinate name for the time dimension, but only if the datatype is set to iris (see here):

if has_time:
    out_arr_iris.add_dim_coord(
        iris.coords.DimCoord(
            pd.date_range(start=time_min, periods=time_num)
            .values.astype("datetime64[s]")
            .astype(int),
            standard_name="time",
            units="seconds since epoch",
        ),
        time_dim_num,
)

The tests in the current version do not fail because we have only tested with iris cubes. So agree with @w-k-jones , that we should set the coordinate names when the test dataset output is set to be a xarray.DataArray.

Also agree that it makes sense to start working on more robust conversion routines, and working with the decorators as suggested by @fsenf might also help to identify more of the oddities that are currently covered. Or were you suggesting to fix the conversion routines entirely before continuing with this strategy, @w-k-jones ?

May 08 '23 00:05 JuliaKukulies

@JuliaKukulies : maybe I misunderstood, but we actually have two testing function for feature_detection_multithreshold starting in line

def test_feature_detection_multiple_z_coords(
            )


def test_feature_detection_setting_multiple():
        )

I extended the first test function to also test for xarray and iris as part of the transition to xarray. I will look at the test data soon to find a solution.

May 08 '23 21:05 fsenf

@fsenf, sorry I figured my comment was a bit unclear. I was just saying that these test functions have only been introduced with the 3D changes. Hence, they exist in the RC_v1.5.0 branch, but not in current main. That is not relevant for the changes you made, but was just a note as to why testing.making_dataset_from_arr is not yet equipped with setting the coordinate names for both iris and xarray.

The code snippet I posted was from RC_v1.5.0 and shows why the tests you mention currently work: for iris test data we are explicitly setting the time dimension and we are currently only testing iris. With your changes, we also test xarray which is not covered by the code. I think the easiest for now would be to set the coordinate names for xarray test data sets and then make sure the conversion routine is robust in converting these correctly. Your option 2 would already internally change the code which wanted to stay away for the moment or not?

May 09 '23 13:05 JuliaKukulies

@JuliaKukulies : Thanks for the hint! make_dataset_from_arr is now adjusted to output xarray with coordinates properly!

May 10 '23 15:05 fsenf

Excellent, @fsenf !

May 10 '23 16:05 JuliaKukulies

FWIW, I now have a mostly working copy of feature detection fully transitioned to xarray from iris, using many of the same principles as discussed here, including a migration of add_coordinates to xarray, which turned out to be a more difficult problem than I had imagined! I need to incorporate some of the tests here, but I wanted to pass it along. I'm happy to commit to the exp branch or do a separate PR into either the exp branch or into the RC_v1.5.x branch.

I still have some more testing and validation to do, and some of the folks running tracking on large datasets at CSU have volunteered to do more testing on this. Once I finish my testing, I'll be happy to contribute it back.

Oct 05 '23 03:10 freemansw1

tobac tobac copied to clipboard

Minimally invasive workflow for xarray transition

Codecov Report

tobac
tobac copied to clipboard