tobac icon indicating copy to clipboard operation
tobac copied to clipboard

Switch feature detection to use xarray internally

Open freemansw1 opened this issue 1 year ago • 19 comments

This is a draft PR that contains my work switching feature detection to use xarray internally, using many of the concepts (and some of the code) from #275. This code uses entirely xarray and pandas for processing feature detection, and removes all iris calls; instead using our internal conversion utilities to convert input iris data to xarray (which should always work, unlike xarray->iris. Further, this change should allow users to provide their data as an xarray.DataArray.

This PR will remain in draft form until at least #293 is merged and intentionally includes those changes.

More tests (particularly around times on different OS versions) should be added, but I wanted to get this into a form that we could discuss.

  • [x] Have you followed our guidelines in CONTRIBUTING.md?
  • [x] Have you self-reviewed your code and corrected any misspellings?
  • [x] Have you written documentation that is easy to understand?
  • [x] Have you written descriptive commit messages?
  • [x] Have you added NumPy docstrings for newly added functions?
  • [x] Have you formatted your code using black?
  • [x] If you have introduced a new functionality, have you added adequate unit tests?
  • [x] Have all tests passed in your local clone?
  • [ ] If you have introduced a new functionality, have you added an example notebook?
  • [ ] Have you kept your pull request small and limited so that it is easy to review?
  • [x] Have the newest changes from this branch been merged?

freemansw1 avatar Oct 11 '23 15:10 freemansw1

Codecov Report

Attention: Patch coverage is 78.20513% with 68 lines in your changes missing coverage. Please review.

Project coverage is 62.27%. Comparing base (5f82097) to head (bbb6b67).

Files Patch % Lines
tobac/utils/internal/iris_utils.py 63.63% 52 Missing :warning:
tobac/utils/internal/xarray_utils.py 88.88% 14 Missing :warning:
tobac/utils/general.py 85.71% 2 Missing :warning:
Additional details and impacted files
@@              Coverage Diff              @@
##           RC_v1.6.0     #354      +/-   ##
=============================================
+ Coverage      60.91%   62.27%   +1.36%     
=============================================
  Files             23       23              
  Lines           3541     3690     +149     
=============================================
+ Hits            2157     2298     +141     
- Misses          1384     1392       +8     
Flag Coverage Δ
unittests 62.27% <78.20%> (+1.36%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Nov 09 '23 22:11 codecov[bot]

Okay, this is now ready for review in my opinion. This PR contains changes from #380.

I'd like feedback (and testing) everywhere, but in particular, I'd like feedback on the new add_coordinates_to_features function in tobac/utils/internal/xarray_utils.py. I had to come up with a clever solution, and I really hate clever solutions.

freemansw1 avatar Dec 04 '23 22:12 freemansw1

Okay, this is now ready for review in my opinion. This PR contains changes from #380.

I'd like feedback (and testing) everywhere, but in particular, I'd like feedback on the new add_coordinates_to_features function in tobac/utils/internal/xarray_utils.py. I had to come up with a clever solution, and I really hate clever solutions.

Aha, multidimensional interpolation with xarray is a massive pain...

w-k-jones avatar Dec 04 '23 22:12 w-k-jones

Okay, this is now ready for review in my opinion. This PR contains changes from #380. I'd like feedback (and testing) everywhere, but in particular, I'd like feedback on the new add_coordinates_to_features function in tobac/utils/internal/xarray_utils.py. I had to come up with a clever solution, and I really hate clever solutions.

Aha, multidimensional interpolation with xarray is a massive pain...

The issue actually came about because we have i,j,k coordinates, and there doesn't appear to be a way to interpolate along those coordinates if a coord and dim share a name. The multidimensional aspect actually wasn't too bad as it only has to interpolate a single point at a time.

freemansw1 avatar Dec 04 '23 22:12 freemansw1

Okay, this is now ready for review in my opinion. This PR contains changes from #380. I'd like feedback (and testing) everywhere, but in particular, I'd like feedback on the new add_coordinates_to_features function in tobac/utils/internal/xarray_utils.py. I had to come up with a clever solution, and I really hate clever solutions.

Aha, multidimensional interpolation with xarray is a massive pain...

The issue actually came about because we have i,j,k coordinates, and there doesn't appear to be a way to interpolate along those coordinates if a coord and dim share a name. The multidimensional aspect actually wasn't too bad as it only has to interpolate a single point at a time.

Ah I see what you mean, it's because of interpolating based on array index location rather than coord value

w-k-jones avatar Dec 04 '23 22:12 w-k-jones

Could this be solved using DataArray.swap_dims?:

variable_da = variable_da.swap_dims({"time":"frame", "south_north":"hdim_1", "west_east":"hdim_2"})
variable_da.interp(
    coords=dict(
        frame=xr.DataArray(Features["frame"].values, dims="feature"),
        hdim_1=xr.DataArray(Features["hdim_1"].values, dims="feature"),
        hdim_2=xr.DataArray(Features["hdim_2"].values, dims="feature"),
    )
)

Suggestion was found via this still open xarray issue from 4 years ago... https://github.com/pydata/xarray/issues/3343

w-k-jones avatar Dec 04 '23 22:12 w-k-jones

Could this be solved using DataArray.swap_dims?:

variable_da = variable_da.swap_dims({"time":"frame", "south_north":"hdim_1", "west_east":"hdim_2"})
variable_da.interp(
    coords=dict(
        frame=xr.DataArray(Features["frame"].values, dims="feature"),
        hdim_1=xr.DataArray(Features["hdim_1"].values, dims="feature"),
        hdim_2=xr.DataArray(Features["hdim_2"].values, dims="feature"),
    )
)

Suggestion was found via this still open xarray issue from 4 years ago... pydata/xarray#3343

Yes, good catch. I hadn't seen that; I came up with the same pathway, but was doing a DataArray->Dataset conversion first because I thought it would be named something sensible like rename_dims. I've revised this now.

freemansw1 avatar Dec 05 '23 15:12 freemansw1

Linting results by Pylint:

Your code has been rated at 8.70/10 (previous run: 8.70/10, +0.00) The linting score is an indicator that reflects how well your code version follows Pylint’s coding standards and quality metrics with respect to the RC_v1.6.0 branch. A decrease usually indicates your new code does not fully meet style guidelines or has potential errors.

github-actions[bot] avatar Dec 05 '23 15:12 github-actions[bot]

I have started my review on this, along with #380 , but I think it would be good to discuss whether we want to do refactoring during the switch to xarray or afterwards. In particular I think we need to make sure xarray_utils is in the best state possible as it will be important for refactoring segmentation and tracking, but I'm not sure how much the current implementation here is limited by needing to match iris_utils capability.

I'm also not sure that conversion of datetime etc inside feature detection is the best way to go. In the current implementation the converted datetime to cftime doesn't match to iris input, as it doesn't have information about the calendar. I think it may be easier and clearer to do the conversion in the decorator, and keep the function unaware of whether the input data is being converted or not, but there may be cases were the other way around is easier. Will be good to discuss this tomorrow!

w-k-jones avatar Dec 07 '23 21:12 w-k-jones

Note to self: this depends on #380 to merge.

freemansw1 avatar Feb 23 '24 02:02 freemansw1

Got about halfway through the review comments... Next few will take a bit longer.

freemansw1 avatar Feb 23 '24 03:02 freemansw1

OK, I have resolved all review comments here. @w-k-jones ready for a second review. We will need another review from someone else, too, but this is not urgent as I need to get #380 resolved first.

freemansw1 avatar Feb 28 '24 02:02 freemansw1

Great, thanks for making the revisions. I've had a look through the changes and tested the code with both xarray and iris locally and have no issues to issues. Should we start an RC_v1.6.0 branch to merge this into?

Yes, we will need to do that. I'll wait until we have to though to minimize the amount of time we have to maintain separate branches. We still need one more reviewer, @JuliaKukulies or @kelcyno are you up for it?

freemansw1 avatar Mar 13 '24 15:03 freemansw1

Great, thanks for making the revisions. I've had a look through the changes and tested the code with both xarray and iris locally and have no issues to issues. Should we start an RC_v1.6.0 branch to merge this into?

Yes, we will need to do that. I'll wait until we have to though to minimize the amount of time we have to maintain separate branches. We still need one more reviewer, @JuliaKukulies or @kelcyno are you up for it?

I can do the second review within the next couple of days!

I also started a RC_v1.6.0 branch to merge this PR into. We can wait until the release of v1.5.3 to update the branch before this PR is merged. But I will try to review asap :)

JuliaKukulies avatar Mar 22 '24 14:03 JuliaKukulies

Ok, @JuliaKukulies, I think I've addressed your concerns. Ready for your re-review. Hopefully we can get this merged!

freemansw1 avatar May 09 '24 21:05 freemansw1

Excellent, @freemansw1! This looks good except for that there is still the issue with preserve_iris_datetime_type which does not result in the same datetime format as when I provide iris input when this parameters is True (see my last comment).

Sorry I had not marked the bug at the actual code locations, so I get that you might have overseen this. I also cannot really figure out where this bug comes from, since everywhere in the code where preserve_iris_datetime_type is used looks like it should work. Could you reproduce this or is this an issue with my test data?

JuliaKukulies avatar May 10 '24 00:05 JuliaKukulies

Excellent, @freemansw1! This looks good except for that there is still the issue with preserve_iris_datetime_type which does not result in the same datetime format as when I provide iris input when this parameters is True (see my last comment).

Sorry I had not marked the bug at the actual code locations, so I get that you might have overseen this. I also cannot really figure out where this bug comes from, since everywhere in the code where preserve_iris_datetime_type is used looks like it should work. Could you reproduce this or is this an issue with my test data?

Apologies, yeah, I had missed this. I will work to reproduce and try this out this weekend/early next week.

freemansw1 avatar May 13 '24 17:05 freemansw1

I think I've noticed this before, and the issue is to do with calculating the timesteps from cftime format. I think that in the current versions of tobac the time is also returned with an object dtype, but I would be happier to change it to datetime64 in all cases as I currently have to do this for most use cases...

w-k-jones avatar May 13 '24 23:05 w-k-jones

Note to self: add_coords for xarray fails when having coordinates to interpolate along dimensions other than x/y.

freemansw1 avatar Jun 03 '24 02:06 freemansw1

OK, I think I have resolved the datetime issues. It's somewhat of a hacky solution, but basically, xarray only converts to datetime64 if the time received is in Gregorian time. I now check for this and convert back to Gregorian time. For all other calendars (thanks @harrietgilmour for helping with this part), xarray leaves it as an object (which does still break our code, but alas).

freemansw1 avatar Jul 17 '24 20:07 freemansw1

Note to self: add_coords for xarray fails when having coordinates to interpolate along dimensions other than x/y.

I have now also resolved this issue. The solution here results in some decisions that were contrary to the discussion we had at the last dev meeting (notably that len(1) coordinates are still appended to the dataframe), but I think we are otherwise okay. We can talk about this at the dev meeting tomorrow.

freemansw1 avatar Jul 18 '24 17:07 freemansw1

@w-k-jones and @JuliaKukulies I am now ready for a re-review.

freemansw1 avatar Jul 18 '24 17:07 freemansw1

OK, I think I have resolved the datetime issues. It's somewhat of a hacky solution, but basically, xarray only converts to datetime64 if the time received is in Gregorian time. I now check for this and convert back to Gregorian time. For all other calendars (thanks @harrietgilmour for helping with this part), xarray leaves it as an object (which does still break our code, but alas).

OK, I have tested this again with my data and I thought I understood what you explained in the dev meeting today, but now i am a little confused again :) I basically get:

  • for xarray input and preserve_iris_datetime_type = True -> datetime64

  • for xarray input and preserve_iris_datetime_type = False -> datetime64

  • for iris input and preserve_iris_datetime_type = True -> object (cftime.DatetimeGregorian)

  • for iris input and preserve_iris_datetime_type = False -> object (cftime.DatetimeGregorian)

Is that what we want?

JuliaKukulies avatar Jul 19 '24 19:07 JuliaKukulies

OK, I think I have resolved the datetime issues. It's somewhat of a hacky solution, but basically, xarray only converts to datetime64 if the time received is in Gregorian time. I now check for this and convert back to Gregorian time. For all other calendars (thanks @harrietgilmour for helping with this part), xarray leaves it as an object (which does still break our code, but alas).

OK, I have tested this again with my data and I thought I understood what you explained in the dev meeting today, but now i am a little confused again :) I basically get:

  • for xarray input and preserve_iris_datetime_type = True -> datetime64
  • for xarray input and preserve_iris_datetime_type = False -> datetime64
  • for iris input and preserve_iris_datetime_type = True -> object (cftime.DatetimeGregorian)
  • for iris input and preserve_iris_datetime_type = False -> object (cftime.DatetimeGregorian)

Is that what we want?

Ok, the first of those three are all correct, but the last one isn't.

freemansw1 avatar Jul 22 '24 14:07 freemansw1

OK, I think I have resolved the datetime issues. It's somewhat of a hacky solution, but basically, xarray only converts to datetime64 if the time received is in Gregorian time. I now check for this and convert back to Gregorian time. For all other calendars (thanks @harrietgilmour for helping with this part), xarray leaves it as an object (which does still break our code, but alas).

OK, I have tested this again with my data and I thought I understood what you explained in the dev meeting today, but now i am a little confused again :) I basically get:

  • for xarray input and preserve_iris_datetime_type = True -> datetime64
  • for xarray input and preserve_iris_datetime_type = False -> datetime64
  • for iris input and preserve_iris_datetime_type = True -> object (cftime.DatetimeGregorian)
  • for iris input and preserve_iris_datetime_type = False -> object (cftime.DatetimeGregorian)

Is that what we want?

Ok, the first of those three are all correct, but the last one isn't.

Thats what I thought, and we would expect datetime64 for the last one, right? Have you tested this with the data I sent you or another dataset? Sorry to cause more struggle..

JuliaKukulies avatar Jul 22 '24 14:07 JuliaKukulies

OK, I think I have resolved the datetime issues. It's somewhat of a hacky solution, but basically, xarray only converts to datetime64 if the time received is in Gregorian time. I now check for this and convert back to Gregorian time. For all other calendars (thanks @harrietgilmour for helping with this part), xarray leaves it as an object (which does still break our code, but alas).

OK, I have tested this again with my data and I thought I understood what you explained in the dev meeting today, but now i am a little confused again :) I basically get:

  • for xarray input and preserve_iris_datetime_type = True -> datetime64
  • for xarray input and preserve_iris_datetime_type = False -> datetime64
  • for iris input and preserve_iris_datetime_type = True -> object (cftime.DatetimeGregorian)
  • for iris input and preserve_iris_datetime_type = False -> object (cftime.DatetimeGregorian)

Is that what we want?

Ok, the first of those three are all correct, but the last one isn't.

Thats what I thought, and we would expect datetime64 for the last one, right? Have you tested this with the data I sent you or another dataset? Sorry to cause more struggle..

Ok this one was easy- we didn't actually expect the kwarg in feature_detection_multithreshold. I've fixed that.

freemansw1 avatar Jul 23 '24 14:07 freemansw1

OK, I think I have resolved the datetime issues. It's somewhat of a hacky solution, but basically, xarray only converts to datetime64 if the time received is in Gregorian time. I now check for this and convert back to Gregorian time. For all other calendars (thanks @harrietgilmour for helping with this part), xarray leaves it as an object (which does still break our code, but alas).

OK, I have tested this again with my data and I thought I understood what you explained in the dev meeting today, but now i am a little confused again :) I basically get:

  • for xarray input and preserve_iris_datetime_type = True -> datetime64
  • for xarray input and preserve_iris_datetime_type = False -> datetime64
  • for iris input and preserve_iris_datetime_type = True -> object (cftime.DatetimeGregorian)
  • for iris input and preserve_iris_datetime_type = False -> object (cftime.DatetimeGregorian)

Is that what we want?

Ok, the first of those three are all correct, but the last one isn't.

Thats what I thought, and we would expect datetime64 for the last one, right? Have you tested this with the data I sent you or another dataset? Sorry to cause more struggle..

Ok this one was easy- we didn't actually expect the kwarg in feature_detection_multithreshold. I've fixed that.

Perfect, thanks for looking into this and fixing this. I am happy to approve this now once the latest changes from RC_v1.5.x are merged in

JuliaKukulies avatar Jul 24 '24 21:07 JuliaKukulies

OK, changes all merged in and everything. @JuliaKukulies and @w-k-jones are you happy for me to merge into RC_v1.6.0?

freemansw1 avatar Jul 25 '24 13:07 freemansw1

OK, changes all merged in and everything. @JuliaKukulies and @w-k-jones are you happy for me to merge into RC_v1.6.0?

Yes, please go ahead:) !

JuliaKukulies avatar Jul 25 '24 14:07 JuliaKukulies

Also happy for this to be merged :)

w-k-jones avatar Jul 25 '24 14:07 w-k-jones