tobac icon indicating copy to clipboard operation
tobac copied to clipboard

Add PBC capability to `merge_split_MEST`

Open w-k-jones opened this issue 1 year ago • 6 comments

This turned into a general refactor of merge_split_MEST, with the following changes and additions:

  1. Adds PBC capabilities, using the same BallTree distance search approach as used in feature detection.
  2. Moves filtering by frame_len before calculating the minimum spanning tree, improving the number of merging/splitting cells linked particularly for long tracking time periods.
  3. Changes the cell merging logic to use scipy.sparse.csgraph.connected_components, improving performance
  4. Made sure all output arrays are int dtype, and started the track id from 1 rather than 0

I'm also going to look into adding a flag for whether a cell started with a split or ended with a merge, and which object it was merged into/split from

  • [x] Have you followed our guidelines in CONTRIBUTING.md?
  • [x] Have you self-reviewed your code and corrected any misspellings?
  • [x] Have you written documentation that is easy to understand?
  • [x] Have you written descriptive commit messages?
  • [x] Have you added NumPy docstrings for newly added functions?
  • [x] Have you formatted your code using black?
  • [x] If you have introduced a new functionality, have you added adequate unit tests?
  • [x] Have all tests passed in your local clone?
  • [x] If you have introduced a new functionality, have you added an example notebook?
  • [x] Have you kept your pull request small and limited so that it is easy to review?
  • [x] Have the newest changes from this branch been merged?

w-k-jones avatar Nov 29 '23 13:11 w-k-jones

Keeping as draft until #368 is merged as this contains the same commits

w-k-jones avatar Nov 29 '23 13:11 w-k-jones

Oh this looks really interesting @w-k-jones I'll get started on it!

kelcyno avatar Nov 29 '23 17:11 kelcyno

Ok, I ended up going down a bit of a rabbit hole there. I found that because the filtering for frame_len was carried out after the minimum euclidean spanning tree calculation, valid links were being removed because a spatially closer cell at another point in time was being removed and therefore removing the edges between two neighbouring cells. This had more of an impact the longer the time period being tracked. When I swapped to filtering by frame_len before calculating the MEST it massively increased the number of linked cells, which then caused the rest of the merge/split routine to run slower. I've been experimenting with scipy.sparse.csgraph.connected_components recently, and this seemed like a really good application for linking the neighbouring cells into connected tracks

w-k-jones avatar Nov 29 '23 17:11 w-k-jones

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (53df42d) 56.79% compared to head (e9ec119) 56.61%. Report is 43 commits behind head on RC_v1.5.x.

Additional details and impacted files
@@              Coverage Diff              @@
##           RC_v1.5.x     #372      +/-   ##
=============================================
- Coverage      56.79%   56.61%   -0.18%     
=============================================
  Files             20       20              
  Lines           3435     3407      -28     
=============================================
- Hits            1951     1929      -22     
+ Misses          1484     1478       -6     
Flag Coverage Δ
unittests 56.61% <100.00%> (-0.18%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Nov 29 '23 21:11 codecov[bot]

I haven't started reviewing this yet, but I wonder if we shouldn't try to push this to v1.5.3? I'm starting to get antsy about v1.5.2 coming out.

freemansw1 avatar Nov 29 '23 21:11 freemansw1

Reminder to myself: currently the PBC handling requires all of h1_min, h1_max, h2_min and h2_max to be provided even if PBCs are calculated over only one dimension. I should fix this to use default values for the dimension not being used before merge

w-k-jones avatar Mar 09 '24 00:03 w-k-jones

Oh - per our last Tobac meeting we talked about if Will's changes needed to be a separate merge/split method - it does not need to be a separate method.

kelcyno avatar Jul 19 '24 05:07 kelcyno

Oh - per our last Tobac meeting we talked about if Will's changes needed to be a separate merge/split method - it does not need to be a separate method.

@w-k-jones @kelcyno I tested this with some model data over CONUS and found that Will's merge/split method results in a significantly lower number (about a 5th) of unique track IDs (basically the variables feature_parent_track_id and cell_parent_track_id). Not sure if this is a bug but probably not intended behaviour?

Is the assignment of track IDs here really the same to what Kelcy does in the long loop?

https://github.com/w-k-jones/tobac/blob/e9ec119b45becbcf0057cfc11114dd71fe3e33ab/tobac/merge_split.py#L205-L213

@w-k-jones I tested this with the MCS criteria for our DYAMOND project and because the number of unique track IDs is so different using the modified merge split function, this results also in a very different number of MCSs if the criteria are applied to the clusters belonging to a track.

JuliaKukulies avatar Jul 31 '24 00:07 JuliaKukulies

@JuliaKukulies I believe the change in the results is due to a bug I fixed in the old version: because the minimum spanning tree was applied before the time filter it would result in cell starts/ends which were close together in x/y coords but far apart from time being identified as neighbours, but then subsequently trimmed due to the time gap. This could mean that other cells which were within the time range for merging but further apart in x/y would not be linked. I changed the order in which these operations were applied to avoid this issue, which will result in more cell merges and a smaller number of unique tracks. This wasn’t found originally as the method was designed for short time periods, but became apparent during MCSMIP due to the long time period. Could you test running your data over different length time slices (e.g. 1 hour, 2 hour, 3 hours etc) and see if the proportion of tracks/cells remains roughly the same for the new version but increases (i.e. more tracks and fewer cells per track) with the old version?

w-k-jones avatar Aug 28 '24 17:08 w-k-jones

@JuliaKukulies I believe the change in the results is due to a bug I fixed in the old version: because the minimum spanning tree was applied before the time filter it would result in cell starts/ends which were close together in x/y coords but far apart from time being identified as neighbours, but then subsequently trimmed due to the time gap. This could mean that other cells which were within the time range for merging but further apart in x/y would not be linked. I changed the order in which these operations were applied to avoid this issue, which will result in more cell merges and a smaller number of unique tracks. This wasn’t found originally as the method was designed for short time periods, but became apparent during MCSMIP due to the long time period. Could you test running your data over different length time slices (e.g. 1 hour, 2 hour, 3 hours etc) and see if the proportion of tracks/cells remains roughly the same for the new version but increases (i.e. more tracks and fewer cells per track) with the old version?

@w-k-jones Thanks for this clarification! That makes a lot of sense. I ran tests with different time slices and you are absolutely right. The number of unique tracks and their proportion to cells increases in the old version but the proportion stays about the same in your updated version. The difference in unique tracks is not very high in the first hours and even days, but then becomes quite high when running the merging and splitting module for tracks of a whole month.

JuliaKukulies avatar Aug 31 '24 01:08 JuliaKukulies