flopy icon indicating copy to clipboard operation
flopy copied to clipboard

perf(geospatial): optimize grid.intersect() with KD-tree + ConvexHull

Open aacovski opened this issue 1 month ago • 18 comments

Summary

Optimize grid.intersect() performance across all grid types using KD-tree spatial indexing with pre-computed ConvexHull equations and fully vectorized structured grid with searchsorted. This is all on the assumption that grids will never be non-convex and always layered.

Performance Improvements

geospatial_performance_results.csv ~50-300x speedup

Key Optimizations

  1. KD-tree + vertices indexing: O(log n) candidate cell search using centroids + vertices
  2. Pre-computed ConvexHull equations: Vectorized point-in-polygon testing via hyperplane equations
  3. Bounding box rejection: Fast elimination before expensive polygon tests
  4. Deepcopy fix: Disabled _copy_cache during index build to avoid bottleneck
  5. Vectorized result processing: Eliminated Python loops in grid.intersect() methods

Changes

  • Add flopy/utils/geospatial_index.py with GeospatialIndex class
  • Optimize VertexGrid.intersect() - 2D kdtree, 2D convexhull and layer search
  • Optimize UnstructuredGrid.intersect() varies_by_layers=False - 2D kdtree, 2D convexhull and layer search
  • Optimize UnstructuredGrid.intersect() varies_by_layers=True - 3D kdtree, 3D convexhull
  • Optimize StructuredGrid.intersect() with vectorized row/col finding - searchsorted
  • Add edge case tests for thin sliver cells, boundaries, corners
  • GeospatialIndex now raises ValueError if passed a StructuredGrid
  • All methods now return np.nan for outside points (previously StructuredGrid returned (None, nan, nan) with z, and GeospatialIndex.query_point returned None)
  • Array queries return int64 when all points inside, float64 when nan values present
  • Matches behavior of native VertexGrid.intersect() and UnstructuredGrid.intersect() on develop
  • 1:1 point-to-cell mapping
  • All methods guarantee same number of outputs as inputs
  • Outside points get nan instead of being dropped or raising errors (with forgive=True)

Test coverage

  • ~38 tests covering return type consistency, tie-breaking, 3D layered grids, and boundary handling

Regressions

  • grids with less than 10 cells or points perform worse due to overhead associated with new methods, but execute in less than 0.5 ms so it doesn't matter.

aacovski avatar Nov 23 '25 05:11 aacovski

Codecov Report

:x: Patch coverage is 87.13911% with 49 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 72.7%. Comparing base (556c088) to head (9320199). :warning: Report is 91 commits behind head on develop.

Files with missing lines Patch % Lines
flopy/utils/geospatial_index.py 86.9% 49 Missing :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2654      +/-   ##
===========================================
+ Coverage     55.5%    72.7%   +17.1%     
===========================================
  Files          644      669      +25     
  Lines       124135   129750    +5615     
===========================================
+ Hits         68947    94342   +25395     
+ Misses       55188    35408   -19780     
Files with missing lines Coverage Δ
flopy/discretization/grid.py 75.7% <100.0%> (-0.2%) :arrow_down:
flopy/discretization/structuredgrid.py 50.3% <ø> (+2.8%) :arrow_up:
flopy/discretization/unstructuredgrid.py 75.6% <ø> (-5.9%) :arrow_down:
flopy/discretization/vertexgrid.py 79.2% <ø> (-4.5%) :arrow_down:
flopy/utils/geospatial_index.py 86.9% <86.9%> (ø)

... and 556 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Nov 23 '25 23:11 codecov[bot]

@wpbonelli I’m looking for some guidance on how to handle these tests. The issue arises when a point lies exactly on the vertices of grid cells, making it ambiguous which of the 3-n cells it should belong to:

FAILED test_grid.py::test_intersection - assert 268 == 267

In these non-deterministic cases, different methods return different cell indices. I can force the kdtree-based methods to return the expected value, but that comes with a performance cost. It then causes another test failure (assert DIS == DISV), since those methods also behave differently. I could also update the structured grid path to enforce consistent behavior so all tests pass.

Before making those changes, I’d like your input on the preferred approach to resolve these cascading test failures.

aacovski avatar Nov 25 '25 19:11 aacovski

No performance hit with changing tiebreaking

aacovski avatar Nov 26 '25 08:11 aacovski

@wpbonelli I’m looking for some guidance on how to handle these tests. The issue arises when a point lies exactly on the vertices of grid cells, making it ambiguous which of the 3-n cells it should belong to:

@aacovski in the gridintersect module, in case of ties, we decided to always return the lowest grid cell number. But I think that any approach as long as it is clearly documented (and maybe consistent across runs) should be fine?

dbrakenhoff avatar Nov 26 '25 09:11 dbrakenhoff

Agree with @dbrakenhoff I think lowest number is the existing convention so might as well stick to it if there is no performance cost?

wpbonelli avatar Nov 26 '25 11:11 wpbonelli

Great, that's what I did. I'll consider adding more tests for coverage, but otherwise I think the PR is ready. Thanks

aacovski avatar Nov 26 '25 14:11 aacovski

across all grid types

grids will never be non-convex and always layered

does the latter mean the new grid index and related performance boost don't support DISU?

wpbonelli avatar Nov 26 '25 15:11 wpbonelli

does the latter mean the new grid index and related performance boost don't support DISU?

As I understand it, “fully 3D unstructured” grids are more characteristic of FEM models, and I’m not aware of MODFLOW supporting that kind of discretization. That said, my code does support DISU in the sense that varies_by_layers can be set to either True or False... I'm not aware of any FDM or FEM software supporting non-convex elemnts/grids. Please let me know if I'm missing anything.

aacovski avatar Nov 26 '25 15:11 aacovski

I am going to clean up docstrings later today as I noticed some old info. Sorry about that. I want to add some more tests too.

aacovski avatar Nov 26 '25 16:11 aacovski

@aacovski I guess I just didn't properly understand what you meant by "always layered"

wpbonelli avatar Nov 28 '25 17:11 wpbonelli

it's maybe worth noting that the index assumes that x/y cell centers are in fact centroid coords?

I hadn’t considered that the provided cell centers might not correspond to geometric centroids even for convex polygons. The index shouldn’t assume that, so I’ll update to clarify how cell centers are interpreted. Thanks for pointing it out!

aacovski avatar Nov 29 '25 03:11 aacovski

The index shouldn’t assume that [the provided cell centers correspond to geometric centroids], so I’ll update to clarify how cell centers are interpreted.

Ah, ok- and the index doesn't even assume the center is inside the cell? It just performs better if so?

I hadn’t considered that the provided cell centers might not correspond to geometric centroids even for convex polygons.

I think the intention was/is for it to be cell centroid, even though we don't check this when provided by the user or read from a binary grid file. Evidence:

  • https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/mf6/utils/binarygrid_util.py#L327 defines cell centers as centroids, maybe (but in any case suggestively) as a typo
    • https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/mf6/utils/binarygrid_util.py#L329
  • https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/cvfdutil.py#L17 is used when calculating x/y cell centers in other routines
    • https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/cvfdutil.py#L143
    • https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/cvfdutil.py#L306
    • https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/cvfdutil.py#L369
  • https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/geometry.py#L739 is used from
    • https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/discretization/unstructuredgrid.py#L1062

Maybe something to clarify in the MF6IO guide. For DIS it doesn't matter but it becomes relevant for DISV/U.

wpbonelli avatar Dec 01 '25 14:12 wpbonelli

@aacovski it wasn't really clear from previous comment but I think it's best to revert the docstrings to say centroid rather than leaving it ambiguous. my first comment was really to point out the (unwanted?) ambiguity in the existing grid classes, not to say your new ones should change

wpbonelli avatar Dec 02 '25 11:12 wpbonelli

Another thought: why not move the structured grid searchsorted query implementation to GeospatialIndex? That way all the Grid.intersect methods could use the index. Since the index receives the grid on init, it knows what kind it is. This seems a bit neater, and also opens the way for cases where a structured grid query might still find the KD tree useful, e.g. ball queries.

Also just want to say, thanks for working on this. This seems like it will be a great asset both in the near term and longer term- in 4.x it could sit behind xarray-based user APIs, whether builtin .sel syntax (point queries) or, for the fancy stuff, maybe an accessor.

wpbonelli avatar Dec 02 '25 21:12 wpbonelli

I'll work on implementing that: moving intersect to base class and remove the subclass intersect methods. Regarding the docstrings: since there is a difference between centroid and centers, based on everything I've read, it makes more sense to use the more general case (centers) rather than centroids which could be inaccurate.

aacovski avatar Dec 02 '25 23:12 aacovski

I'm pretty sure "cell centers" is meant as centroids, that's why I'm suggesting to change the doctrings up in the grids themselves. But I could be wrong. Hopefully someone with more tenure can chime in.

wpbonelli avatar Dec 03 '25 00:12 wpbonelli

I'm pretty sure "cell centers" is meant as centroids, that's why I'm suggesting to change the doctrings up in the grids themselves. But I could be wrong. Hopefully someone with more tenure can chime in.

For DISV grids the user provides the x and y node coordinates for each cell. In general, this coordinate is the centroid of the cell, however, there is at least one situation where this is not the case. Along the edge of the model domain, there may be half cells (common with Voronoi) where it may be beneficial to assign the x,y cell center coordinate to be on the model domain boundary (and the edge of the cell). The location of a constant-head boundary (or constant concentration), for example, would effectively be assigned to the node coordinate, which a user may want to put on the model edge. This x, y node coordinate is used by DISV (and DISU in some cases) to calculate geometric properties for connections between adjacent cells.

christianlangevin avatar Dec 03 '25 13:12 christianlangevin

thanks @christianlangevin. so @aacovski the more general "cell centers" is better as you suggest. we may update the description upstream in mf6io but no need to change the grid docstrings here

wpbonelli avatar Dec 03 '25 14:12 wpbonelli