perf(geospatial): optimize grid.intersect() with KD-tree + ConvexHull
Summary
Optimize grid.intersect() performance across all grid types using KD-tree spatial indexing with pre-computed ConvexHull equations and fully vectorized structured grid with searchsorted. This is all on the assumption that grids will never be non-convex and always layered.
Performance Improvements
geospatial_performance_results.csv ~50-300x speedup
Key Optimizations
- KD-tree + vertices indexing: O(log n) candidate cell search using centroids + vertices
- Pre-computed ConvexHull equations: Vectorized point-in-polygon testing via hyperplane equations
- Bounding box rejection: Fast elimination before expensive polygon tests
- Deepcopy fix: Disabled
_copy_cacheduring index build to avoid bottleneck - Vectorized result processing: Eliminated Python loops in grid.intersect() methods
Changes
- Add
flopy/utils/geospatial_index.pywith GeospatialIndex class - Optimize VertexGrid.intersect() - 2D kdtree, 2D convexhull and layer search
- Optimize UnstructuredGrid.intersect()
varies_by_layers=False- 2D kdtree, 2D convexhull and layer search - Optimize UnstructuredGrid.intersect()
varies_by_layers=True- 3D kdtree, 3D convexhull - Optimize StructuredGrid.intersect() with vectorized row/col finding - searchsorted
- Add edge case tests for thin sliver cells, boundaries, corners
- GeospatialIndex now raises ValueError if passed a StructuredGrid
- All methods now return np.nan for outside points (previously StructuredGrid returned (None, nan, nan) with z, and GeospatialIndex.query_point returned None)
- Array queries return int64 when all points inside, float64 when nan values present
- Matches behavior of native VertexGrid.intersect() and UnstructuredGrid.intersect() on develop
- 1:1 point-to-cell mapping
- All methods guarantee same number of outputs as inputs
- Outside points get nan instead of being dropped or raising errors (with forgive=True)
Test coverage
- ~38 tests covering return type consistency, tie-breaking, 3D layered grids, and boundary handling
Regressions
- grids with less than 10 cells or points perform worse due to overhead associated with new methods, but execute in less than 0.5 ms so it doesn't matter.
Codecov Report
:x: Patch coverage is 87.13911% with 49 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 72.7%. Comparing base (556c088) to head (9320199).
:warning: Report is 91 commits behind head on develop.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| flopy/utils/geospatial_index.py | 86.9% | 49 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## develop #2654 +/- ##
===========================================
+ Coverage 55.5% 72.7% +17.1%
===========================================
Files 644 669 +25
Lines 124135 129750 +5615
===========================================
+ Hits 68947 94342 +25395
+ Misses 55188 35408 -19780
| Files with missing lines | Coverage Δ | |
|---|---|---|
| flopy/discretization/grid.py | 75.7% <100.0%> (-0.2%) |
:arrow_down: |
| flopy/discretization/structuredgrid.py | 50.3% <ø> (+2.8%) |
:arrow_up: |
| flopy/discretization/unstructuredgrid.py | 75.6% <ø> (-5.9%) |
:arrow_down: |
| flopy/discretization/vertexgrid.py | 79.2% <ø> (-4.5%) |
:arrow_down: |
| flopy/utils/geospatial_index.py | 86.9% <86.9%> (ø) |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@wpbonelli I’m looking for some guidance on how to handle these tests. The issue arises when a point lies exactly on the vertices of grid cells, making it ambiguous which of the 3-n cells it should belong to:
FAILED test_grid.py::test_intersection - assert 268 == 267
In these non-deterministic cases, different methods return different cell indices. I can force the kdtree-based methods to return the expected value, but that comes with a performance cost. It then causes another test failure (assert DIS == DISV), since those methods also behave differently. I could also update the structured grid path to enforce consistent behavior so all tests pass.
Before making those changes, I’d like your input on the preferred approach to resolve these cascading test failures.
No performance hit with changing tiebreaking
@wpbonelli I’m looking for some guidance on how to handle these tests. The issue arises when a point lies exactly on the vertices of grid cells, making it ambiguous which of the 3-n cells it should belong to:
@aacovski in the gridintersect module, in case of ties, we decided to always return the lowest grid cell number. But I think that any approach as long as it is clearly documented (and maybe consistent across runs) should be fine?
Agree with @dbrakenhoff I think lowest number is the existing convention so might as well stick to it if there is no performance cost?
Great, that's what I did. I'll consider adding more tests for coverage, but otherwise I think the PR is ready. Thanks
across all grid types
grids will never be non-convex and always layered
does the latter mean the new grid index and related performance boost don't support DISU?
does the latter mean the new grid index and related performance boost don't support DISU?
As I understand it, “fully 3D unstructured” grids are more characteristic of FEM models, and I’m not aware of MODFLOW supporting that kind of discretization. That said, my code does support DISU in the sense that varies_by_layers can be set to either True or False... I'm not aware of any FDM or FEM software supporting non-convex elemnts/grids. Please let me know if I'm missing anything.
I am going to clean up docstrings later today as I noticed some old info. Sorry about that. I want to add some more tests too.
@aacovski I guess I just didn't properly understand what you meant by "always layered"
it's maybe worth noting that the index assumes that x/y cell centers are in fact centroid coords?
I hadn’t considered that the provided cell centers might not correspond to geometric centroids even for convex polygons. The index shouldn’t assume that, so I’ll update to clarify how cell centers are interpreted. Thanks for pointing it out!
The index shouldn’t assume that [the provided cell centers correspond to geometric centroids], so I’ll update to clarify how cell centers are interpreted.
Ah, ok- and the index doesn't even assume the center is inside the cell? It just performs better if so?
I hadn’t considered that the provided cell centers might not correspond to geometric centroids even for convex polygons.
I think the intention was/is for it to be cell centroid, even though we don't check this when provided by the user or read from a binary grid file. Evidence:
- https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/mf6/utils/binarygrid_util.py#L327 defines cell centers as centroids, maybe (but in any case suggestively) as a typo
- https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/mf6/utils/binarygrid_util.py#L329
- https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/cvfdutil.py#L17 is used when calculating x/y cell centers in other routines
- https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/cvfdutil.py#L143
- https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/cvfdutil.py#L306
- https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/cvfdutil.py#L369
- https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/utils/geometry.py#L739 is used from
- https://github.com/modflowpy/flopy/blob/e42f51117d1e0db8d9d7eb22eb8eae713a98262e/flopy/discretization/unstructuredgrid.py#L1062
Maybe something to clarify in the MF6IO guide. For DIS it doesn't matter but it becomes relevant for DISV/U.
@aacovski it wasn't really clear from previous comment but I think it's best to revert the docstrings to say centroid rather than leaving it ambiguous. my first comment was really to point out the (unwanted?) ambiguity in the existing grid classes, not to say your new ones should change
Another thought: why not move the structured grid searchsorted query implementation to GeospatialIndex? That way all the Grid.intersect methods could use the index. Since the index receives the grid on init, it knows what kind it is. This seems a bit neater, and also opens the way for cases where a structured grid query might still find the KD tree useful, e.g. ball queries.
Also just want to say, thanks for working on this. This seems like it will be a great asset both in the near term and longer term- in 4.x it could sit behind xarray-based user APIs, whether builtin .sel syntax (point queries) or, for the fancy stuff, maybe an accessor.
I'll work on implementing that: moving intersect to base class and remove the subclass intersect methods. Regarding the docstrings: since there is a difference between centroid and centers, based on everything I've read, it makes more sense to use the more general case (centers) rather than centroids which could be inaccurate.
I'm pretty sure "cell centers" is meant as centroids, that's why I'm suggesting to change the doctrings up in the grids themselves. But I could be wrong. Hopefully someone with more tenure can chime in.
I'm pretty sure "cell centers" is meant as centroids, that's why I'm suggesting to change the doctrings up in the grids themselves. But I could be wrong. Hopefully someone with more tenure can chime in.
For DISV grids the user provides the x and y node coordinates for each cell. In general, this coordinate is the centroid of the cell, however, there is at least one situation where this is not the case. Along the edge of the model domain, there may be half cells (common with Voronoi) where it may be beneficial to assign the x,y cell center coordinate to be on the model domain boundary (and the edge of the cell). The location of a constant-head boundary (or constant concentration), for example, would effectively be assigned to the node coordinate, which a user may want to put on the model edge. This x, y node coordinate is used by DISV (and DISU in some cases) to calculate geometric properties for connections between adjacent cells.
thanks @christianlangevin. so @aacovski the more general "cell centers" is better as you suggest. we may update the description upstream in mf6io but no need to change the grid docstrings here