jts
jts copied to clipboard
RFC: Fixing invalid geometry
It is desirable to have a way to convert an invalid geometry into a valid one that preserves the character of the input (to some extent). This allows the fixed geometry to be used in JTS operations.
See GeometryFixer prototype code.
Goals
- Allow converting any geometry to one which is valid, preserving as much as possible of the original appearance
Non-Goals
- Preserving all vertices in input - some vertices may be dropped to produce a "natural" repair
- Precision reduction - can be done subsequent to processing
- Normalization - can be done subsequent to processing
- Simplification - can be done before processing
Motivating Cases
Descriptions use following terminology: "touch" means touches in a point, "adjacent" means touches along a line.
See XML file for example data test cases.
- Bow-tie polygon
- Bow-tie polygon with multiple crossings
- Checkerboard polygon
- Polygon with zero area and collapses to a line
- Polygon with zero area and collapses to a point
- Polygon with portion of the shell boundary which collapses to a line (internal or external)
- Polygon with portion of the shell boundary which collapses to a line and encloses area
- Polygon which overlaps itself
- Polygon with linework separating it into two adjacent regions (i.e cutting it in two)
- Polygon with collapsed linework that encloses area (i.e. with an inverted boundary that is adjacent)
- Polygon with overlapping holes (intersecting in points)
- Polygon with overlapping holes (intersecting along edges)
- Polygon with adjacent holes
- Polygon with holes forming a separated region
- Polygon with hole adjacent to shell
- Polygon with holes adjacent to shell and separating a region
- Polygon with hole overlapping shell
- Polygon with hole surrounding shell
- Polygon with bowtie hole overlapping shell - intersection inside
- Polygon with bowtie hole overlapping shell - intersection outside
- Polygon with nested holes - disjoint
- Polygon with nested holes - touching
- Polygon with nested holes - adjacent
- MultiPolygon with overlapping elements
- MultiPolygon with adjacent elements
- MultiPolygon with polygon with hole overlapping shell and other polygon
- GeometryCollection with multiple invalid elements
- Line that collapses to a point
- Geometry with NaN ordinates
Issues
- In PostGIS the semantic is to preserve all input vertices if possible. To do this, collapsed edges in polygonal inputs may result in output containing LineStrings. Is this desirable? Perhaps should be an alternate mode?
Semantics
- Result geometry is always a new object
- Validity is not checked - input geometry is always processed
- A validity check can be performed by client, if needed
- Valid inputs will still have minor transformations applied
- Empty atomic geometries are valid and are returned unchanged (copied)
- Remove invalid coordinates (i.e. non-finite X or Y:
NaN,Inf,-Inf) - Remove repeated points in sequences
Point- keep valid coordinate, orEMPTYLineString- fix coordinate list. If left with 1 or 0 coordinates thenEMPTYLinearRing- fix coordinate list. If valid return as a ring, otherwise return as a LineString.Polygon- transform into a valid polygon, preserving as much of the extent and vertices as possibleMultiPolygon- fix each element, then ensure collection is non-overlapping (via union)GeometryCollection- fix each element
Polygon Repair
There are two approaches for repairing self-intersecting polygons. Both of them start by forming the fully noded arrangement of the polygon linework.
- the Even-Odd strategy treats every delimited region as a potential interior or exterior area, alternating inwards from the exterior.
- the Winding Number strategy uses the winding number of each delimited region to determine interior/exterior.
- this can be extended to use both winding number polarities to determine interior. This has the effect of identifying all "lobes" as interior areas.
See Subramaniam for further explanation.
The nominal shell and hole ring(s) are processed separately, and then the results are combined via a different operation to produce the final result.
A further refinement is to identify holes which are disconnected from (outside) the shell, and treat them separately as additional shells. This aligns with the above semantic of treating all linework lobes as polygons interiors.
Processing Options
The following semantic options may be provided as required.
- Empty elements in Collections:
- Option 1: (default) remove them
- Option 2: keep them
- Collapsed geometry (keepCollapsed)
- Option 1: (default) remove them. Result has same dimension as input.
- Option 2: convert to a representative lower-dimension geometry (note this may result in mixed collections)
- Single-element collections
- Option 1: (default) convert to atomic geometry
- Option 2: Keep result as a single-element collection
- LinearRing fixing
- An alternative is to fix linework, node and return all resulting rings.
- This can be done by converting ring to polygon, fixing polygon, and extracting rings
Prior Art
- PostGIS provides
ST_MakeValid - GEOS provide a
MakeValidcapability (although there are some issues with it) - ESRI Java Geometry API
makeSimple - Thesis by Subramaniam: PARTITION OF A NON-SIMPLE POLYGON INTO SIMPLE POLYGONS. Advocates using the Non-Zero Winding Number (NZW) approach, which is similar to the JTS algorithm defined here.
- Paper (paywalled): A general method for decomposing self-intersecting polygon to normal based on self-intersection points
- Boost Geometry Correct - implements the NZW approach in thesis above
Algorithm
Polygons
- For each Polygon ring
- Convert self-intersecting ring to a valid set of non-adjacent rings, using the Winding Number of enclosed regions
- This can be done by combining the results of two
buffer(0)operations using opposite choices for winding number polarity
- This can be done by combining the results of two
- Convert self-intersecting ring to a valid set of non-adjacent rings, using the Winding Number of enclosed regions
- Disconnected outside holes are converted to shells (and do not participate in subsequent hole removal)
- Union all repaired holes
- Difference holes from original shell Polygon
- Combine with converted outside holes
MultiPolygons
- Fix each element polygon
- Union fixed polygons
Maybe this paper is worth some consideration : https://agile-online.org/conference_paper/cds/agile_2012/proceedings/papers/paper_ledoux_automatically_repairing_invalid_polygons_with_a_constrained_triangulation_2012.pdf
That's a very interesting paper (especially because it talks about JTS and GEOS!).
However, I'm not convinced that a CDT is the best way to fix invalid polygons. One obvious issue is that it requires writing a CDT algorithm, which is a big undertaking. And it's not clear that the complexity provides any better results than the node-polygonize approach above (and likely has worse performance).
The examples of invalidity in the paper are useful. I wonder if that data is available?
Didn't JTS already include a CDT algorithm ? There is a repo on github with a few examples of invalid polygons: https://github.com/tudelft3d/prepair
The approach you propose seems not too far from OpenJUMP's MakeValidOp ;-). Seems that you consider that shell vs hole information is reliable. What if you get a Polygon with a hole enclosing the shell. Should the hole "erases" the polygon or should we consider that shell and hole must be swapped ?
Didn't JTS already include a CDT algorithm ?
Nope, it has Conforming DT, which is definitely not what is needed.
The whole CDT approach just seems way more complicated than needed, and I don't think it answers the hard questions about how to create valid polygons.
There is a repo on github with a few examples of invalid polygons: https://github.com/tudelft3d/prepair
Great, could be handy.
The approach you propose seems not too far from OpenJUMP's MakeValidOp ;-).
No doubt!
Seems that you consider that shell vs hole information is reliable. What if you get a Polygon with a hole enclosing the shell. Should the hole "erases" the polygon or should we consider that shell and hole must be swapped ?
Yes, that's a good example. Either could be considered "right". Although it does feel suspect to convert to an empty geometry - that throws away all the input information. I've added this case to the examples list. The approach is definitely subject to change.
There's a similar question about a hole surrounding another hole. Should the inner hole turn into an island, or disappear?
I have developed a polygon correct approach (remove self-intersection, correct order, correct inners), based on boost geometry: https://github.com/kleunen/boost_geometry_correct
The approach for calculating and removing self-intersection is vastly simpler and more efficient than existing libraries. It was developed for tilemaker: https://github.com/systemed/tilemaker
But can be used for other projects as well.
I have developed a polygon correct approach (remove self-intersection, correct order, correct inners), based on boost geometry: The approach for calculating and removing self-intersection is vastly simpler and more efficient than existing libraries.
Thanks, this looks interesting.
What is approach that is used now for removing self-intersection from polygons ? I looked through the code, but could not find an approch that is used.
What is approach that is used now for removing self-intersection from polygons ?
The buffer-by-zero heuristic is used to turn self-intersecting rings into valid geometry. Code is here. I this produces a similar semantic to the one advocated in the Boost blog post you referenced.
Ah yes. It is common and does work well in many practical cases. But in cases with holes. It will not generate the second polygon. The approach i took does generate these multipolygons.
But in cases with holes. It will not generate the second polygon. The approach i took does generate these multipolygons.
Not sure what you mean - do you have an example?
Note the buffer-by-zero call is just the lowest level of GeometryFixer, to fix invalid single rings. For Polygons with holes and MultiPolygons each ring is processed separately, and then recombined using difference and union.
@kleunen I think GeometryFixer handles the boost_geometry_correct examples in the same way:
POLYGON ((5 0, 2.5 9, 9.5 3.5, 0.5 3.5, 7.5 9, 5 0))

POLYGON ((55 10, 141 237, 249 23, 21 171, 252 169, 24 89, 266 73, 55 10))

POLYGON ((0 0, 10 0, 0 10, 10 10, 0 0, 5 0, 5 10, 0 10, 0 5, 10 5, 10 0, 0 0))

Yes that is the same. I still have to add rebuilding with difference and union. Boost geometry buffer does not allow buffer with 0. So maybe there is difference in implementation there.
Boost buffer add many new points. It seems jts buffer does not. Or is simplification run afterwards?
Boost geometry buffer does not allow buffer with 0. So maybe there is difference in implementation there.
Using buffer-by-zero was just a quick way to get this functionality rolled out (since it is available in JTS). You could duplicate that semantic by using a node-and-build-topology approach, of course. But then you have to implement the noding and topology graph building code, which is non-trivial.
In fact I would prefer the GeometryFixer implementation to not depend on buffer, since that introduces substantial complexity and might cause breakages at some point (although buffer has been stable for a long time). But that's a fairly major project, so not sure when it will get done.
Boost buffer add many new points. It seems jts buffer does not. Or is simplification run afterwards?
Do you mean repeated points? If so, that's a result of de-duplication in the buffer and overlay functions.
Here you can find discussion regarding dissolve/make_valid approach for tilemaker: https://github.com/systemed/tilemaker/issues/246
No, with buffer i get many additional points around the corners. To "round" them. For every corner on polygon you get 3 dots, maybe this is implementation of boost geometry. JTS implementation probably more efficient, generating less support points, because these additional points are not needed.
No, with buffer i get many additional points around the corners. To "round" them. For every corner on polygon you get 3 dots, maybe this is implementation of boost geometry. JTS implementation probably more efficient, generating less support points, because these additional points are not needed.
Yes, that must be Boost geometry buffer behaviour. In the JTS buffer algorithm, buffer distance = 0 is treated as a special case, and the original geometry is used as the offset curve with no other modification. (In contrast, a buffer distance >0 invokes several heuristics to simplify the input linework, and then generates buffer fillets and endcaps as per the input parameters).
@kleunen One question I have about your algorithm (which I presume also applies to the papers mentioned, although I haven't verified that): how robust is it? Numerical precision limitations can cause nodes computed in a line arrangement to "jump" across one another, which leads to inconsistent topology. JTS contains a lot of code to handle this situation (in particular, buffer checks for consistent noding and falls back to using snap-rounding if it occurs).
However, perhaps this is a case where the effects aren't so severe - although I would be surprised if that was the case.
Unfortunately it's hard to reproduce robustness issues, and they tend to manifest differently for each individual algorithm. Generally the only way to find them is by running many real-world examples. And to detect failure cases, you need an independent way to verify the correctness of computed outputs - which is usually not easy to create. For this fix-invalid-geometry situation, you would need a way not only to verify that the computed result is valid, but that it's valid in a sensible way, according to the desired semantics of the algorithm (since otherwise trivial or wildly malformed output that happened to be valid would be accepted).
There are no robustness analysis in the papers. The approach basicly reuses the existing points on the ring and add additional points on the ring on the intersections.
One thing i noticed you start traversing this ring and then follow the intersections to detect the non intersecting sub polygons. When doing this, you have to detect you are back at the starting point. For this i used a distance between the current point and the start point. If you do not detect this case, the algorithm keeps traversing the same ring infinitly.
https://github.com/kleunen/boost_geometry_correct/blob/main/correct.hpp#L230
Also, there is no proof that by generating these pseudo vertices you can follow the subrings and always end up at the starting point. It is like you says. It works in practice but it is not proven you can not end up in a subring which does not have the starting point. And thus start looping infinitly.
Maybe this will make the algorithm more robust. Instead of detecting you are returned at the starting point, you should stop following the ring if you return to a point you already visited.
Moreover duplicate points should be filteren out. But this is same for other algorithms.
In the case of multiple intersections at the same point, there is a choice to be made to which edge to follow, see below:

On the left case, I start at the intersection which was the first on the original ring, and then I follow the intersection which leads back the fastest to the end of the original ring. This leads back to the starting point.
On the right case, i take an edge to a different sub-ring. Getting stuck in this top-left ring, never being able to return to the original ring and starting point, without following a self-intersecting point.
There is no proof, why the first approach always works and why i do not need to perform an exhaustive search at an intersection to find out which route to take to get on the ring which leads back to the starting point. I think this is because if you look at where the intersections are on the position of the original ring, they are actually nested, the subrings.
Also note I calculate "where" the intersections are on the original edge of the ring: https://github.com/kleunen/boost_geometry_correct/blob/main/correct.hpp#L118-L119
This is used to order them such, that they are ordered according to how they appear on the original ring: https://github.com/kleunen/boost_geometry_correct/blob/main/correct.hpp#L58-L71
So possibly, numerical stability might cause two intersection to swap places, but I think this is not an issue, because they are ordered according to the comparable distance on the original edge.
Validating the output is indeed laborious. If the algorithms drops an invalid inner or all inners or returns an empty polygon or multi polygon. The geometry is valid, but clearly is not acceptable. But how many nodes and inners should be in the output? You can not really calculate and automatically test. But i guess you are very familiar with these difficulties.
So possibly, numerical stability might cause two intersection to swap places, but I think this is not an issue, because they are ordered according to the comparable distance on the original edge.
The core problem is that the interesection point of two nearly-collinear lines might be computed to be the opposite side of another line which crosses the two lines. This leads to the computed noded line arrangement still having non-noded intersections. The only solution I've found to handle this is to introducing snapping into the noding subsystem.
Validating the output is indeed laborious. If the algorithms drops an invalid inner or all inners or returns an empty polygon or multi polygon. The geometry is valid, but clearly is not acceptable. But how many nodes and inners should be in the output? You can not really calculate and automatically test. But i guess you are very familiar with these difficulties.
One way to validate is to use a "point probe" algorithm, which is completely different and simpler, and hence a good guarantee of correctness. This involves generating points near every section of input and output linework, and then determining whether they are correctly placed relative to the original linework. If the Non-Zero Winding number rule is used it should be possible to determine the winding number for a point and then confirm it has an appropriate value. This is compute-intensive but relatively straightforward - which computers are good at!
So possibly, numerical stability might cause two intersection to swap places, but I think this is not an issue, because they are ordered according to the comparable distance on the original edge.
The core problem is that the interesection point of two nearly-collinear lines might be computed to be the opposite side of another line which crosses the two lines. This leads to the computed noded line arrangement still having non-noded intersections. The only solution I've found to handle this is to introducing snapping into the noding subsystem.
Yes i think the approach in boost geometry is to simply ignore this. Going the approach that double float or even int have a limit on precision and near this limit issues might occur. The user can always pre snap the geometry before applying a make valid if he/she needs this precision.
But i guess that is a matter of preference. This can occur clearly, because the intersection detection is needed. So if you need this to work correctly, you can pre-snap the nodes. But once the intersections are detected, you only need to detect if you are returned to the starting point when walking the sub-ring. For this, you can use a threshold as well.
The NZW or odd-even can be configured/adjusted. The inner rings are detected, i just convert the orientation before performing the union. The ring detection generates three rings, which need to be combined:
1 = CW, 2 = CW, 3 = CCW
I now apply NZW:


i have an multipolygon, wkt like this : MULTIPOLYGON (((104.84831240300008 26.577005471000064, 104.84889922500008 26.57747835400005, 104.84948523000008 26.576550350000048, 104.84877030700005 26.57614817800004, 104.84817763600006 26.577031617000046, 104.84889922500008 26.57747835400005, 104.84831240300008 26.577005471000064, 104.84886598700007 26.57734819600006, 104.84935045000009 26.57658099500003, 104.84880356000008 26.576273349000076, 104.84831240300008 26.577005471000064)))

when i use GeometryFixer, it bacame to this : POLYGON ((104.84889922500008 26.57747835400005, 104.84948523000008 26.576550350000048, 104.84877030700005 26.57614817800004, 104.84817763600006 26.577031617000046, 104.84889922500008 26.57747835400005))

while use postgis, it bacame to this:
GEOMETRYCOLLECTION (POLYGON ((104.84948523000008 26.576550350000048, 104.84877030700005 26.57614817800004, 104.84817763600006 26.577031617000046, 104.84889922500008 26.57747835400005, 104.84948523000008 26.576550350000048), (104.84880356000008 26.576273349000076, 104.84935045000009 26.57658099500003, 104.84886598700007 26.57734819600006, 104.84831240300008 26.577005471000064, 104.84880356000008 26.576273349000076)), LINESTRING (104.84831240300008 26.577005471000064, 104.84889922500008 26.57747835400005))

i saw this issue: In PostGIS the semantic is to preserve all input vertices if possible. To do this, collapsed edges in polygonal inputs may result in output containing LineStrings. Is this desirable? Perhaps should be an alternate mode?
for me , i think the one which postgis maked is correct. and an alternate mode will make it more adaptable in a lot of scenarios.