Use tighter bounds and faster filters in orientation and isInCircle predicates
This PR proposes changes to the 2D orientation and incircle predicates. The orientation predicate is changed to use a tighter error bound coefficient (which should reduce the number of calls to the slower, more robust computation for degenerate inputs) and a computation of the error bound that produces fewer branches. The error bound is due to https://doi.org/10.1007/s10543-015-0574-9 ; the paper also discusses how fewer branches improve the predicate's performance and proves why abs(detleft + detright) works here: if the signs are the same, it equals abs(detleft) + abs(detright); if the signs of detleft and detright differ, then det can never be smaller in absolute value than the error bound anyway.
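For illustration, a minimal sketch of the filter structure described above, assuming a hypothetical robust fallback orientationExact; the coefficient shown is Shewchuk's classic one, while the PR uses the slightly tighter value derived in the paper:

#include <cmath>

// Hypothetical robust fallback (e.g. an adaptive exact evaluation).
int orientationExact(double ax, double ay, double bx, double by, double cx, double cy);

int orientationFiltered(double ax, double ay, double bx, double by, double cx, double cy)
{
    const double detleft  = (ax - cx) * (by - cy);
    const double detright = (ay - cy) * (bx - cx);
    const double det = detleft - detright;

    // One fabs() instead of several sign tests: if detleft and detright
    // share a sign, |detleft + detright| = |detleft| + |detright|; if the
    // signs differ, |det| cannot fall below the bound anyway.
    constexpr double EPS = 1.1102230246251565e-16;         // 2^-53
    constexpr double ERRBOUND = (3.0 + 16.0 * EPS) * EPS;  // Shewchuk's coefficient
    const double errbound = ERRBOUND * std::fabs(detleft + detright);

    if (det >  errbound) return  1;  // counter-clockwise
    if (det < -errbound) return -1;  // clockwise
    return orientationExact(ax, ay, bx, by, cx, cy);  // filter failed
}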
For the incircle predicate, this PR proposes to stop using long double in general (which increases precision a little but is not robust, as the code notes) and to use a similar error bound filter instead (bound computation based on https://doi.org/10.1007/s10543-023-00975-x ; I don't know the original source for the alternative expression of the incircle determinant, I first saw it in CGAL). The rationale is higher performance (roughly 4-10% speedup on my machine for perf_voronoi), and that the more robust computation here is probably motivated by preventing infinite loops in Delaunay triangulation due to false cases of INTERIOR, which should be impossible with this version (except possibly for inputs near the lower end of the representable magnitude of double that produce denormalized results/underflow).
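As a rough illustration of the filter shape, a sketch of the classic Shewchuk-style semi-static variant (not the tighter bound from the paper above, nor the alternative determinant expression from CGAL; isInCircleExact is a hypothetical robust fallback):

#include <cmath>

// Hypothetical robust fallback for the rare filter failures.
int isInCircleExact(double ax, double ay, double bx, double by,
                    double cx, double cy, double px, double py);

int isInCircleFiltered(double ax, double ay, double bx, double by,
                       double cx, double cy, double px, double py)
{
    // Translate so the query point is the origin, then evaluate the
    // 3x3 lifted determinant.
    const double adx = ax - px, ady = ay - py;
    const double bdx = bx - px, bdy = by - py;
    const double cdx = cx - px, cdy = cy - py;

    const double alift = adx * adx + ady * ady;
    const double blift = bdx * bdx + bdy * bdy;
    const double clift = cdx * cdx + cdy * cdy;

    const double bxcy = bdx * cdy, cxby = cdx * bdy;
    const double cxay = cdx * ady, axcy = adx * cdy;
    const double axby = adx * bdy, bxay = bdx * ady;

    const double det = alift * (bxcy - cxby)
                     + blift * (cxay - axcy)
                     + clift * (axby - bxay);

    // The error bound scales with the same expression evaluated on
    // absolute values (the "permanent").
    const double permanent = alift * (std::fabs(bxcy) + std::fabs(cxby))
                           + blift * (std::fabs(cxay) + std::fabs(axcy))
                           + clift * (std::fabs(axby) + std::fabs(bxay));

    constexpr double EPS = 1.1102230246251565e-16;          // 2^-53
    constexpr double ERRBOUND = (10.0 + 96.0 * EPS) * EPS;  // Shewchuk's static coefficient
    const double errbound = ERRBOUND * permanent;

    if (det >  errbound) return  1;  // interior (for CCW-oriented a, b, c)
    if (det < -errbound) return -1;  // exterior
    return isInCircleExact(ax, ay, bx, by, cx, cy, px, py);  // near-boundary case
}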
PS: I also have an implementation of exact predicates that I could port to geos, to be run after failures of the fast filters, if the maintainers would consider that beneficial to the library. I don't know enough about its use cases to make a guess about that. Exact predicates guarantee consistency with respect to predicate calls. They would not guarantee consistency between predicates and computed geometries (e.g. a predicate might say some polygons overlap, but the computed intersection may then be empty after rounding to double coordinates in the output geometry, even if all intermediate computations were exact).
I don't have a voronoi test yet, but my performance suite shows no differences, with one exception: the prepared point-in-poly test seems to be consistently about 10% faster.
@pramsey Thanks for the feedback. Interesting, point-in-poly may benefit from a faster orientation, but I would have expected that to be negligible: if it's implemented with some ray counting, I'd expect runtime to be dominated by walking over segments that don't intersect the ray and need no orientation check (at least for polygons with many segments). I will try to look into that when I find time.
I would expect the change in the orientation test to be measurable in, e.g., convex hull or triangulation without the Delaunay requirement, and in synthetic benchmarks (I ran perf_orientation from the benchmark directory and compared the Iterations counts, since the ns timing seems too coarse for such a micro-benchmark; I think running it inside a larger algorithm is a more meaningful benchmark). I would expect the change in the incircle test to be measurable in benchmarks like Delaunay triangulation or Voronoi diagrams (I ran perf_voronoi from the benchmark directory in the geos repository with release config for the numbers in the opening post). The incircle change may matter very significantly on non-x86 architectures where no hardware-accelerated 80-bit floats are available and long double compiles to function calls into some soft-float implementation, e.g. https://godbolt.org/z/EosY5ehjq .
I hooked a Delaunay build into my test harness but couldn't measure any difference between this PR and 3.13.
I tested as follows, using the Delaunay/Voronoi benchmarks in the benchmark folder from the main repo: I checked out the main branch, went into the directory, and ran:
cd geos
mkdir build_gcc && cd build_gcc
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_BENCHMARKS=ON -DCMAKE_CXX_FLAGS="-march=native" ..
make # will complain about a missing override in another benchmark, but that doesn't affect the build of perf_voronoi
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo #only on intel
cpupower frequency-set -g performance #only on intel
for i in {1..10}; do taskset -c 2 ./bin/perf_voronoi >> baseline; done;
grep "BM_DelaunayFromSeq/1000000" baseline
BM_DelaunayFromSeq/1000000 5050644908 ns 5045715175 ns 1
BM_DelaunayFromSeq/1000000 5049845175 ns 5045245900 ns 1
BM_DelaunayFromSeq/1000000 6646025308 ns 6638513791 ns 1
BM_DelaunayFromSeq/1000000 6649783153 ns 6642201348 ns 1
BM_DelaunayFromSeq/1000000 6648571528 ns 6641021393 ns 1
BM_DelaunayFromSeq/1000000 5053053815 ns 5048554170 ns 1
BM_DelaunayFromSeq/1000000 5049133353 ns 5044510485 ns 1
BM_DelaunayFromSeq/1000000 5052486875 ns 5047828776 ns 1
BM_DelaunayFromSeq/1000000 5050604032 ns 5046051328 ns 1
BM_DelaunayFromSeq/1000000 5051929265 ns 5047293562 ns 1
#checkout the branch with this patch, same steps, then
for i in {1..10}; do taskset -c 2 ./bin/perf_voronoi >> patched; done;
grep "BM_DelaunayFromSeq/1000000" patched
BM_DelaunayFromSeq/1000000 4659588868 ns 4654149493 ns 1
BM_DelaunayFromSeq/1000000 4654536370 ns 4649391288 ns 1
BM_DelaunayFromSeq/1000000 4653135324 ns 4648135157 ns 1
BM_DelaunayFromSeq/1000000 6127312010 ns 6119523080 ns 1
BM_DelaunayFromSeq/1000000 6125711504 ns 6118105844 ns 1
BM_DelaunayFromSeq/1000000 4655963070 ns 4651342713 ns 1
BM_DelaunayFromSeq/1000000 4652838792 ns 4648184104 ns 1
BM_DelaunayFromSeq/1000000 4654345257 ns 4649771922 ns 1
BM_DelaunayFromSeq/1000000 4666023790 ns 4661365412 ns 1
BM_DelaunayFromSeq/1000000 4653681302 ns 4649277328 ns 1
I honestly wish I knew where the outliers come from with turbo disabled; maybe the P-core/E-core split (it's a mobile i7-12800H) can affect it despite pinning the CPU core with taskset. The same steps with CXX=/usr/bin/clang++ before the cmake command yield for me:
grep "BM_DelaunayFromSeq/1000000" baseline
BM_DelaunayFromSeq/1000000 4329365684 ns 4323526473 ns 1
BM_DelaunayFromSeq/1000000 5433784294 ns 5427474646 ns 1
BM_DelaunayFromSeq/1000000 4148863139 ns 4144892923 ns 1
BM_DelaunayFromSeq/1000000 4139543832 ns 4135714625 ns 1
BM_DelaunayFromSeq/1000000 5428346793 ns 5422042258 ns 1
BM_DelaunayFromSeq/1000000 5439176317 ns 5432822740 ns 1
BM_DelaunayFromSeq/1000000 5432680958 ns 5426345764 ns 1
BM_DelaunayFromSeq/1000000 4140026534 ns 4136116714 ns 1
BM_DelaunayFromSeq/1000000 4149153847 ns 4145284595 ns 1
BM_DelaunayFromSeq/1000000 5431052965 ns 5424639386 ns 1
grep "BM_DelaunayFromSeq/1000000" patched
BM_DelaunayFromSeq/1000000 3767896479 ns 3764021258 ns 1
BM_DelaunayFromSeq/1000000 4923848243 ns 4917679386 ns 1
BM_DelaunayFromSeq/1000000 3764235827 ns 3760602930 ns 1
BM_DelaunayFromSeq/1000000 3762541532 ns 3758951324 ns 1
BM_DelaunayFromSeq/1000000 3765034916 ns 3761276158 ns 1
BM_DelaunayFromSeq/1000000 3764714564 ns 3761178183 ns 1
BM_DelaunayFromSeq/1000000 4928664355 ns 4922710265 ns 1
BM_DelaunayFromSeq/1000000 4922105205 ns 4916260250 ns 1
BM_DelaunayFromSeq/1000000 3767456227 ns 3763747766 ns 1
BM_DelaunayFromSeq/1000000 3760288885 ns 3756814473 ns 1
Similar results for the other workloads.
And on an EPYC Milan I don't see these outlier timings, so I'll post just the first sample for each workload:
grep 100000 baseline | head -n5
BM_DelaunayFromSeq/1000000 4916505592 ns 4916348118 ns 1
BM_DelaunayFromGeom/1000000 4671339867 ns 4671204309 ns 1
BM_VoronoiFromSeq/1000000 5506559255 ns 5506240288 ns 1
BM_VoronoiFromGeom/1000000 5435366604 ns 5435120888 ns 1
BM_OrderedVoronoiFromGeom/1000000 7680563331 ns 7680220398 ns 1
grep 100000 patched | head -n5
BM_DelaunayFromSeq/1000000 4427675159 ns 4427496208 ns 1
BM_DelaunayFromGeom/1000000 4462223273 ns 4461665059 ns 1
BM_VoronoiFromSeq/1000000 5045797628 ns 5045404499 ns 1
BM_VoronoiFromGeom/1000000 4988454644 ns 4987171839 ns 1
BM_OrderedVoronoiFromGeom/1000000 7392607964 ns 7392126698 ns 1
Thanks for the nice writeup of your perf_* results. I did the same tests on my MacBook Pro and I'm afraid I see no consistent advantage for this particular optimization.
./bin/perf_voronoi --benchmark_filter=1000000 --benchmark_repetitions=10 > baseline
grep mean baseline
BM_DelaunayFromSeq/1000000_mean 1885685129 ns 1885561400 ns 10
BM_DelaunayFromGeom/1000000_mean 1859082321 ns 1858690300 ns 10
BM_VoronoiFromSeq/1000000_mean 2238724242 ns 2238228000 ns 10
BM_VoronoiFromGeom/1000000_mean 2254024629 ns 2253934700 ns 10
BM_OrderedVoronoiFromGeom/1000000_mean 2975758075 ns 2975618100 ns 10
./bin/perf_voronoi --benchmark_filter=1000000 --benchmark_repetitions=10 > patched
grep mean patched
BM_DelaunayFromSeq/1000000_mean 1906208617 ns 1906192700 ns 10
BM_DelaunayFromGeom/1000000_mean 1914398154 ns 1913137200 ns 10
BM_VoronoiFromSeq/1000000_mean 2288991796 ns 2288465000 ns 10
BM_VoronoiFromGeom/1000000_mean 2295409392 ns 2294846200 ns 10
BM_OrderedVoronoiFromGeom/1000000_mean 2993584858 ns 2993345900 ns 10
Would it make sense to gate this behind something like
#if defined(__x86_64__) || defined(__x86_64) || defined(__amd64__) || defined(__amd64) || defined(_M_X64) || defined(_M_AMD64)
# define ARCHITECTURE_X86_64
#endif
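Such a gate could then, for example, select the extended-precision path only where it is hardware-backed; a hypothetical sketch (the dd_real alias is made up for illustration):

#if defined(ARCHITECTURE_X86_64)
using dd_real = long double;  // 80-bit x87 extended precision, hardware-backed
#else
using dd_real = double;       // avoid soft-float long double elsewhere
#endif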
@pramsey Thanks for getting back to me on this.
The test results are interesting. I do not have a Mac to test on, but using clang's target option I find that long double seems to default to 8 bytes there, i.e. the same as double:
echo "int s = sizeof(long double);" | clang++ -x c++ -S - -O3 -target aarch64-apple-darwin -o - | grep '.long'
.long 8 ; 0x8
So my patch would not benefit the incircle predicate's performance there by switching down to double; instead it will cost a little performance due to the slightly more expensive error bound computation (visible in your benchmark, seems to be ~1-2%).
On the other hand, for architectures that already compile the current code to 64-bit precision/double, this also means that the current code is a lot less robust. If I generate 100k degenerate cases, the current code produces an opposing result (interior when the truth is exterior, or the other way around) in 8 cases with 80-bit precision, but in ~14k cases with 64-bit precision. All code for the test is in this gist: https://gist.github.com/tinko92/86126244ed04b65aeb09c9223ec4c37c#file-test-sh (note that this test code exclusively produces degenerate cases where the test point is near the circle boundary, up to roughly machine precision).
The patched code will never confuse interior and exterior, regardless of the availability of 80-bit precision doubles, and will always produce either the correct sign or BOUNDARY. If you run that gist code on your MacBook, I would expect you to see many results in which the 2nd and 3rd columns show opposite signs.
So overall my opinion would be not to gate it. The architectures that lack 80-bit long doubles suffer a ~2% performance hit but gain a lot of robustness (with an understanding of robustness where "boundary" is a safe result while confusing exterior and interior is bad), and the architectures that have 80-bit long doubles should benefit somewhat performance-wise and still a little robustness-wise (the guarantee of not confusing exterior and interior).
This understanding of robustness is based on the idea that confusing interior and exterior could lead to infinite loops in triangle flipping for Delaunay triangulation and should be avoided for that reason. But I don't know the specifics of geos' Delaunay triangulation algorithm or the general motivation behind having an incircle predicate with increased robustness in geos in the first place.
Edit: Discussion of orientation, now in #1184, removed.
It would be better if this was split into 2 PRs, one for each algorithm. This would separate the different discussions, tests, and changes.
That makes sense. Since most of the discussion is around the incircle predicate, I'll split out the orientation predicate into a separate PR.
Edit: If there is interest in making this method robust, as its name suggests, I can add a commit with a slightly transformed version of Shewchuk's public domain implementation, where all defines are changed to constexpr functions: https://gist.github.com/tinko92/8229b4d90ed0dddc410ec22ce5d75b2f . This would guarantee correct predicate results for every input that does not cause over- or underflow (so anything with coordinates whose absolute values are roughly in the [1e-75, 1e75] range, plus {-0., 0.}). It would only be executed in case of filter failure and therefore not contribute to runtime for non-degenerate inputs.
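For a flavor of that transformation: the error-free building blocks that Shewchuk writes as multi-statement macros (Fast_Two_Sum, Two_Sum, ...) become constexpr functions returning the rounded result together with the roundoff term. A minimal sketch (the struct name is made up; the gist has the full set):

struct TwoDouble { double x, y; };  // hypothetical (sum, roundoff) pair

// Requires |a| >= |b|; then x + y == a + b exactly.
constexpr TwoDouble fast_two_sum(double a, double b)
{
    const double x = a + b;
    const double bvirt = x - a;
    return {x, b - bvirt};
}

// No magnitude precondition (Knuth's TwoSum).
constexpr TwoDouble two_sum(double a, double b)
{
    const double x = a + b;
    const double bvirt = x - a;
    const double avirt = x - bvirt;
    const double bround = b - bvirt;
    const double around = a - avirt;
    return {x, around + bround};
}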
Edit 2: After reading more of the docs, I understand that I should probably propose that to JTS first; this may take some time, though, to port to Java.
Recently I got a notebook with a Snapdragon X Elite, the CPU whose variants power the new Windows-on-ARM laptops. On WSL for this architecture (aarch64), long double is compiled to a 128-bit type by default:
echo "int s = sizeof(long double);" | g++ -x c++ -S - -O3 -march=native -o - | grep '.word'
.word 16
and clearly soft-float:
echo "long double f(long double a, long double b) { return a + b; }" | clang++ -x c++ -S - -O3 -march=native -o -
//...
.cfi_startproc
// %bb.0:
b __addtf3
.Lfunc_end0:
//...
(__addtf3 being the name of a runtime helper add function for floating-point emulation of long double).
For this platform the performance differences are therefore very significant:
grep 1000000 baseline | head -n10
BM_DelaunayFromSeq/1000000 1.5618e+10 ns 1.5570e+10 ns 1
BM_DelaunayFromGeom/1000000 1.5766e+10 ns 1.5699e+10 ns 1
BM_VoronoiFromSeq/1000000 1.1918e+10 ns 1.1901e+10 ns 1
BM_VoronoiFromGeom/1000000 1.1941e+10 ns 1.1923e+10 ns 1
BM_OrderedVoronoiFromGeom/1000000 1.3377e+10 ns 1.3353e+10 ns 1
BM_DelaunayFromSeq/1000000 1.5252e+10 ns 1.5206e+10 ns 1
BM_DelaunayFromGeom/1000000 1.4999e+10 ns 1.4927e+10 ns 1
BM_VoronoiFromSeq/1000000 1.1669e+10 ns 1.1649e+10 ns 1
BM_VoronoiFromGeom/1000000 1.1558e+10 ns 1.1542e+10 ns 1
BM_OrderedVoronoiFromGeom/1000000 1.2732e+10 ns 1.2714e+10 ns 1
grep 1000000 patched | head -n10
BM_DelaunayFromSeq/1000000 3041869000 ns 3030840500 ns 1
BM_DelaunayFromGeom/1000000 2619609077 ns 2616074300 ns 1
BM_VoronoiFromSeq/1000000 3260589305 ns 3254602500 ns 1
BM_VoronoiFromGeom/1000000 3040886840 ns 3035463600 ns 1
BM_OrderedVoronoiFromGeom/1000000 4423922233 ns 4415210600 ns 1
BM_DelaunayFromSeq/1000000 3048331830 ns 3016170200 ns 1
BM_DelaunayFromGeom/1000000 2988821373 ns 2931008200 ns 1
BM_VoronoiFromSeq/1000000 3242640471 ns 3236409000 ns 1
BM_VoronoiFromGeom/1000000 3035475650 ns 3030037400 ns 1
BM_OrderedVoronoiFromGeom/1000000 4422259306 ns 4413752800 ns 1
It is expected that soft-floats are much slower than hardware floats.
So overall, the effect of this patch on performance is:
- 1-2% slower on ARM Mac, because double and long double are the same type on that platform and the added error bound computation is extra work.
- 4-10% faster on x86-64, which uses 80-bit long doubles that are hardware-accelerated through the legacy FPU features but generate slightly less performant assembly because they do not interact well with the modern instruction sets.
- ~3 times faster with the default settings on the new Snapdragon X ARM CPUs, where clang and gcc generate soft-float code for long double.
The effect on portability is that this patched version will produce the same predicate results on all of the above platforms, because 64-bit double behaves the same on all of them, while the current code can produce different results on each of the above platforms, because long double means something different on each of them.
The effect on behaviour is that, unlike the current main code, this predicate will never produce the opposite of the true predicate result on any platform for degenerate cases, but it will produce more 0-results (the point is found to be neither inside nor outside the disc but exactly on the boundary circle) for degenerate cases than the current main code.
This patch can be a step towards a filtered, exact predicate if an exact in-circle calculation (like the one proposed in a currently open PR for JTS, https://github.com/locationtech/jts/pull/1094 ) is added later.
OK, I'm convinced. Could you throw some commentary into the patch with the references you are using and maybe a brief description? The magic constant in particular needs a reference. And a link to this PR.
I added comments with a link to this PR, a permalink (which is unfortunately long) to the line in CGAL that uses this alternative incircle expression, and links to the paper and the accompanying Zenodo repository that the error bound computation is based on. The Zenodo repo contains a Docker image with a Jupyter notebook where the general error bound generation for arbitrary expressions is shown.
Edit: I now noticed the single check failure. Is that a heuristic check that can be ignored, or does it require another update over the previous commit? I'm asking because I only added comments and am not familiar with the codecov tool.