
Investigate libnfporb performance

Open intractabilis opened this issue 2 years ago • 36 comments

The orbital approach is expected to be very efficient; at least, that is the impression I got after reading the papers. However, in my experiments it is at least two orders of magnitude slower than CGAL's reduced convolution method. It would be nice to investigate whether the slowness is a defect of the algorithm or something we can improve in the implementation.

I've sent the test data privately. It consists of 36 polygons numbered from 0 to 35. On the polygon pair (2, 18) libnfporb hangs indefinitely. The following pairs: (3, 14), (3, 16), (3, 17), (8, 17), (12, 14), (12, 16), (12, 17), (15, 20), (21, 24), (24, 25), and (28, 30) end up throwing an exception with the message "Unable to complete outer nfp loop: 1".

Here is the performance comparison on pairs where libnfporb is the slowest compared to CGAL:

Run on (16 X 3874.93 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 24576 KiB (x1)
Load Average: 0.93, 0.71, 0.68
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27    4523011 ns      4520737 ns          155
benchmark_cgal/24/31    2876381 ns      2874814 ns          244
benchmark_cgal/24/26    1987337 ns      1986189 ns          352
benchmark_cgal/24/33    2306633 ns      2305401 ns          297
benchmark_cgal/23/25    1619837 ns      1619050 ns          430
benchmark_orb/24/27  5694047455 ns   5692760057 ns            1
benchmark_orb/24/31  3049576536 ns   3048878609 ns            1
benchmark_orb/24/26  1877847975 ns   1877413520 ns            1
benchmark_orb/24/33  2259959241 ns   2259457033 ns            1
benchmark_orb/23/25  1518002236 ns   1517642500 ns            1

intractabilis avatar Jun 02 '22 03:06 intractabilis

It's due to several defects in this much-cited paper. For example, they propose to use the lowest point of polygon B as a reference point... but what if there are two or more equally low points? And that's just one example that breaks their algorithm. I also think I know why they were convinced they had a good solution while it breaks on so many of the tests I created: they used randomly generated data for testing, and randomly generated data has an extremely low chance of forming parallels between polygon A and B. That is what it comes down to (the first example is also such a case): parallel edges. They just didn't consider them. Anyway, libnfporb is a collection of workarounds that make the orbiting approach work roughly as proposed. The bad news: that makes libnfporb slow. The good news: it all comes down to one function (trimVector) that consumes virtually all the CPU time in my tests. But I haven't checked your results yet.

kallaballa avatar Jun 02 '22 06:06 kallaballa

Out of curiosity, why are you looking for a NFP implementation? Can you talk about it?

kallaballa avatar Jun 04 '22 08:06 kallaballa

The orbital approach is expected to be very efficient, at least this is the impression I got after reading papers. However, in my experiments it is at least 2 orders of magnitude slower than CGAL's reduced convolution method. […]

Excellent work. I get the same results, and it makes sense given that libnfporb is stable but, as I said, has many workarounds for shortcomings of the mentioned paper.

kallaballa avatar Jun 04 '22 08:06 kallaballa

6 seconds still sounds a bit excessive. I ran a microarchitecture benchmark: your code issues 259,000,000 floating-point instructions for polygons 24 and 27. That much floating-point work is enough to compute the FFT of a 512x512 image 16 times.

intractabilis avatar Jun 04 '22 19:06 intractabilis

I don't know what to say except that I am glad I could get it to work despite the defects in the paper. Anyway, I am very interested in making it faster, but I am not aware of any low-hanging fruit. As simple as it seems to orbit one polygon around another (well, there are also holes, interlocks, etc.), it is really, really hard to get right. :) Btw., the majority of the CPU time is spent in boost::geometry functions (intersect, overlap, etc.).

kallaballa avatar Jun 04 '22 20:06 kallaballa

Have you tried contacting the authors for comments? Emails are in the paper. Maybe they have some secret sauce.

intractabilis avatar Jun 04 '22 20:06 intractabilis

Yep. No answer.

kallaballa avatar Jun 04 '22 20:06 kallaballa

Well, I have to thank you for the great amount of work you've done to show this algorithm is impractical. The authors' silence probably means they know very well about the problems with their paper. If not for your effort, I would most likely have wasted my time implementing it.

intractabilis avatar Jun 05 '22 00:06 intractabilis

In my opinion there is no serious science without open source, especially in computer science, where it would mean no more bullshitting. I know there are factors in play, like software patents, that make the matter more complicated than that. But what is more important: software patents, or a trustworthy science? Anyway, I'm glad that my effort is helpful. Also, I was rewarded with two citations, in a bachelor's and in a master's thesis that used libnfporb. And if that isn't ironic enough, I don't even have an A-level. Thank you for your understanding.

kallaballa avatar Jun 05 '22 01:06 kallaballa

I get the same results, and it makes sense given that libnfporb is stable but, as I said, has many workarounds for shortcomings of the mentioned paper.

Uhm. I just wanted to profile 24/27 using perf (via the "profile" build target of the libnfporb Makefile), but first I tested with the release build and realized that it only took 2 s for one iteration, which is remarkable since my machine is way less powerful than yours (as you'll see in the report below). Anyway, when I run your benchmark suite I get the following result:

Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 0.43, 0.51, 0.36
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27   61367284 ns     61366863 ns           10
benchmark_cgal/24/31   35310071 ns     35307597 ns           20
benchmark_cgal/24/26   24239106 ns     24235630 ns           29
benchmark_cgal/24/33   30250984 ns     30250819 ns           23
benchmark_cgal/23/25   19487795 ns     19486422 ns           36
benchmark_orb/24/27  49712943159 ns   49711742870 ns            1
benchmark_orb/24/31  31513661714 ns   31510963354 ns            1
benchmark_orb/24/26  18847734019 ns   18847256122 ns            1
benchmark_orb/24/33  22661594322 ns   22660991906 ns            1
benchmark_orb/23/25  15026137744 ns   15025408670 ns            1

49 s for one iteration of 24/27. At the moment I am looking through the build system of the benchmark suite, and I found the following compile command:

c++ -Infp-orb-perf-check.p -I. -I.. -I../include -I/usr/local/include -I/usr/include -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=gnu++20 -O0 -g -DFMT_LOCALE -DFMT_SHARED -DCGAL_USE_GMPXX=1 -DBOOST_FILESYSTEM_DYN_LINK=1 -DBOOST_ALL_NO_LIB -DDEBUG -march=native -mtune=native -Wno-deprecated-enum-enum-conversion -ffunction-sections -fdata-sections -fPIC -MD -MQ nfp-orb-perf-check.p/nfp-orb-perf-check.cpp.o -MF nfp-orb-perf-check.p/nfp-orb-perf-check.cpp.o.d -o nfp-orb-perf-check.p/nfp-orb-perf-check.cpp.o -c ../nfp-orb-perf-check.cpp

Which made me realize that the default build type is debug. So I did a release build, with the following result:

2022-06-05T10:49:20+02:00
Running build/nfp-orb-perf-check
Run on (8 X 4400 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 0.42, 0.52, 0.47
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27    4891004 ns      4891003 ns          136
benchmark_cgal/24/31    3130488 ns      3128706 ns          225
benchmark_cgal/24/26    2172407 ns      2172363 ns          322
benchmark_cgal/24/33    2533278 ns      2533082 ns          276
benchmark_cgal/23/25    1806403 ns      1806375 ns          355
benchmark_orb/24/27  3823282100 ns   3823111325 ns            1
benchmark_orb/24/31  2050453801 ns   2050392766 ns            1
benchmark_orb/24/26  1274274826 ns   1274209477 ns            1
benchmark_orb/24/33  1539979505 ns   1539962289 ns            1
benchmark_orb/23/25  1037740713 ns   1037673723 ns            1

which is still slower than 2 s.

So I built the benchmark program by hand:

g++ -s -pthread -fno-strict-aliasing -L/usr/local/lib -I/usr/local/include -lboost_filesystem -lboost_system -lbenchmark -lgmp -lmpfr -std=c++20 -march=native -I../../src -DNDEBUG -g0 -Ofast -flto -o nfp-orb-perf-check ../nfp-orb-perf-check.cpp
and got the following result:
2022-06-05T11:35:43+02:00
Running ./build/nfp-orb-perf-check
Run on (8 X 1601.28 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 0.59, 0.59, 0.60
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27    4862135 ns      4862111 ns          145
benchmark_cgal/24/31    3097166 ns      3097099 ns          226
benchmark_cgal/24/26    2139914 ns      2139891 ns          327
benchmark_cgal/24/33    2501373 ns      2501171 ns          279
benchmark_cgal/23/25    1745026 ns      1745017 ns          400
benchmark_orb/24/27  2869975098 ns   2869954309 ns            1
benchmark_orb/24/31  1600250962 ns   1600240966 ns            1
benchmark_orb/24/26  1104291322 ns   1104281761 ns            1
benchmark_orb/24/33  1339996282 ns   1339987384 ns            1
benchmark_orb/23/25   895057826 ns    895049078 ns            1

Better, but still not as fast as using libnfporb/examples/nfp. Anyway, I can accept that difference for the moment because it isn't that much. Could you try doing a manual build of the latest libnfporb revision like I did and redo the benchmark?

In the meantime I'll go back to the code and see if I can squeeze out an order of magnitude or two :)

kallaballa avatar Jun 05 '22 09:06 kallaballa

For the record, I am using "g++ (SUSE Linux) 12.1.0".

kallaballa avatar Jun 05 '22 09:06 kallaballa

I found an unnecessary call and now look at that (current HEAD):

2022-06-05T12:37:51+02:00
Running ./build/nfp-orb-perf-check
Run on (8 X 1427.49 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
Load Average: 0.52, 0.60, 0.58
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
benchmark_cgal/24/27    4783350 ns      4780879 ns          148
benchmark_cgal/24/31    3035116 ns      3035073 ns          231
benchmark_cgal/24/26    2101172 ns      2101139 ns          331
benchmark_cgal/24/33    2455099 ns      2454937 ns          286
benchmark_cgal/23/25    1717539 ns      1717502 ns          407
benchmark_orb/24/27   445997757 ns    445992804 ns            2
benchmark_orb/24/31   267225518 ns    267222110 ns            3
benchmark_orb/24/26   195266817 ns    195265388 ns            4
benchmark_orb/24/33   237397545 ns    237396646 ns            3
benchmark_orb/23/25   165946719 ns    165945385 ns            4

kallaballa avatar Jun 05 '22 10:06 kallaballa

Also, I looked into what the Minkowski-sum approach you are using is actually doing and found out that the benchmark isn't fair by far, because it can't account for perfect fits, antennas, and interlocks, which libnfporb handles.

https://stackoverflow.com/questions/52350747/no-fit-polygon-problem-with-cgal-minkowski-sum-2

kallaballa avatar Jun 05 '22 11:06 kallaballa

Also, I looked into what the Minkowski-sum approach you are using is actually doing and found out that the benchmark isn't fair by far, because it can't account for perfect fits, antennas, and interlocks, which libnfporb handles.

https://stackoverflow.com/questions/52350747/no-fit-polygon-problem-with-cgal-minkowski-sum-2

libnfporb doesn't work on the example mentioned in the post:

    auto Aorb = libnfporb::polygon_t{
        {{0, 0}, {10, 0}, {10, 10}, {8, 10}, {8, 2}, {2, 2}, {2, 10}, {0, 10}}, {}
    };
    auto Borb = libnfporb::polygon_t{
        {{0, 0}, {6, 0}, {6, 6}, {4, 6}, {4, 2}, {2, 2}, {2, 6}, {0, 6}}, {}
    };
    auto Corb = libnfporb::generate_nfp(Aorb, Borb, true);

It stops with

../../include/libnfporb.hpp:78: void libnfporb::remove_co_linear(boost::geometry::model::polygon<point_t, false, true>::ring_type&): Assertion `r.size() > 2' failed

Even if you fix this, it doesn't work on examples that aren't exotic at all. It fails (probably intermittently) on at least 12 pairs of the polygons I sent you. See my initial post.

intractabilis avatar Jun 05 '22 14:06 intractabilis

I found an unnecessary call and now look at that (current HEAD):


That's still a two-orders-of-magnitude difference. However, a tenfold gain is impressive. Maybe two more unnecessary calls, and we will be on par with CGAL? :)

intractabilis avatar Jun 05 '22 14:06 intractabilis

So i built the benchmark program by hand:

I think the difference is because you are using long double, while my compilation asks the compiler to use SIMD. long double not only prevents the use of advanced CPU features, but also incurs extra work when the rest of the code uses SIMD.

intractabilis avatar Jun 05 '22 15:06 intractabilis

Even if you fix this, it doesn't work on completely not exotic examples. It fails (probably intermittently) at least on 12 pairs of polygons I sent you. See my initial post.

Let's see what I can do.

kallaballa avatar Jun 06 '22 08:06 kallaballa

g++ -s -pthread -fno-strict-aliasing -L/usr/local/lib -I/usr/local/include -lboost_filesystem -lboost_system -lbenchmark -lgmp -lmpfr -std=c++20 -march=native -I../../src -DNDEBUG -g0 -Ofast -flto -o nfp-orb-perf-check ../nfp-orb-perf-check.cpp

You are right, I forgot to tune. Now I tried with -mtune=native and -mtune=intel: native made it slower, while intel made no difference.

kallaballa avatar Jun 06 '22 08:06 kallaballa

You are right, I forgot to tune. Now I tried with -mtune=native and -mtune=intel: native made it slower, while intel made no difference.

mtune should be irrelevant because it doesn't change the instruction set.

intractabilis avatar Jun 06 '22 11:06 intractabilis

You are right! But that means I've been compiling with SIMD in each of my examples, because I used -march=native.

kallaballa avatar Jun 06 '22 11:06 kallaballa

You are right! But that means I've been compiling with SIMD in each of my examples, because I used -march=native.

Ok, then I am wrong about long double.

intractabilis avatar Jun 06 '22 12:06 intractabilis

You are right! But that means I've been compiling with SIMD in each of my examples, because I used -march=native.

Ok, then I am wrong about long double.

Did you test with Boost NFP or Clipper? Clipper is faster than CGAL and Boost (currently the best is 2-3 times faster than CGAL).

YggSky avatar Jun 15 '22 07:06 YggSky

Did you test with Boost NFP or Clipper? Clipper is faster than CGAL and Boost (currently the best is 2-3 times faster than CGAL).

As far as I know, clipper doesn't work with non-convex shapes. Boost doesn't have NFP.

intractabilis avatar Jun 15 '22 23:06 intractabilis

Did you test with Boost NFP or Clipper? Clipper is faster than CGAL and Boost (currently the best is 2-3 times faster than CGAL).

As far as I know, clipper doesn't work with non-convex shapes. Boost doesn't have NFP.

I'm sure you haven't actually used it. Clipper can work with non-convex shapes, because it does a convex decomposition and finally merges all the decomposed convex pieces with boolean operations to get the non-convex NFP.

Just test it and you will see, or look at the original code; it clearly answers your question of why it can deal with non-convex shapes.

(It feels like you are a newcomer to NFP.)

YggSky avatar Jun 16 '22 08:06 YggSky

Decomposition cannot be faster than reduced convolution, unless the case is very trivial. Show comparative benchmarks on non-convex polygons with a hundred vertices.

intractabilis avatar Jun 17 '22 22:06 intractabilis

Decomposition cannot be faster than reduced convolution, unless the case is very trivial. Show comparative benchmarks on non-convex polygons with a hundred vertices.

But it really is faster than CGAL. Probably I don't understand its true theory, or it deals with non-convex polygons more efficiently than CGAL does.

YggSky avatar Jun 20 '22 03:06 YggSky

Friend, I am ready to believe you, but just post your benchmarks. In my understanding, decomposition cannot be faster than reduced convolution, except in trivial cases. Present your benchmarks for non-convex polygons with a hundred or so vertices.

intractabilis avatar Jun 20 '22 04:06 intractabilis

Friend, I am ready to believe you, but just post your benchmarks. In my understanding, decomposition cannot be faster than reduced convolution, except in trivial cases. Present your benchmarks for non-convex polygons with a hundred or so vertices.

Could you give me your test data? Uploading it to a Git repo would be best.

YggSky avatar Jun 20 '22 07:06 YggSky

Absolutely, I can even give you a benchmark program. Unfortunately, I cannot publish the data because of legal issues. Do you have an email?

intractabilis avatar Jun 20 '22 15:06 intractabilis

Btw., I will be back to this issue; just other things in the pipeline first. :)

kallaballa avatar Jun 20 '22 16:06 kallaballa