openlibm performance on ARM server is very poor
I see very poor results on a modern ARM server: some openlibm implementations are up to 48.68x slower than their system libm counterparts. This was a checkout of the main branch at 12f5ffc. This is also related to #234, but the performance difference here seems even more dramatic.
For interest and potential usefulness to #203, I also compared it against an optimized build of musl 1.2.4:
bench-syslibm | bench-openlibm | bench-musl
pow : 78.6387 MPS | pow : 17.3955 MPS | pow : 57.9493 MPS
hypot : 232.7852 MPS | hypot : 4.7823 MPS | hypot : 139.4793 MPS
exp : 317.8124 MPS | exp : 119.9932 MPS | exp : 215.2262 MPS
log : 228.3188 MPS | log : 97.0294 MPS | log : 181.7701 MPS
log10 : 118.6787 MPS | log10 : 73.0237 MPS | log10 : 76.8402 MPS
sin : 133.0101 MPS | sin : 135.6112 MPS | sin : 165.5926 MPS
cos : 144.4003 MPS | cos : 127.8527 MPS | cos : 150.5435 MPS
tan : 105.8875 MPS | tan : 68.8512 MPS | tan : 78.9428 MPS
asin : 178.2302 MPS | asin : 9.6621 MPS | asin : 88.3722 MPS
acos : 154.1304 MPS | acos : 9.9192 MPS | acos : 98.5818 MPS
atan : 190.8853 MPS | atan : 91.6229 MPS | atan : 97.0451 MPS
atan2 : 56.6821 MPS | atan2 : 42.4876 MPS | atan2 : 47.6644 MPS
GNU libc version: 2.35
GNU libc release: stable
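The benchmark source isn't included above, so as a rough sketch only, here is how a throughput figure like the MPS numbers could be measured, assuming MPS means millions of evaluations per second (the iteration count, argument range, and choice of pow are illustrative):

#define _POSIX_C_SOURCE 199309L
#include <math.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical micro-benchmark: not the bench-* programs used above, only an
 * illustration of how an "MPS" (million evaluations per second) figure can be
 * obtained for a single function. Link against the libm being measured. */
int main(void)
{
    const long iters = 10 * 1000 * 1000;
    volatile double sink = 0.0;   /* keeps the calls from being optimized away */
    double x = 1.0000001;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        sink += pow(x, 1.5);      /* replace with the function under test */
        x += 1e-9;                /* vary the argument slightly each iteration */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("pow : %.4f MPS\n", iters / secs / 1e6);
    (void)sink;
    return 0;
}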
The openlibm compilation line looks like:
cc -fno-gnu89-inline -fno-builtin -O3 -fPIC -std=c99 -Wall -I/home/user/openlibm -I/home/user/openlibm/include -I/home/user/openlibm/aarch64 -I/home/user/openlibm/src -DASSEMBLER -D__BSD_VISIBLE -Wno-implicit-function-declaration -I/home/user/openlibm/ld128 -c src/e_j0.c -o src/e_j0.c.o
I have tried compiling openlibm with just a bare make, and also specifying the architecture directly with make ARCH=aarch64, with identical results.
Is there something we can do about this?
Some important information is missing. What operating system? What compiler/toolchain? What happens if -fno-gnu89-inline is removed from the command line? If you're using gcc, what happens if you use -march=native -mtune=native?
Speed isn't everything. Have you checked accuracy?
Ubuntu 22.04, gcc 11.4.0, running on a Neoverse V1 server. Removing -fno-gnu89-inline made no difference, and specifying -mtune and -march also had no effect. I have run the test suite and everything passed. I also went into the test makefile and re-enabled building of test-float-system and test-double-system, and the following was produced:
$ ./test-double-system
testing double (without inline functions)
Failure: Test: cbrt (-27.0) == -3.0
Result:
is: -3.00000000000000044409e+00 -0x1.80000000000010000000p+1
should be: -3.00000000000000000000e+00 -0x1.80000000000000000000p+1
difference: 4.44089209850062616169e-16 0x1.00000000000000000000p-51
ulp : 1.0000
max.ulp : 0.0000
Failure: Test: cbrt (0.970299) == 0.99
Result:
is: 9.90000000000000102141e-01 0x1.fae147ae147af0000000p-1
should be: 9.89999999999999991118e-01 0x1.fae147ae147ae0000000p-1
difference: 1.11022302462515654042e-16 0x1.00000000000000000000p-53
ulp : 1.0000
max.ulp : 0.0000
Failure: Test: y0 (1.5) == 0.38244892379775884396
Result:
is: 3.82448923797758966181e-01 0x1.87a0b0d06836a0000000p-2
should be: 3.82448923797758855159e-01 0x1.87a0b0d0683680000000p-2
difference: 1.11022302462515654042e-16 0x1.00000000000000000000p-53
ulp : 2.0000
max.ulp : 1.0000
Failure: Test: yn (0, 1.5) == 0.38244892379775884396
Result:
is: 3.82448923797758966181e-01 0x1.87a0b0d06836a0000000p-2
should be: 3.82448923797758855159e-01 0x1.87a0b0d0683680000000p-2
difference: 1.11022302462515654042e-16 0x1.00000000000000000000p-53
ulp : 2.0000
max.ulp : 1.0000
Test suite completed:
1118 test cases plus 932 tests for exception flags executed.
4 errors occurred.
$ ./test-float-system
testing float (without inline functions)
Failure: Test: log10 (0.7) == -0.15490195998574316929
Result:
is: -1.54901981353759765625e-01 -0x1.3d3d4000000000000000p-3
should be: -1.54901966452598571777e-01 -0x1.3d3d3e00000000000000p-3
difference: 1.49011611938476562500e-08 0x1.00000000000000000000p-26
ulp : 1.0000
max.ulp : 0.0000
Failure: Test: tgamma (4) == 6
Result:
is: 6.00000047683715820312e+00 0x1.80000200000000000000p+2
should be: 6.00000000000000000000e+00 0x1.80000000000000000000p+2
difference: 4.76837158203125000000e-07 0x1.00000000000000000000p-21
ulp : 1.0000
max.ulp : 0.0000
Test suite completed:
1101 test cases plus 923 tests for exception flags executed.
2 errors occurred.
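As an aside on the ulp values printed above: the sketch below shows one rough way to check the distance in ulps between a computed and an expected result independently. The bit-mapping trick and the cbrt(-27.0) case are purely illustrative and are not the test suite's own code.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Rough ulp-distance check for one of the failing cases above
 * (cbrt(-27.0) vs the exact -3.0). Illustrative only. */
static int64_t ulp_distance(double a, double b)
{
    int64_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    /* remap negative doubles so integer order matches floating-point order */
    if (ia < 0) ia = INT64_MIN - ia;
    if (ib < 0) ib = INT64_MIN - ib;
    return ia > ib ? ia - ib : ib - ia;
}

int main(void)
{
    double got = cbrt(-27.0);
    printf("cbrt(-27.0) = %.20e, ulps from -3.0 = %lld\n",
           got, (long long)ulp_distance(got, -3.0));
    return 0;
}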
I was, however, able to get slightly better results when compiling with clang 14.0:
openlibm-gcc | openlibm-clang
pow : 17.3955 MPS | pow : 17.0689 MPS
hypot : 4.7823 MPS | hypot : 5.4384 MPS
exp : 119.9932 MPS | exp : 123.1021 MPS
log : 97.0294 MPS | log : 110.2028 MPS
log10 : 73.0237 MPS | log10 : 81.5303 MPS
sin : 135.6112 MPS | sin : 149.5867 MPS
cos : 127.8527 MPS | cos : 138.9140 MPS
tan : 68.8512 MPS | tan : 86.8775 MPS
asin : 9.6621 MPS | asin : 12.8771 MPS
acos : 9.9192 MPS | acos : 13.0346 MPS
atan : 91.6229 MPS | atan : 101.9421 MPS
atan2 : 42.4876 MPS | atan2 : 47.1442 MPS
If you would like me to run further tests, I would be more than happy to do so.
Thanks!
Is there something we can do about this?
Yes: use the CORE-MATH code, which has efficiency comparable to GNU libc (see https://core-math.gitlabpages.inria.fr/64.pdf) and delivers correct rounding.
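For anyone wanting to try that route, the sketch below shows roughly how a CORE-MATH entry point could be called; the correctly rounded functions are exported under cr_-prefixed names (e.g. cr_pow), but the exact build and link steps depend on how the repository is compiled, so treat this as illustrative.

#include <stdio.h>

/* Assumed prototype for CORE-MATH's correctly rounded binary64 pow;
 * the object/library to link against comes from a CORE-MATH checkout. */
extern double cr_pow(double, double);

int main(void)
{
    /* cr_pow returns the correctly rounded result for its inputs */
    printf("cr_pow(2.0, 0.5) = %.17g\n", cr_pow(2.0, 0.5));
    return 0;
}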
Unfortunately, I cannot help at the source code level as I do not have access to an ARM server. There are newer versions of GCC and the release notes show new arm processors have been added as well as changes to code generation since 11.4.0 was released. You may want to try an updated GCC.
In your results, I would look at exp and log to see where the time is spent.
@zimmermann6 Unfortunately, core-math does not compile cleanly on arm, as it uses x86 intrinsics. We may be able to get around this by using a library to convert the calls into their corresponding NEON instructions, but I would worry that there could be subtle differences that would throw off the results. Some tests also fail on arm, for example:
$ ./check.sh pow
Running worst cases check in --rndn mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0p+0
Running worst cases check in --rndz mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0p+0
Running worst cases check in --rndu mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0.0000000000001p-1022 z=-0x0p+0
Running worst cases check in --rndd mode...
FAIL x=-0x1p-1 y=0x1p+64 ref=0x0p+0 z=-0x0.0000000000001p-1022
Building with GCC 12.3.0 (which is the latest version in my package manager), the performance is either the same as it was in 11.4.0, or even slightly slower.
Hi @jmather-sesi, I can reproduce this on cfarm117 and will investigate.
This issue is fixed. For the record, it was due to a different conversion from the double value 0x1p64 to int64_t. I suggest we follow up on core-math issues on the core-math mailing list.
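For readers following along, the snippet below illustrates the kind of divergence described: converting a double that is outside the range of int64_t is undefined behavior in C, and x86-64 and AArch64 hardware give different results (x86-64 cvttsd2si returns INT64_MIN, while AArch64 fcvtzs saturates to INT64_MAX for positive overflow). How exactly this interacted with the pow code path is an assumption based on the comment above.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    double y = 0x1p64;       /* 2^64, not representable in int64_t */
    int64_t k = (int64_t)y;  /* out-of-range conversion: UB, result differs
                                between x86-64 and AArch64 (and compilers may
                                also constant-fold it differently) */
    printf("(int64_t)0x1p64 = %lld\n", (long long)k);
    return 0;
}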