8312425: [vectorapi] AArch64: Optimize vector math operations with SLEEF
Hi, can you help to review this patch? This PR is based on previous work and discussion in pr 16234 and pr 18294.
Compared with the previous PRs, the major change in this PR is to integrate the SLEEF source (for the steps, please check src/jdk.incubator.vector/linux/native/libvectormath/README) rather than depend on external SLEEF artifacts (headers or libraries) at build or run time.
Besides this change, the previous changes are also adjusted accordingly, e.g. removing some unnecessary files or changes, especially in the make directory of the JDK.
Besides the code changes, one important task is to handle the legal process.
Thanks!
Performance
Options
- +intrinsic: 'FORK=1;ITER=10;WARMUP_ITER=10;JAVA_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+EnableVectorSupport -XX:+UseVectorStubs'
- -intrinsic: 'FORK=1;ITER=10;WARMUP_ITER=10;JAVA_OPTIONS=-XX:+UnlockExperimentalVMOptions -XX:+EnableVectorSupport -XX:-UseVectorStubs'
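For context, benchmarks such as Float128Vector.SIN measure lanewise math operations from the Vector API. A minimal sketch of the kind of kernel they exercise is shown below (simplified, not the actual JMH benchmark source; it assumes the jdk.incubator.vector module is added at compile and run time):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SinKernel {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_128;

    // Computes r[i] = sin(a[i]); with -XX:+UseVectorStubs the lanewise SIN
    // call can be compiled to a call into the SLEEF-backed vector math stub.
    static void sin(float[] a, float[] r) {
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            var m = SPECIES.indexInRange(i, a.length);
            FloatVector.fromArray(SPECIES, a, i, m)
                       .lanewise(VectorOperators.SIN)
                       .intoArray(r, i, m);
        }
    }
}
```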
Float
| Benchmark | (size) | Mode | Cnt | Error | Units | Score +intrinsic (UseSVE=1) | Score -intrinsic (UseSVE=1) | Improvement (UseSVE=1) | Score +intrinsic (UseSVE=0) | Score -intrinsic (UseSVE=0) | Improvement (UseSVE=0) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Float128Vector.ACOS | 1024 | thrpt | 10 | 0.015 | ops/ms | 245.439 | 101.483 | 2.419 | 245.733 | 102.033 | 2.408 |
| Float128Vector.ASIN | 1024 | thrpt | 10 | 0.013 | ops/ms | 296.702 | 103.559 | 2.865 | 296.741 | 103.18 | 2.876 |
| Float128Vector.ATAN | 1024 | thrpt | 10 | 0.004 | ops/ms | 196.862 | 49.627 | 3.967 | 195.891 | 49.771 | 3.936 |
| Float128Vector.ATAN2 | 1024 | thrpt | 10 | 0.021 | ops/ms | 135.088 | 32.449 | 4.163 | 135.721 | 32.579 | 4.166 |
| Float128Vector.CBRT | 1024 | thrpt | 10 | 0.004 | ops/ms | 114.547 | 39.517 | 2.899 | 114.756 | 39.273 | 2.922 |
| Float128Vector.COS | 1024 | thrpt | 10 | 0.006 | ops/ms | 93.226 | 62.883 | 1.483 | 93.195 | 63.116 | 1.477 |
| Float128Vector.COSH | 1024 | thrpt | 10 | 0.005 | ops/ms | 154.498 | 76.58 | 2.017 | 154.147 | 77.026 | 2.001 |
| Float128Vector.EXP | 1024 | thrpt | 10 | 0.248 | ops/ms | 483.569 | 83.614 | 5.783 | 502.786 | 83.424 | 6.027 |
| Float128Vector.EXPM1 | 1024 | thrpt | 10 | 0.01 | ops/ms | 156.338 | 62.091 | 2.518 | 157.589 | 62.008 | 2.541 |
| Float128Vector.HYPOT | 1024 | thrpt | 10 | 0.007 | ops/ms | 191.217 | 56.834 | 3.364 | 191.247 | 58.624 | 3.262 |
| Float128Vector.LOG | 1024 | thrpt | 10 | 0.019 | ops/ms | 258.223 | 52.005 | 4.965 | 259.642 | 52.018 | 4.991 |
| Float128Vector.LOG10 | 1024 | thrpt | 10 | 0.004 | ops/ms | 238.916 | 43.311 | 5.516 | 240.135 | 43.352 | 5.539 |
| Float128Vector.LOG1P | 1024 | thrpt | 10 | 0.112 | ops/ms | 246.507 | 42.227 | 5.838 | 246.546 | 42.24 | 5.837 |
| Float128Vector.POW | 1024 | thrpt | 10 | 0.033 | ops/ms | 73.78 | 25.17 | 2.931 | 73.693 | 25.113 | 2.934 |
| Float128Vector.SIN | 1024 | thrpt | 10 | 0.004 | ops/ms | 95.509 | 62.807 | 1.521 | 95.792 | 62.883 | 1.523 |
| Float128Vector.SINH | 1024 | thrpt | 10 | 0.011 | ops/ms | 153.177 | 77.586 | 1.974 | 152.97 | 77.248 | 1.98 |
| Float128Vector.TAN | 1024 | thrpt | 10 | 0.002 | ops/ms | 74.394 | 32.662 | 2.278 | 74.491 | 32.639 | 2.282 |
| Float128Vector.TANH | 1024 | thrpt | 10 | 0.005 | ops/ms | 129.308 | 144.581 | 0.894 | 129.319 | 144.916 | 0.892 |
| Float256Vector.ACOS | 1024 | thrpt | 10 | 0.311 | ops/ms | 378.109 | 135.118 | 2.798 | 122.381 | 123.502 | 0.991 |
| Float256Vector.ASIN | 1024 | thrpt | 10 | 1.039 | ops/ms | 452.692 | 135.067 | 3.352 | 126.037 | 123.53 | 1.02 |
| Float256Vector.ATAN | 1024 | thrpt | 10 | 0.017 | ops/ms | 288.785 | 62.032 | 4.655 | 59.783 | 59.821 | 0.999 |
| Float256Vector.ATAN2 | 1024 | thrpt | 10 | 0.065 | ops/ms | 217.573 | 40.843 | 5.327 | 38.337 | 38.352 | 1 |
| Float256Vector.CBRT | 1024 | thrpt | 10 | 0.042 | ops/ms | 185.721 | 49.353 | 3.763 | 46.273 | 46.279 | 1 |
| Float256Vector.COS | 1024 | thrpt | 10 | 0.036 | ops/ms | 163.584 | 78.947 | 2.072 | 70.544 | 70.74 | 0.997 |
| Float256Vector.COSH | 1024 | thrpt | 10 | 0.01 | ops/ms | 211.746 | 96.885 | 2.186 | 84.078 | 84.366 | 0.997 |
| Float256Vector.EXP | 1024 | thrpt | 10 | 0.121 | ops/ms | 954.69 | 117.145 | 8.15 | 97.97 | 97.713 | 1.003 |
| Float256Vector.EXPM1 | 1024 | thrpt | 10 | 0.055 | ops/ms | 213.462 | 79.832 | 2.674 | 74.292 | 74.36 | 0.999 |
| Float256Vector.HYPOT | 1024 | thrpt | 10 | 0.052 | ops/ms | 306.511 | 74.208 | 4.13 | 68.856 | 69.077 | 0.997 |
| Float256Vector.LOG | 1024 | thrpt | 10 | 0.216 | ops/ms | 406.914 | 65.408 | 6.221 | 59.808 | 59.767 | 1.001 |
| Float256Vector.LOG10 | 1024 | thrpt | 10 | 0.37 | ops/ms | 371.385 | 53.156 | 6.987 | 49.334 | 49.171 | 1.003 |
| Float256Vector.LOG1P | 1024 | thrpt | 10 | 1.851 | ops/ms | 397.247 | 52.042 | 7.633 | 50.181 | 50.199 | 1 |
| Float256Vector.POW | 1024 | thrpt | 10 | 0.048 | ops/ms | 115.155 | 27.174 | 4.238 | 24.659 | 24.703 | 0.998 |
| Float256Vector.SIN | 1024 | thrpt | 10 | 0.107 | ops/ms | 154.975 | 79.103 | 1.959 | 70.9 | 70.615 | 1.004 |
| Float256Vector.SINH | 1024 | thrpt | 10 | 0.351 | ops/ms | 202.683 | 97.643 | 2.076 | 84.587 | 84.371 | 1.003 |
| Float256Vector.TAN | 1024 | thrpt | 10 | 0.005 | ops/ms | 127.597 | 37.136 | 3.436 | 34.774 | 34.757 | 1 |
| Float256Vector.TANH | 1024 | thrpt | 10 | 1.233 | ops/ms | 249.084 | 247.272 | 1.007 | 169.903 | 169.805 | 1.001 |
| Float512Vector.ACOS | 1024 | thrpt | 10 | 0.069 | ops/ms | 148.467 | 152.264 | 0.975 | 150.131 | 154.717 | 0.97 |
| Float512Vector.ASIN | 1024 | thrpt | 10 | 0.287 | ops/ms | 147.144 | 158.074 | 0.931 | 147.251 | 148.71 | 0.99 |
| Float512Vector.ATAN | 1024 | thrpt | 10 | 0.101 | ops/ms | 68.498 | 67.987 | 1.008 | 67.968 | 68.131 | 0.998 |
| Float512Vector.ATAN2 | 1024 | thrpt | 10 | 0.016 | ops/ms | 44.189 | 44.052 | 1.003 | 43.898 | 43.781 | 1.003 |
| Float512Vector.CBRT | 1024 | thrpt | 10 | 0.012 | ops/ms | 53.514 | 53.672 | 0.997 | 53.623 | 53.635 | 1 |
| Float512Vector.COS | 1024 | thrpt | 10 | 0.222 | ops/ms | 80.566 | 80.713 | 0.998 | 80.672 | 80.796 | 0.998 |
| Float512Vector.COSH | 1024 | thrpt | 10 | 0.104 | ops/ms | 102.175 | 102.038 | 1.001 | 102.303 | 102.009 | 1.003 |
| Float512Vector.EXP | 1024 | thrpt | 10 | 0.255 | ops/ms | 118.824 | 118.942 | 0.999 | 118.551 | 118.976 | 0.996 |
| Float512Vector.EXPM1 | 1024 | thrpt | 10 | 0.021 | ops/ms | 87.363 | 87.153 | 1.002 | 87.842 | 87.387 | 1.005 |
| Float512Vector.HYPOT | 1024 | thrpt | 10 | 0.048 | ops/ms | 86.838 | 86.439 | 1.005 | 86.903 | 86.709 | 1.002 |
| Float512Vector.LOG | 1024 | thrpt | 10 | 0.017 | ops/ms | 70.794 | 70.746 | 1.001 | 70.469 | 70.62 | 0.998 |
| Float512Vector.LOG10 | 1024 | thrpt | 10 | 0.051 | ops/ms | 55.821 | 55.85 | 0.999 | 55.883 | 55.773 | 1.002 |
| Float512Vector.LOG1P | 1024 | thrpt | 10 | 0.085 | ops/ms | 57.113 | 57.582 | 0.992 | 56.942 | 57.245 | 0.995 |
| Float512Vector.POW | 1024 | thrpt | 10 | 0.006 | ops/ms | 26.66 | 26.656 | 1 | 26.651 | 26.641 | 1 |
| Float512Vector.SIN | 1024 | thrpt | 10 | 0.067 | ops/ms | 80.873 | 80.806 | 1.001 | 80.638 | 80.456 | 1.002 |
| Float512Vector.SINH | 1024 | thrpt | 10 | 0.16 | ops/ms | 103.818 | 102.766 | 1.01 | 102.669 | 103.83 | 0.989 |
| Float512Vector.TAN | 1024 | thrpt | 10 | 0.148 | ops/ms | 38.107 | 37.971 | 1.004 | 37.938 | 37.862 | 1.002 |
| Float512Vector.TANH | 1024 | thrpt | 10 | 1.206 | ops/ms | 237.573 | 235.876 | 1.007 | 236.684 | 236.724 | 1 |
| Float64Vector.ACOS | 1024 | thrpt | 10 | 0.006 | ops/ms | 123.038 | 64.939 | 1.895 | 123.07 | 65.556 | 1.877 |
| Float64Vector.ASIN | 1024 | thrpt | 10 | 0.006 | ops/ms | 148.56 | 65.115 | 2.282 | 148.576 | 66.468 | 2.235 |
| Float64Vector.ATAN | 1024 | thrpt | 10 | 0.003 | ops/ms | 98.512 | 40.569 | 2.428 | 98.458 | 40.932 | 2.405 |
| Float64Vector.ATAN2 | 1024 | thrpt | 10 | 0.004 | ops/ms | 67.706 | 24.824 | 2.727 | 68.214 | 25.157 | 2.712 |
| Float64Vector.CBRT | 1024 | thrpt | 10 | 0.001 | ops/ms | 57.299 | 29.725 | 1.928 | 57.343 | 29.279 | 1.959 |
| Float64Vector.COS | 1024 | thrpt | 10 | 0.008 | ops/ms | 46.689 | 44.153 | 1.057 | 46.67 | 43.683 | 1.068 |
| Float64Vector.COSH | 1024 | thrpt | 10 | 0.005 | ops/ms | 77.552 | 51.012 | 1.52 | 77.66 | 51.285 | 1.514 |
| Float64Vector.EXP | 1024 | thrpt | 10 | 0.257 | ops/ms | 242.736 | 54.277 | 4.472 | 248.345 | 54.298 | 4.574 |
| Float64Vector.EXPM1 | 1024 | thrpt | 10 | 0.003 | ops/ms | 78.741 | 45.22 | 1.741 | 79.082 | 45.396 | 1.742 |
| Float64Vector.HYPOT | 1024 | thrpt | 10 | 0.002 | ops/ms | 95.716 | 36.135 | 2.649 | 95.702 | 36.424 | 2.627 |
| Float64Vector.LOG | 1024 | thrpt | 10 | 0.006 | ops/ms | 130.395 | 38.954 | 3.347 | 130.321 | 38.99 | 3.342 |
| Float64Vector.LOG10 | 1024 | thrpt | 10 | 0.003 | ops/ms | 119.783 | 33.912 | 3.532 | 120.254 | 33.951 | 3.542 |
| Float64Vector.LOG1P | 1024 | thrpt | 10 | 0.006 | ops/ms | 123.966 | 34.381 | 3.606 | 123.984 | 34.291 | 3.616 |
| Float64Vector.POW | 1024 | thrpt | 10 | 0.003 | ops/ms | 36.872 | 21.747 | 1.695 | 36.774 | 21.639 | 1.699 |
| Float64Vector.SIN | 1024 | thrpt | 10 | 0.002 | ops/ms | 48.008 | 44.076 | 1.089 | 48.001 | 43.989 | 1.091 |
| Float64Vector.SINH | 1024 | thrpt | 10 | 0.004 | ops/ms | 76.711 | 50.893 | 1.507 | 76.936 | 51.236 | 1.502 |
| Float64Vector.TAN | 1024 | thrpt | 10 | 0.006 | ops/ms | 37.286 | 26.095 | 1.429 | 37.283 | 26.06 | 1.431 |
| Float64Vector.TANH | 1024 | thrpt | 10 | 0.004 | ops/ms | 64.71 | 79.799 | 0.811 | 64.741 | 79.924 | 0.81 |
| FloatMaxVector.ACOS | 1024 | thrpt | 10 | 0.103 | ops/ms | 378.138 | 136.187 | 2.777 | 245.725 | 102.05 | 2.408 |
| FloatMaxVector.ASIN | 1024 | thrpt | 10 | 1.013 | ops/ms | 452.441 | 135.287 | 3.344 | 296.708 | 103.589 | 2.864 |
| FloatMaxVector.ATAN | 1024 | thrpt | 10 | 0.028 | ops/ms | 288.802 | 62.021 | 4.657 | 196.817 | 49.824 | 3.95 |
| FloatMaxVector.ATAN2 | 1024 | thrpt | 10 | 0.037 | ops/ms | 216.386 | 40.889 | 5.292 | 135.756 | 32.75 | 4.145 |
| FloatMaxVector.CBRT | 1024 | thrpt | 10 | 0.269 | ops/ms | 187.141 | 49.382 | 3.79 | 114.819 | 39.203 | 2.929 |
| FloatMaxVector.COS | 1024 | thrpt | 10 | 0.014 | ops/ms | 163.726 | 78.882 | 2.076 | 93.184 | 63.087 | 1.477 |
| FloatMaxVector.COSH | 1024 | thrpt | 10 | 0.006 | ops/ms | 212.544 | 97.49 | 2.18 | 154.547 | 77.685 | 1.989 |
| FloatMaxVector.EXP | 1024 | thrpt | 10 | 0.048 | ops/ms | 955.792 | 117.15 | 8.159 | 488.526 | 83.227 | 5.87 |
| FloatMaxVector.EXPM1 | 1024 | thrpt | 10 | 0.01 | ops/ms | 213.435 | 79.837 | 2.673 | 157.618 | 62.006 | 2.542 |
| FloatMaxVector.HYPOT | 1024 | thrpt | 10 | 0.041 | ops/ms | 308.446 | 74.165 | 4.159 | 191.259 | 58.628 | 3.262 |
| FloatMaxVector.LOG | 1024 | thrpt | 10 | 0.105 | ops/ms | 405.824 | 65.604 | 6.186 | 257.679 | 51.992 | 4.956 |
| FloatMaxVector.LOG10 | 1024 | thrpt | 10 | 0.186 | ops/ms | 371.417 | 53.204 | 6.981 | 240.117 | 43.427 | 5.529 |
| FloatMaxVector.LOG1P | 1024 | thrpt | 10 | 0.713 | ops/ms | 395.943 | 52.002 | 7.614 | 246.515 | 42.196 | 5.842 |
| FloatMaxVector.POW | 1024 | thrpt | 10 | 0.079 | ops/ms | 115.35 | 27.143 | 4.25 | 73.411 | 25.226 | 2.91 |
| FloatMaxVector.SIN | 1024 | thrpt | 10 | 0.04 | ops/ms | 154.421 | 79.424 | 1.944 | 95.548 | 62.973 | 1.517 |
| FloatMaxVector.SINH | 1024 | thrpt | 10 | 0.04 | ops/ms | 202.51 | 97.974 | 2.067 | 153.3 | 77.106 | 1.988 |
| FloatMaxVector.TAN | 1024 | thrpt | 10 | 0.013 | ops/ms | 127.56 | 36.981 | 3.449 | 74.483 | 32.733 | 2.275 |
| FloatMaxVector.TANH | 1024 | thrpt | 10 | 0.792 | ops/ms | 247.428 | 247.743 | 0.999 | 129.375 | 144.932 | 0.893 |
| FloatScalar.ACOS | 1024 | thrpt | 10 | 0.09 | ops/ms | 337.034 | 337.102 | 1 | 336.994 | 337.001 | 1 |
| FloatScalar.ASIN | 1024 | thrpt | 10 | 0.096 | ops/ms | 351.308 | 351.34 | 1 | 351.273 | 351.293 | 1 |
| FloatScalar.ATAN | 1024 | thrpt | 10 | 0.008 | ops/ms | 91.71 | 91.657 | 1.001 | 91.627 | 91.403 | 1.002 |
| FloatScalar.ATAN2 | 1024 | thrpt | 10 | 0.004 | ops/ms | 58.171 | 58.206 | 0.999 | 58.21 | 58.184 | 1 |
| FloatScalar.CBRT | 1024 | thrpt | 10 | 0.112 | ops/ms | 67.946 | 67.961 | 1 | 67.97 | 67.973 | 1 |
| FloatScalar.COS | 1024 | thrpt | 10 | 0.144 | ops/ms | 109.93 | 109.944 | 1 | 109.961 | 110.002 | 1 |
| FloatScalar.COSH | 1024 | thrpt | 10 | 0.008 | ops/ms | 136.223 | 136.357 | 0.999 | 136.427 | 136.5 | 0.999 |
| FloatScalar.EXP | 1024 | thrpt | 10 | 0.141 | ops/ms | 176.773 | 176.585 | 1.001 | 176.884 | 176.818 | 1 |
| FloatScalar.EXPM1 | 1024 | thrpt | 10 | 0.015 | ops/ms | 127.417 | 127.504 | 0.999 | 127.536 | 126.957 | 1.005 |
| FloatScalar.HYPOT | 1024 | thrpt | 10 | 0.006 | ops/ms | 162.621 | 162.834 | 0.999 | 162.766 | 162.404 | 1.002 |
| FloatScalar.LOG | 1024 | thrpt | 10 | 0.029 | ops/ms | 92.565 | 92.4 | 1.002 | 92.567 | 92.565 | 1 |
| FloatScalar.LOG10 | 1024 | thrpt | 10 | 0.005 | ops/ms | 70.792 | 70.774 | 1 | 70.789 | 70.799 | 1 |
| FloatScalar.LOG1P | 1024 | thrpt | 10 | 0.051 | ops/ms | 73.908 | 74.572 | 0.991 | 73.898 | 74.61 | 0.99 |
| FloatScalar.POW | 1024 | thrpt | 10 | 0.003 | ops/ms | 30.554 | 30.566 | 1 | 30.561 | 30.556 | 1 |
| FloatScalar.SIN | 1024 | thrpt | 10 | 0.248 | ops/ms | 109.954 | 109.57 | 1.004 | 109.873 | 109.842 | 1 |
| FloatScalar.SINH | 1024 | thrpt | 10 | 0.005 | ops/ms | 139.617 | 139.616 | 1 | 139.432 | 139.242 | 1.001 |
| FloatScalar.TAN | 1024 | thrpt | 10 | 0.007 | ops/ms | 44.327 | 44.16 | 1.004 | 44.478 | 44.401 | 1.002 |
| FloatScalar.TANH | 1024 | thrpt | 10 | 0.362 | ops/ms | 545.506 | 545.688 | 1 | 545.744 | 545.604 | 1 |
Double
| Benchmark | (size) | Mode | Cnt | Error | Units | Score +intrinsic (UseSVE=1) | Score -intrinsic (UseSVE=1) | Improvement (UseSVE=1) | Score +intrinsic (UseSVE=0) | Score -intrinsic (UseSVE=0) | Improvement (UseSVE=0) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Double128Vector.ACOS | 1024 | thrpt | 10 | 0.005 | ops/ms | 117.913 | 67.641 | 1.743 | 117.977 | 67.793 | 1.74 |
| Double128Vector.ASIN | 1024 | thrpt | 10 | 0.006 | ops/ms | 145.789 | 68.392 | 2.132 | 145.518 | 68.181 | 2.134 |
| Double128Vector.ATAN | 1024 | thrpt | 10 | 0.004 | ops/ms | 87.644 | 42.752 | 2.05 | 87.544 | 43.136 | 2.029 |
| Double128Vector.ATAN2 | 1024 | thrpt | 10 | 0.003 | ops/ms | 60.414 | 26.235 | 2.303 | 60.182 | 26.313 | 2.287 |
| Double128Vector.CBRT | 1024 | thrpt | 10 | 0.001 | ops/ms | 52.679 | 30.617 | 1.721 | 52.657 | 30.69 | 1.716 |
| Double128Vector.COS | 1024 | thrpt | 10 | 0.004 | ops/ms | 71.501 | 47.165 | 1.516 | 71.612 | 47.114 | 1.52 |
| Double128Vector.COSH | 1024 | thrpt | 10 | 0.007 | ops/ms | 82.195 | 53.846 | 1.526 | 82.372 | 54.144 | 1.521 |
| Double128Vector.EXP | 1024 | thrpt | 10 | 0.012 | ops/ms | 216.471 | 58.192 | 3.72 | 217.261 | 58.271 | 3.728 |
| Double128Vector.EXPM1 | 1024 | thrpt | 10 | 0.007 | ops/ms | 95.372 | 48.037 | 1.985 | 95.799 | 47.954 | 1.998 |
| Double128Vector.HYPOT | 1024 | thrpt | 10 | 0.002 | ops/ms | 88.137 | 37.331 | 2.361 | 87.856 | 37.307 | 2.355 |
| Double128Vector.LOG | 1024 | thrpt | 10 | 0.038 | ops/ms | 98.972 | 41.669 | 2.375 | 99.046 | 41.723 | 2.374 |
| Double128Vector.LOG10 | 1024 | thrpt | 10 | 0.004 | ops/ms | 83.921 | 36.163 | 2.321 | 83.844 | 36.099 | 2.323 |
| Double128Vector.LOG1P | 1024 | thrpt | 10 | 0.006 | ops/ms | 86.526 | 36.291 | 2.384 | 86.592 | 36.148 | 2.395 |
| Double128Vector.POW | 1024 | thrpt | 10 | 0.001 | ops/ms | 34.439 | 21.817 | 1.579 | 34.373 | 21.618 | 1.59 |
| Double128Vector.SIN | 1024 | thrpt | 10 | 0.007 | ops/ms | 82.248 | 47.064 | 1.748 | 82.63 | 47.524 | 1.739 |
| Double128Vector.SINH | 1024 | thrpt | 10 | 0.005 | ops/ms | 80.27 | 53.565 | 1.499 | 80.404 | 53.438 | 1.505 |
| Double128Vector.TAN | 1024 | thrpt | 10 | 0.001 | ops/ms | 56.221 | 27.615 | 2.036 | 56.516 | 27.792 | 2.034 |
| Double128Vector.TANH | 1024 | thrpt | 10 | 0.011 | ops/ms | 64.979 | 83.143 | 0.782 | 65.652 | 82.771 | 0.793 |
| Double256Vector.ACOS | 1024 | thrpt | 10 | 0.455 | ops/ms | 179.103 | 112.49 | 1.592 | 87.833 | 88.651 | 0.991 |
| Double256Vector.ASIN | 1024 | thrpt | 10 | 0.691 | ops/ms | 212.368 | 112.884 | 1.881 | 88.369 | 88.365 | 1 |
| Double256Vector.ATAN | 1024 | thrpt | 10 | 0.008 | ops/ms | 120.882 | 55.861 | 2.164 | 49.106 | 48.979 | 1.003 |
| Double256Vector.ATAN2 | 1024 | thrpt | 10 | 0.006 | ops/ms | 98.254 | 33.362 | 2.945 | 30.514 | 30.556 | 0.999 |
| Double256Vector.CBRT | 1024 | thrpt | 10 | 0.016 | ops/ms | 89.053 | 43.473 | 2.048 | 38.255 | 37.885 | 1.01 |
| Double256Vector.COS | 1024 | thrpt | 10 | 0.03 | ops/ms | 119.208 | 65.874 | 1.81 | 57.119 | 57.033 | 1.002 |
| Double256Vector.COSH | 1024 | thrpt | 10 | 0.01 | ops/ms | 124.26 | 76.188 | 1.631 | 63.477 | 63.002 | 1.008 |
| Double256Vector.EXP | 1024 | thrpt | 10 | 0.048 | ops/ms | 390.922 | 88.453 | 4.42 | 72.249 | 72.248 | 1 |
| Double256Vector.EXPM1 | 1024 | thrpt | 10 | 0.017 | ops/ms | 121.844 | 66.475 | 1.833 | 57.431 | 57.36 | 1.001 |
| Double256Vector.HYPOT | 1024 | thrpt | 10 | 0.034 | ops/ms | 138.774 | 60.148 | 2.307 | 51.837 | 51.881 | 0.999 |
| Double256Vector.LOG | 1024 | thrpt | 10 | 0.073 | ops/ms | 165.474 | 55.445 | 2.984 | 48.7 | 48.571 | 1.003 |
| Double256Vector.LOG10 | 1024 | thrpt | 10 | 0.015 | ops/ms | 144.862 | 44.937 | 3.224 | 40.579 | 40.624 | 0.999 |
| Double256Vector.LOG1P | 1024 | thrpt | 10 | 0.21 | ops/ms | 151.807 | 46.401 | 3.272 | 40.943 | 41.158 | 0.995 |
| Double256Vector.POW | 1024 | thrpt | 10 | 0.003 | ops/ms | 53.228 | 25.144 | 2.117 | 21.862 | 21.852 | 1 |
| Double256Vector.SIN | 1024 | thrpt | 10 | 0.007 | ops/ms | 130.875 | 65.753 | 1.99 | 57.42 | 57.172 | 1.004 |
| Double256Vector.SINH | 1024 | thrpt | 10 | 0.004 | ops/ms | 120.093 | 76.13 | 1.577 | 63.283 | 62.823 | 1.007 |
| Double256Vector.TAN | 1024 | thrpt | 10 | 0.073 | ops/ms | 79.318 | 33.242 | 2.386 | 30.463 | 30.322 | 1.005 |
| Double256Vector.TANH | 1024 | thrpt | 10 | 1.633 | ops/ms | 152.914 | 154.668 | 0.989 | 107.585 | 7.441 | 14.458 |
| Double512Vector.ACOS | 1024 | thrpt | 10 | 0.1 | ops/ms | 122.582 | 121.073 | 1.012 | 123.136 | 22.485 | 5.476 |
| Double512Vector.ASIN | 1024 | thrpt | 10 | 0.099 | ops/ms | 123.678 | 122.482 | 1.01 | 121.616 | 22.78 | 5.339 |
| Double512Vector.ATAN | 1024 | thrpt | 10 | 0.14 | ops/ms | 61.939 | 61.928 | 1 | 61.821 | 62.013 | 0.997 |
| Double512Vector.ATAN2 | 1024 | thrpt | 10 | 0.014 | ops/ms | 38.638 | 38.541 | 1.003 | 38.668 | 38.697 | 0.999 |
| Double512Vector.CBRT | 1024 | thrpt | 10 | 0.024 | ops/ms | 49.685 | 49.667 | 1 | 49.674 | 49.634 | 1.001 |
| Double512Vector.COS | 1024 | thrpt | 10 | 0.046 | ops/ms | 74.125 | 73.99 | 1.002 | 74.462 | 72.102 | 1.033 |
| Double512Vector.COSH | 1024 | thrpt | 10 | 0.15 | ops/ms | 86.945 | 87.2 | 0.997 | 87.111 | 87.187 | 0.999 |
| Double512Vector.EXP | 1024 | thrpt | 10 | 0.507 | ops/ms | 100.955 | 101.43 | 0.995 | 101.213 | 1.336 | 75.758 |
| Double512Vector.EXPM1 | 1024 | thrpt | 10 | 0.017 | ops/ms | 75.648 | 75.012 | 1.008 | 75.632 | 75.293 | 1.005 |
| Double512Vector.HYPOT | 1024 | thrpt | 10 | 0.3 | ops/ms | 72.42 | 72.487 | 0.999 | 72.457 | 72.277 | 1.002 |
| Double512Vector.LOG | 1024 | thrpt | 10 | 0.021 | ops/ms | 64.729 | 64.613 | 1.002 | 64.584 | 64.43 | 1.002 |
| Double512Vector.LOG10 | 1024 | thrpt | 10 | 0.022 | ops/ms | 52.042 | 51.953 | 1.002 | 51.958 | 51.879 | 1.002 |
| Double512Vector.LOG1P | 1024 | thrpt | 10 | 0.103 | ops/ms | 52.239 | 52.169 | 1.001 | 52.161 | 52.176 | 1 |
| Double512Vector.POW | 1024 | thrpt | 10 | 0.008 | ops/ms | 25.488 | 25.473 | 1.001 | 25.462 | 25.461 | 1 |
| Double512Vector.SIN | 1024 | thrpt | 10 | 0.121 | ops/ms | 74.514 | 74.724 | 0.997 | 74.655 | 74.56 | 1.001 |
| Double512Vector.SINH | 1024 | thrpt | 10 | 0.216 | ops/ms | 86.568 | 86.488 | 1.001 | 86.673 | 86.855 | 0.998 |
| Double512Vector.TAN | 1024 | thrpt | 10 | 0.05 | ops/ms | 36.129 | 36.199 | 0.998 | 36.355 | 36.113 | 1.007 |
| Double512Vector.TANH | 1024 | thrpt | 10 | 0.125 | ops/ms | 172.425 | 171.657 | 1.004 | 171.701 | 71.727 | 2.394 |
| Double64Vector.ACOS | 1024 | thrpt | 10 | 0.125 | ops/ms | 29.916 | 30.242 | 0.989 | 30.232 | 30.135 | 1.003 |
| Double64Vector.ASIN | 1024 | thrpt | 10 | 0.008 | ops/ms | 30.677 | 30.58 | 1.003 | 30.396 | 30.524 | 0.996 |
| Double64Vector.ATAN | 1024 | thrpt | 10 | 0.038 | ops/ms | 19.561 | 19.526 | 1.002 | 19.446 | 19.456 | 0.999 |
| Double64Vector.ATAN2 | 1024 | thrpt | 10 | 0.008 | ops/ms | 15.376 | 15.669 | 0.981 | 15.412 | 15.369 | 1.003 |
| Double64Vector.CBRT | 1024 | thrpt | 10 | 0.004 | ops/ms | 13.943 | 13.943 | 1 | 13.873 | 13.89 | 0.999 |
| Double64Vector.COS | 1024 | thrpt | 10 | 0.012 | ops/ms | 20.677 | 20.698 | 0.999 | 20.632 | 20.652 | 0.999 |
| Double64Vector.COSH | 1024 | thrpt | 10 | 0.036 | ops/ms | 22.949 | 23.116 | 0.993 | 23.163 | 23.241 | 0.997 |
| Double64Vector.EXP | 1024 | thrpt | 10 | 0.104 | ops/ms | 23.424 | 23.521 | 0.996 | 23.605 | 23.622 | 0.999 |
| Double64Vector.EXPM1 | 1024 | thrpt | 10 | 0.157 | ops/ms | 22.301 | 22.353 | 0.998 | 21.973 | 22.166 | 0.991 |
| Double64Vector.HYPOT | 1024 | thrpt | 10 | 0.084 | ops/ms | 21.01 | 20.835 | 1.008 | 20.911 | 20.819 | 1.004 |
| Double64Vector.LOG | 1024 | thrpt | 10 | 0.041 | ops/ms | 18.265 | 18.291 | 0.999 | 18.192 | 18.21 | 0.999 |
| Double64Vector.LOG10 | 1024 | thrpt | 10 | 0.003 | ops/ms | 16.502 | 16.441 | 1.004 | 16.393 | 16.433 | 0.998 |
| Double64Vector.LOG1P | 1024 | thrpt | 10 | 0.009 | ops/ms | 16.815 | 16.862 | 0.997 | 16.792 | 16.833 | 0.998 |
| Double64Vector.POW | 1024 | thrpt | 10 | 0.012 | ops/ms | 11.814 | 11.82 | 0.999 | 11.865 | 11.877 | 0.999 |
| Double64Vector.SIN | 1024 | thrpt | 10 | 0.005 | ops/ms | 20.557 | 20.605 | 0.998 | 20.57 | 20.26 | 1.015 |
| Double64Vector.SINH | 1024 | thrpt | 10 | 0.074 | ops/ms | 23.133 | 23.23 | 0.996 | 23.048 | 23.069 | 0.999 |
| Double64Vector.TAN | 1024 | thrpt | 10 | 0.009 | ops/ms | 14.504 | 14.553 | 0.997 | 14.456 | 14.518 | 0.996 |
| Double64Vector.TANH | 1024 | thrpt | 10 | 0.12 | ops/ms | 31.304 | 31.226 | 1.002 | 31.4 | 31.267 | 1.004 |
| DoubleMaxVector.ACOS | 1024 | thrpt | 10 | 0.146 | ops/ms | 179.388 | 112.342 | 1.597 | 118.005 | 67.768 | 1.741 |
| DoubleMaxVector.ASIN | 1024 | thrpt | 10 | 0.169 | ops/ms | 212.342 | 114.107 | 1.861 | 145.676 | 68.143 | 2.138 |
| DoubleMaxVector.ATAN | 1024 | thrpt | 10 | 0.011 | ops/ms | 120.925 | 55.823 | 2.166 | 86.676 | 43.156 | 2.008 |
| DoubleMaxVector.ATAN2 | 1024 | thrpt | 10 | 0.006 | ops/ms | 98.345 | 33.604 | 2.927 | 60.45 | 26.383 | 2.291 |
| DoubleMaxVector.CBRT | 1024 | thrpt | 10 | 0.006 | ops/ms | 88.947 | 43.447 | 2.047 | 52.648 | 30.665 | 1.717 |
| DoubleMaxVector.COS | 1024 | thrpt | 10 | 0.023 | ops/ms | 119.164 | 65.718 | 1.813 | 71.619 | 47.145 | 1.519 |
| DoubleMaxVector.COSH | 1024 | thrpt | 10 | 0.005 | ops/ms | 124.342 | 75.967 | 1.637 | 82.447 | 54.084 | 1.524 |
| DoubleMaxVector.EXP | 1024 | thrpt | 10 | 0.042 | ops/ms | 390.767 | 87.918 | 4.445 | 216.207 | 58.342 | 3.706 |
| DoubleMaxVector.EXPM1 | 1024 | thrpt | 10 | 0.018 | ops/ms | 121.79 | 66.387 | 1.835 | 95.935 | 48.204 | 1.99 |
| DoubleMaxVector.HYPOT | 1024 | thrpt | 10 | 0.011 | ops/ms | 138.549 | 61.183 | 2.265 | 87.859 | 37.39 | 2.35 |
| DoubleMaxVector.LOG | 1024 | thrpt | 10 | 0.034 | ops/ms | 164.687 | 55.44 | 2.971 | 98.446 | 41.873 | 2.351 |
| DoubleMaxVector.LOG10 | 1024 | thrpt | 10 | 0.026 | ops/ms | 144.388 | 44.94 | 3.213 | 84.062 | 36.252 | 2.319 |
| DoubleMaxVector.LOG1P | 1024 | thrpt | 10 | 0.218 | ops/ms | 151.047 | 46.394 | 3.256 | 86.671 | 36.248 | 2.391 |
| DoubleMaxVector.POW | 1024 | thrpt | 10 | 0.004 | ops/ms | 53.241 | 25.251 | 2.108 | 34.371 | 21.58 | 1.593 |
| DoubleMaxVector.SIN | 1024 | thrpt | 10 | 0.003 | ops/ms | 130.708 | 65.451 | 1.997 | 83.012 | 47.547 | 1.746 |
| DoubleMaxVector.SINH | 1024 | thrpt | 10 | 0.007 | ops/ms | 120.654 | 75.693 | 1.594 | 80.603 | 53.586 | 1.504 |
| DoubleMaxVector.TAN | 1024 | thrpt | 10 | 0.062 | ops/ms | 80.045 | 33.268 | 2.406 | 56.48 | 27.723 | 2.037 |
| DoubleMaxVector.TANH | 1024 | thrpt | 10 | 0.99 | ops/ms | 154.334 | 153.197 | 1.007 | 65.401 | 82.937 | 0.789 |
| DoubleScalar.ACOS | 1024 | thrpt | 10 | 0.06 | ops/ms | 342.452 | 342.471 | 1 | 342.471 | 42.461 | 8.066 |
| DoubleScalar.ASIN | 1024 | thrpt | 10 | 0.09 | ops/ms | 353.739 | 354.47 | 0.998 | 352.211 | 54.513 | 6.461 |
| DoubleScalar.ATAN | 1024 | thrpt | 10 | 0.043 | ops/ms | 100.797 | 101.069 | 0.997 | 101.089 | 1.086 | 93.084 |
| DoubleScalar.ATAN2 | 1024 | thrpt | 10 | 0.025 | ops/ms | 62.29 | 62.283 | 1 | 62.218 | 62.227 | 1 |
| DoubleScalar.CBRT | 1024 | thrpt | 10 | 0.014 | ops/ms | 73.922 | 73.929 | 1 | 73.906 | 73.916 | 1 |
| DoubleScalar.COS | 1024 | thrpt | 10 | 0.204 | ops/ms | 117.948 | 117.806 | 1.001 | 117.856 | 17.763 | 6.635 |
| DoubleScalar.COSH | 1024 | thrpt | 10 | 0.016 | ops/ms | 141.113 | 141.083 | 1 | 141.749 | 40.659 | 3.486 |
| DoubleScalar.EXP | 1024 | thrpt | 10 | 0.008 | ops/ms | 189.453 | 188.923 | 1.003 | 189.555 | 89.348 | 2.122 |
| DoubleScalar.EXPM1 | 1024 | thrpt | 10 | 0.051 | ops/ms | 133.617 | 133.549 | 1.001 | 133.224 | 33.61 | 3.964 |
| DoubleScalar.HYPOT | 1024 | thrpt | 10 | 3.613 | ops/ms | 180.215 | 175.912 | 1.024 | 176.083 | 81.916 | 2.15 |
| DoubleScalar.LOG | 1024 | thrpt | 10 | 0.013 | ops/ms | 101.791 | 101.801 | 1 | 101.779 | 1.786 | 56.987 |
| DoubleScalar.LOG10 | 1024 | thrpt | 10 | 0.099 | ops/ms | 76.849 | 76.847 | 1 | 76.807 | 76.757 | 1.001 |
| DoubleScalar.LOG1P | 1024 | thrpt | 10 | 0.081 | ops/ms | 79.261 | 79.298 | 1 | 79.268 | 79.281 | 1 |
| DoubleScalar.POW | 1024 | thrpt | 10 | 0.002 | ops/ms | 31.915 | 31.925 | 1 | 31.919 | 31.92 | 1 |
| DoubleScalar.SIN | 1024 | thrpt | 10 | 0.167 | ops/ms | 118.087 | 117.722 | 1.003 | 118.292 | 18.243 | 6.484 |
| DoubleScalar.SINH | 1024 | thrpt | 10 | 0.012 | ops/ms | 143.901 | 143.803 | 1.001 | 144.228 | 43.922 | 3.284 |
| DoubleScalar.TAN | 1024 | thrpt | 10 | 0.047 | ops/ms | 46.513 | 46.584 | 0.998 | 46.503 | 46.778 | 0.994 |
| DoubleScalar.TANH | 1024 | thrpt | 10 | 0.204 | ops/ms | 552.603 | 561.965 | 0.983 | 561.941 | 61.802 | 9.093 |
Backup of previous test summary
NOTE:
- `Src` means the implementation in this pr, i.e. without a dependency on external SLEEF.
- `Disabled` means intrinsics are disabled by `-XX:-UseVectorStubs`.
- `system_sleef` means the implementation in the previous pr 18294, i.e. the JDK built and run with a dependency on external SLEEF.
Basically, the perf data below shows that
- this implementation has better performance than the previous version in pr 18294,
- and both SLEEF versions have much better performance compared with the non-SLEEF version.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8312425: [vectorapi] AArch64: Optimize vector math operations with SLEEF (Enhancement - P4)
Contributors
- Xiaohong Gong
<[email protected]>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/18605/head:pull/18605
$ git checkout pull/18605
Update a local copy of the PR:
$ git checkout pull/18605
$ git pull https://git.openjdk.org/jdk.git pull/18605/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 18605
View PR using the GUI difftool:
$ git pr show -t 18605
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/18605.diff
Webrev
:wave: Welcome back mli! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
@Hamlin-Li The following labels will be automatically applied to this pull request:
build, hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Webrevs
- 10: Full - Incremental (6061c25d)
- 09: Full - Incremental (da65cfa5)
- 08: Full (b54fc863)
- 07: Full (fe4be2c6)
- 06: Full - Incremental (c279a3c2)
- 05: Full - Incremental (36415c34)
- 04: Full - Incremental (bd9c0931)
- 03: Full - Incremental (cbcd4634)
- 02: Full - Incremental (cd70f5a9)
- 01: Full - Incremental (34529ff1)
- 00: Full (3ab4795d)
/contributor add @XiaohongGong
@Hamlin-Li
Contributor Xiaohong Gong <[email protected]> successfully added.
Just a quick question after giving this a glance: My understanding was that the normal libsleef build set a lot of compiler options, e.g. disabling built-in maths etc. You don't seem to set any of these. Have you determined that they were not needed?
Thanks for having a look and for the quick response. Good question.
Regarding disabling built-in maths, my understanding is that we may not need to care about it, as the built-in math functions in compilers are only used for the scalar versions, while we only use SLEEF's SIMD versions, which I think use vector intrinsics. E.g. in src/libm/sleefdp.c there is an ENABLE_BUILTIN_MATH check, but in src/libm/sleefsimdsp.c there is no such check, so when generating the inline header files I assume its value (whether built-in math is enabled or disabled) does not affect the generated SIMD functions. Please correct me if I'm understanding this wrongly.
For the other compiler options, I tend to agree with you, but I'm not sure which ones might be needed. Can you supply more information or point to some reference about the normal libsleef build? BTW, what I referred to before was from sleef.org and SLEEF on GitHub (including its GitHub workflow).
Build libsleef using their cmake system and look at the compile command line. (You do this by VERBOSE=1 cmake IIRC). Then you can see what flags they are using. This is what I was referring to as "normal libsleef build". I noticed there were a lot of compiler flags. I can't say if they are needed or not. In most cases, if it compiles, it's fine, but in this case, I guess some flags can be crucial to really get the kind of performance you need, and it might not be easy to spot that something is wrong if you get them incorrect. I assume one way to make sure is to run microbenchmarks with an externally built libsleef and compare it with the one you build within the JDK. If there is no noticeable difference, then I guess it is fine.
Thanks for the clarification and the good suggestion. I will verify it and update here later.
Right now I'm having some trouble getting an AArch64 Linux machine; I tried to get a Graviton instance on AWS, but I failed to connect to it after creating it. Previously I ran all the correctness tests via QEMU, but QEMU is not suitable for performance testing. So I will update later when I get the environment ready.
If someone has an environment at hand to verify the performance, that would be very welcome. :)
Just a quick update: this pr introduces a performance regression compared with the previous version (https://github.com/openjdk/jdk/pull/18294) for some math functions (e.g. Double256Vector.COS), and no regression for others (e.g. Double256Vector.ACOS). I'm investigating.
Thank you for the update and for working on this in general.
I've started working on JDK-8329816, preparing the change for the SLEEF specific part of the change. Specifically, I'm currently planning on including the three SLEEF header files, the README and a legal/sleef.md file in that change. Let me know if you have any thoughts/concerns.
Also, just for my understanding, I would love to understand your thoughts on the future here (I apologize if this was already discussed elsewhere):
It seems like SLEEF is (sort of) limited to Linux at this point (the SLEEF README mentions that "Due to limited test capacities, SLEEF is currently only officially supported on Linux with gcc or llvm/clang."). That same README does, however, indicate good test coverage on several architectures in addition to aarch64 (including x86_64, PPC, RISC-V). With that in mind, it looks like we could potentially use SLEEF for other architectures on Linux in the future? And potentially additional operating systems as well?
Thanks a lot, that's great news. Please go ahead and integrate the files via JDK-8329816. :) Besides the performance issue currently found, I have no other concerns.
> Also, just for my understanding, I would love to understand your thoughts on the future here (I apologize if this was already discussed elsewhere):
>
> It seems like SLEEF is (sort of) limited to Linux at this point (the SLEEF README mentions that "Due to limited test capacities, SLEEF is currently only officially supported on Linux with gcc or llvm/clang."). That same README does, however, indicate good test coverage on several architectures in addition to aarch64 (including x86_64, PPC, RISC-V). With that in mind, it looks like we could potentially use SLEEF for other architectures on Linux in the future? And potentially additional operating systems as well?
There is more information at https://sleef.org/compile.xhtml; it seems SLEEF could be formally supported on other OSes in the future, but I'm not sure about any of the plans. For RISC-V, SLEEF itself already supports riscv, please check https://github.com/shibatch/sleef/pull/477. We are just waiting for GCC 14, which will enable support of SLEEF on riscv. Maybe others who have more information could help to comment here. Thanks!
> With that in mind, it looks like we could potentially use SLEEF for other architectures on Linux in the future? And potentially additional operating systems as well?
Hi Mikael (@vidmik)! :)
Thanks for looking into the legal stuff! We are pushing for this since we can leverage these changes when adding SLEEF support for RISC-V.
Fingers crossed about legal!
/Robbin
> Thank you for the update and for working on this in general. I've started working on JDK-8329816, preparing the change for the SLEEF specific part of the change. Specifically, I'm currently planning on including the three SLEEF header files, the README and a legal/sleef.md file in that change. Let me know if you have any thoughts/concerns.
>
> Thanks a lot, that's great news. Please go ahead and integrate the files via JDK-8329816. :) Besides the performance issue currently found, I have no other concerns.
I found the root cause of the performance regression and have a draft solution for it. I'm running a thorough benchmark to see if it works for all the SLEEF functions we use in the JDK. Basically, this solution looks good.
Nice work, Hamlin and Xiaohong. I'm glad to see progress on incorporating SLEEF library into the JDK. (Somehow I missed all previous PRs you posted before.)
I'm not a lawyer, so won't comment on 3rd party library sources under Boost Software License in OpenJDK.
From an engineering perspective, I believe that bundling a vector math library with the JDK is the right thing to do, but it doesn't imply the sources should be part of the JDK. There are already examples of optional dependencies on external native libraries in HotSpot (e.g., the hsdis tool w/ binutils, capstone, and llvm backends).
Speaking of HotSpot-specific changes, IMO it desperately needs a cross-platform interface between vector math libraries and JVM. Most of the changes in StubGenerator are library-specific and are irrelevant in the context of the JVM. I do see that you try to replicate SVML logic, but SVML support didn't set a precedent to follow here.
For background, SVML stubs were initially contributed to Panama as assembly stubs statically linked into libjvm.so. It was acceptable for experimentation purposes, but not for mainline JDK (even for functionality in incubating module). The compromise was to bundle the stubs as a dynamic library and link against them. And that's how it stayed until today.
IMO in order to get SLEEF in, the interaction between JVM and backend native library should be unified. And it should affect both SLEEF and SVML stubs.
In particular, I'd like to see all those named lookups go away from the JVM code. A single call into the library during compiler/VM initialization can produce a fully populated table of function pointers (StubRoutines::_vector_[fd]_math now) for C2 to use later.
FTR there were other alternatives discussed (use Panama FFI or rewrite the stubs in Vector API itself). The latter (complete rewrite) is still something for a distant future, but Foreign Function API is public API now, so once it supports vector calling conventions, it should become fully capable of satisfying Vector API implementation needs to interact with vector math library.
IMO that's what we should keep in mind when designing the new interface. There's no inherent need to keep vector stub support in the JVM. Once the Foreign Function API gains vector support, it should be replaced with a pure Java FFI-based implementation.
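For illustration only, a rough sketch of what an FFI-based binding could look like with today's FFM API follows. It necessarily uses a scalar signature, since the FFM API does not yet expose vector calling conventions, and the library and symbol names here are hypothetical, not part of this PR:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public class VectorMathFfiSketch {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // Hypothetical library and symbol names; a real binding would build a
        // table of such handles, one per operation and vector shape.
        SymbolLookup lookup = SymbolLookup.libraryLookup("vectormath", Arena.global());
        MethodHandle tanh = linker.downcallHandle(
                lookup.find("tanh_d1").orElseThrow(),
                // Scalar signature only: vector calling conventions are not
                // yet expressible through the FFM API.
                FunctionDescriptor.of(ValueLayout.JAVA_DOUBLE, ValueLayout.JAVA_DOUBLE));
        System.out.println((double) tanh.invokeExact(0.5));
    }
}
```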
> Nice work, Hamlin and Xiaohong. I'm glad to see progress on incorporating SLEEF library into the JDK. (Somehow I missed all previous PRs you posted before.)
>
> From an engineering perspective, I believe that bundling a vector math library with the JDK is the right thing to do, but it doesn't imply the sources should be part of the JDK. There are already examples of optional dependencies on external native libraries in HotSpot (e.g., the hsdis tool w/ binutils, capstone, and llvm backends).
No, it doesn't imply that the sources should be part of the JDK, but practical reasons to do with the way that OpenJDK is built and shipped by various parties strongly suggest that we should integrate the SLEEF library into the JDK source tree. If we don't, there will be skew between OpenJDK versions shipped by different vendors. Also, I believe that there is less work for all of us if we integrate rather than having to communicate to everyone building the JDK. And finally, Mark Reinhold has stated that the JDK is not downstream of any other project.
Hey @vidmik, I've fixed the performance issue and updated the SLEEF inline headers and README. You are good to integrate these files via JDK-8329816.
Thanks everyone for the discussion about the direction (integrating the source or the library).
We did have an implementation that integrated the SLEEF library into the JDK, but it seems the strongest opinion previously was to integrate the SLEEF source into the JDK. I know there are pros and cons for every solution, but I will stick to the current solution unless we all reach a different agreement.
I've also updated the pr description with performance data, which shows that
- this implementation has better performance than the previous version in https://github.com/openjdk/jdk/pull/18294,
- and both SLEEF versions have much better performance compared with the non-SLEEF version.
Based on these 2 prs (https://github.com/shibatch/sleef/pull/537, https://github.com/shibatch/sleef/pull/536), no code changes in the SLEEF files are necessary anymore.
Hey @vidmik, I just added the inline header file for riscv64 (GCC 14.1 was just released yesterday), hoping to avoid going through the legal process for the Arm and RISC-V header files separately.
The full implementation on riscv64 will be put in another pr.
@Hamlin-Li This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
in progress...
@Hamlin-Li this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout sleef-aarch64-integrate-source
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
Please tell me the exact command line you used to produce these benchmark results.
Hi @Hamlin-Li , thanks for your work.
I tried to run benchmarks, FloatMaxVector and DoubleMaxVector, on different aarch64 machines.
Here is the data I got for TANH, with args -i 5 -f 3 -wi 3 -foe true -jvmArgs -Xms4g -Xmx4g -XX:+AlwaysPreTouch -XX:ObjectAlignmentInBytes=16:
// NEON machine
Benchmark (size) Mode Cnt Units Perf gain
DoubleMaxVector.TANH 1024 thrpt 15 ops/ms -38%
FloatMaxVector.TANH 1024 thrpt 15 ops/ms -26%
// 128-bit sve machine (TANH also implemented with NEON)
Benchmark (size) Mode Cnt Units Perf gain
DoubleMaxVector.TANH 1024 thrpt 15 ops/ms -19%
FloatMaxVector.TANH 1024 thrpt 15 ops/ms ~00%
The performance of the vector stubs for TANH looks not quite stable on different NEON machines. Since this pr does not provide a TANH stub on SVE machines because of the performance regression, how about also disabling it on NEON for the same reason? WDYT?
Thanks.
@fg1417 Thanks for testing. Sure, I can do that based on your test results; I will resume work on it after https://github.com/openjdk/jdk/pull/19185 is integrated.
@theRealAph I lost my previous VM, so currently I have only generated the header files and have not tested performance since last time. I don't remember passing any special VM options at that time.
I have now wasted two hours trying to duplicate your results.
I need you to write here the exact command line that produced your numbers above, along with the full configure and build options you used.
I also had problems with javac running out of heap space, which was very odd. I fixed it with this:
diff --git a/make/autoconf/boot-jdk.m4 b/make/autoconf/boot-jdk.m4
index 8d272c28ad5..617ccfd8fff 100644
--- a/make/autoconf/boot-jdk.m4
+++ b/make/autoconf/boot-jdk.m4
@@ -470,7 +470,7 @@ AC_DEFUN_ONCE([BOOTJDK_SETUP_BOOT_JDK_ARGUMENTS],
# Maximum amount of heap memory.
JVM_HEAP_LIMIT_32="768"
# Running a 64 bit JVM allows for and requires a bigger heap
- JVM_HEAP_LIMIT_64="1600"
+ JVM_HEAP_LIMIT_64="6400"