Full SME(1) instruction support and STREAMING Groups
This PR implements all available SME (version 1) instructions that are contained within LLVM 14.0.5. Specifically, this is Version 2021-06 of the Armv9-A A64 ISA.
FP16 and BF16 instructions are not supported because C++17 has no native types for them. All Quad-Word instruction variants are emulated using 64-bit data types.
In addition, new STREAMING_SVE and STREAMING_PREDICATE groups have been introduced (along with the corresponding decode logic) so that these instructions can be given a different pipeline / latency configuration when SVE Streaming Mode (the context mode in which SME instructions are executed) is enabled. This allows a co-processor style implementation of SME to be modelled within SimEng: additional latency / reduced throughput can be configured to mimic an offload penalty, and different execution or LD/STR hardware can be modelled for the co-processor compared to the main core.
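To illustrate the idea, here is a minimal sketch of how a group could be remapped at decode time when streaming mode is enabled. It is not the code in this PR: the `Group` enum is a simplified stand-in, `selectGroup` and `latencyFor` are hypothetical helpers, and the latency values (4 and 12 cycles) are made-up placeholders for whatever a config file would supply.

```cpp
#include <cstdint>

// Simplified stand-in for the instruction groups (assumption: the real
// enum contains many more groups and may be laid out differently).
enum class Group : uint8_t {
  SVE,
  PREDICATE,
  STREAMING_SVE,
  STREAMING_PREDICATE
};

// Hypothetical helper: remap a decoded group to its STREAMING counterpart
// when SVE Streaming Mode is enabled; otherwise leave it unchanged.
constexpr Group selectGroup(Group base, bool streamingEnabled) {
  if (!streamingEnabled) return base;
  switch (base) {
    case Group::SVE:
      return Group::STREAMING_SVE;
    case Group::PREDICATE:
      return Group::STREAMING_PREDICATE;
    default:
      return base;
  }
}

// Hypothetical per-group latency lookup. The cycle counts are placeholders
// chosen only to show a larger "offload" latency for STREAMING groups.
constexpr uint16_t latencyFor(Group group) {
  switch (group) {
    case Group::STREAMING_SVE:
    case Group::STREAMING_PREDICATE:
      return 12;
    default:
      return 4;
  }
}

// The same SVE instruction gets a different latency depending on the mode.
static_assert(latencyFor(selectGroup(Group::SVE, false)) == 4, "normal mode");
static_assert(latencyFor(selectGroup(Group::SVE, true)) == 12, "streaming mode");
```

Keeping the remapping in one place means the per-group configuration (latency, throughput, ports) needs no special-casing for streaming mode; it only sees a different group.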
- [x] Add STREAMING Group support
- [x] Add execution logic and regression tests for all missing SME instructions
#rerun tests
The results below are now outdated: the STREAMING groups logic, which was the only cause of the slowdown, has been removed.
See below for this PR's performance compared to dev (times averaged over 5 runs):
| Benchmark | dev Time (ms) | dev StdDev | This PR Time (ms) | % diff to dev | This PR StdDev | |
|---|---|---|---|---|---|---|
| CloverLeaf serial gcc8.3.0 armv8.4 | 13194.4 | 60.1 | 13557.0 | 2.71% | 132.52 | |
| CloverLeaf serial gcc9.3.0 armv8.4 | 13050.6 | 102.7 | 13580.2 | 3.98% | 84.94 | |
| CloverLeaf serial gcc10.3.0 armv8.4 | 13290.4 | 47.9 | 13623.0 | 2.47% | 44.06 | |
| CloverLeaf serial armclang20 armv8.4 | 11804.4 | 39.1 | 12343.2 | 4.46% | 77.05 | |
| CloverLeaf openmp gcc8.3.0 armv8.4 | 17509.4 | 161.5 | 17889.8 | 2.15% | 65.83 | |
| CloverLeaf openmp gcc9.3.0 armv8.4 | 17584.4 | 182.0 | 17995.2 | 2.31% | 152.27 | |
| CloverLeaf openmp gcc10.3.0 armv8.4 | 17119.8 | 61.3 | 17651.4 | 3.06% | 79.05 | |
| CloverLeaf openmp armclang20 armv8.4 | 15820.8 | 95.4 | 16211.0 | 2.44% | 83.58 | |
| miniBUDE openmp gcc8.3.0 armv8.4 | 24691.2 | 52.3 | 24505.6 | -0.75% | 276.93 | |
| miniBUDE openmp gcc9.3.0 armv8.4 | 24500.0 | 175.6 | 24412.8 | -0.36% | 155.77 | |
| miniBUDE openmp gcc10.3.0 armv8.4 | 24438.0 | 146.7 | 24260.6 | -0.73% | 77.47 | |
| miniBUDE openmp armclang20 armv8.4 | 22725.2 | 150.0 | 22343.4 | -1.69% | 67.39 | |
| STREAM serial gcc8.3.0 armv8.4 | 7378.0 | 40.3 | 7769.8 | 5.17% | 29.84 | |
| STREAM serial gcc9.3.0 armv8.4 | 7380.4 | 48.6 | 7722.6 | 4.53% | 68.62 | |
| STREAM serial gcc10.3.0 armv8.4 | 7530.6 | 71.7 | 7632.6 | 1.35% | 39.53 | |
| STREAM serial armclang20 armv8.4 | 8948.0 | 70.6 | 8317.4 | -7.30% | 36.88 | |
| STREAM openmp gcc8.3.0 armv8.4 | 11552.6 | 139.5 | 12020.4 | 3.97% | 111.61 | |
| STREAM openmp gcc9.3.0 armv8.4 | 11737.0 | 133.1 | 11855.8 | 1.01% | 48.96 | |
| STREAM openmp gcc10.3.0 armv8.4 | 11357.4 | 36.4 | 11768.0 | 3.55% | 95.17 | |
| STREAM openmp armclang20 armv8.4 | 12701.0 | 227.5 | 12309.0 | -3.13% | 87.32 | |
| TeaLeaf 2D serial gcc8.3.0 armv8.4 | 13964.4 | 41.8 | 13605.8 | -2.60% | 42.13 | |
| TeaLeaf 2D serial gcc9.3.0 armv8.4 | 13976.2 | 40.8 | 13553.6 | -3.07% | 88.90 | |
| TeaLeaf 2D serial gcc10.3.0 armv8.4 | 14231.0 | 92.2 | 13961.2 | -1.91% | 109.20 | |
| TeaLeaf 2D serial armclang20 armv8.4 | 25691.8 | 86.2 | 24628.8 | -4.22% | 199.33 | |
| TeaLeaf 2D openmp gcc8.3.0 armv8.4 | 20085.2 | 88.6 | 20070.4 | -0.07% | 110.76 | |
| TeaLeaf 2D openmp gcc9.3.0 armv8.4 | 19980.2 | 79.3 | 20492.8 | 2.53% | 146.48 | |
| TeaLeaf 2D openmp gcc10.3.0 armv8.4 | 19684.8 | 88.1 | 19522.4 | -0.83% | 100.20 | |
| TeaLeaf 2D openmp armclang20 armv8.4 | 58068.6 | 251.6 | 61880.2 | 6.36% | 284.36 | |
| TeaLeaf 3D serial gcc8.3.0 armv8.4 | 15853.0 | 128.6 | 15818.2 | -0.22% | 57.76 | |
| TeaLeaf 3D serial gcc9.3.0 armv8.4 | 16483.8 | 58.3 | 16334.6 | -0.91% | 87.93 | |
| TeaLeaf 3D serial gcc10.3.0 armv8.4 | 16839.8 | 86.0 | 16521.4 | -1.91% | 28.94 | |
| TeaLeaf 3D serial armclang20 armv8.4 | 23052.2 | 157.0 | 22959.8 | -0.40% | 134.67 | |
| TeaLeaf 3D openmp gcc8.3.0 armv8.4 | 26103.0 | 145.5 | 26294.8 | 0.73% | 190.12 | |
| TeaLeaf 3D openmp gcc9.3.0 armv8.4 | 26203.6 | 103.0 | 27278.8 | 4.02% | 239.28 | |
| TeaLeaf 3D openmp gcc10.3.0 armv8.4 | 26068.2 | 278.0 | 26129.6 | 0.24% | 112.81 | |
| TeaLeaf 3D openmp armclang20 armv8.4 | 45312.4 | 179.0 | 48379.4 | 6.55% | 136.36 | |
| CloverLeaf serial gcc8.3.0 armv8.4+sve | 12763.0 | 89.1 | 13372.0 | 4.66% | 59.14 | |
| CloverLeaf serial gcc9.3.0 armv8.4+sve | 12675.4 | 52.4 | 13300.4 | 4.81% | 134.66 | |
| CloverLeaf serial gcc10.3.0 armv8.4+sve | 12665.4 | 88.7 | 13086.4 | 3.27% | 63.11 | |
| CloverLeaf serial armclang20 armv8.4+sve | 12512.8 | 79.5 | 12963.4 | 3.54% | 71.92 | |
| CloverLeaf openmp gcc8.3.0 armv8.4+sve | 16973.8 | 119.5 | 17630.2 | 3.79% | 197.66 | |
| CloverLeaf openmp gcc9.3.0 armv8.4+sve | 17076.6 | 132.9 | 17460.8 | 2.22% | 53.09 | |
| CloverLeaf openmp gcc10.3.0 armv8.4+sve | 16814.4 | 96.4 | 17264.4 | 2.64% | 76.24 | |
| CloverLeaf openmp armclang20 armv8.4+sve | 16436.8 | 82.2 | 16844.2 | 2.45% | 98.85 | |
| miniBUDE openmp gcc8.3.0 armv8.4+sve | 9745.6 | 125.8 | 10291.4 | 5.45% | 90.47 | |
| miniBUDE openmp gcc9.3.0 armv8.4+sve | 9172.0 | 41.3 | 10081.6 | 9.45% | 64.37 | |
| miniBUDE openmp gcc10.3.0 armv8.4+sve | 9180.0 | 36.6 | 10054.0 | 9.09% | 61.30 | |
| miniBUDE openmp armclang20 armv8.4+sve | 9746.6 | 63.0 | 10098.8 | 3.55% | 85.55 | |
| STREAM serial gcc8.3.0 armv8.4+sve | 3915.0 | 18.9 | 4139.4 | 5.57% | 15.92 | |
| STREAM serial gcc9.3.0 armv8.4+sve | 3919.4 | 16.7 | 4139.2 | 5.46% | 18.14 | |
| STREAM serial gcc10.3.0 armv8.4+sve | 3862.0 | 29.9 | 4086.2 | 5.64% | 23.04 | |
| STREAM serial armclang20 armv8.4+sve | 2550.2 | 3.7 | 2593.4 | 1.68% | 17.33 | |
| STREAM openmp gcc8.3.0 armv8.4+sve | 7977.4 | 32.4 | 8196.2 | 2.71% | 38.70 | |
| STREAM openmp gcc9.3.0 armv8.4+sve | 7987.4 | 87.9 | 8265.6 | 3.42% | 12.76 | |
| STREAM openmp gcc10.3.0 armv8.4+sve | 7999.2 | 69.2 | 8051.0 | 0.65% | 34.07 | |
| STREAM openmp armclang20 armv8.4+sve | 6836.0 | 10.0 | 6990.8 | 2.24% | 35.39 | |
| TeaLeaf 2D serial gcc8.3.0 armv8.4+sve | 14022.8 | 99.5 | 13579.0 | -3.22% | 59.23 | |
| TeaLeaf 2D serial gcc9.3.0 armv8.4+sve | 13996.4 | 63.8 | 13610.4 | -2.80% | 64.07 | |
| TeaLeaf 2D serial gcc10.3.0 armv8.4+sve | 14362.6 | 59.8 | 13831.0 | -3.77% | 65.83 | |
| TeaLeaf 2D serial armclang20 armv8.4+sve | 9835.2 | 75.5 | 9782.0 | -0.54% | 113.49 | |
| TeaLeaf 2D openmp gcc8.3.0 armv8.4+sve | 19885.8 | 62.1 | 20026.2 | 0.70% | 69.48 | |
| TeaLeaf 2D openmp gcc9.3.0 armv8.4+sve | 20028.2 | 143.4 | 20322.0 | 1.46% | 111.62 | |
| TeaLeaf 2D openmp gcc10.3.0 armv8.4+sve | 19695.6 | 83.6 | 19575.4 | -0.61% | 38.66 | |
| TeaLeaf 2D openmp armclang20 armv8.4+sve | 57176.4 | 405.7 | 59327.4 | 3.69% | 324.77 | |
| TeaLeaf 3D serial gcc8.3.0 armv8.4+sve | 13828.8 | 50.1 | 14023.6 | 1.40% | 51.09 | |
| TeaLeaf 3D serial gcc9.3.0 armv8.4+sve | 13901.6 | 36.4 | 14065.6 | 1.17% | 35.20 | |
| TeaLeaf 3D serial gcc10.3.0 armv8.4+sve | 14043.8 | 58.0 | 14203.2 | 1.13% | 103.32 | |
| TeaLeaf 3D serial armclang20 armv8.4+sve | 22478.8 | 138.4 | 22850.6 | 1.64% | 51.21 | |
| TeaLeaf 3D openmp gcc8.3.0 armv8.4+sve | 23927.6 | 73.3 | 24201.0 | 1.14% | 94.22 | |
| TeaLeaf 3D openmp gcc9.3.0 armv8.4+sve | 23638.8 | 119.3 | 24663.4 | 4.24% | 138.94 | |
| TeaLeaf 3D openmp gcc10.3.0 armv8.4+sve | 23550.4 | 130.2 | 24060.4 | 2.14% | 31.31 | |
| TeaLeaf 3D openmp armclang20 armv8.4+sve | 48104.8 | 253.4 | 50293.2 | 4.45% | 319.78 |
I assume that it has been checked, but a reminder of the SME loops comment in #415 (review)
This PR doesn't add support for many of those loops: most LD/STR instructions are still missing because they need a newer ISA version than this PR targets. Without access to physical hardware (e.g. an Apple M4) on which we can generate small tests and verify their output against our current regression tests for these instructions, there currently aren't any binaries we could run to validate most of the instructions added in this PR.
For PR #439, the loops could be used (as that PR targets them), but this would still need to be done privately.