RandomX
RandomX v2 virtual machine changes
- CFROUND becomes conditional with a 1/16 chance of writing into fprc
- F and E registers are mixed together with AES instead of XOR
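The CFROUND change can be sketched in C++ as follows. The rounding mode is taken from the low 2 bits of the source register rotated right by `imm32`, as in RandomX v1. The gating condition is an assumption: this sketch requires bits 2-5 of the rotated value to all be zero, which is true for 1 in 16 uniformly random inputs. The PR may test a different bit field. The function names are hypothetical.

```cpp
#include <cstdint>

// 64-bit rotate right; the shift count is reduced modulo 64.
static inline std::uint64_t rotr64(std::uint64_t x, unsigned c) {
    c &= 63;
    return c == 0 ? x : (x >> c) | (x << (64 - c));
}

// RandomX v1 behavior: CFROUND always sets fprc to (src >>> imm32) mod 4.
void cfround_v1(std::uint32_t& fprc, std::uint64_t src, std::uint32_t imm32) {
    fprc = static_cast<std::uint32_t>(rotr64(src, imm32) & 3);
}

// v2 sketch: fprc is written only when a 4-bit gate field is zero,
// i.e. with probability 1/16 for a uniformly random register value.
// ASSUMPTION: the gate is bits 2..5 of the rotated value (hypothetical choice).
// Returns true if fprc was written.
bool cfround_v2_sketch(std::uint32_t& fprc, std::uint64_t src, std::uint32_t imm32) {
    const std::uint64_t r = rotr64(src, imm32);
    if ((r & 0x3C) != 0) {
        return false;  // fprc keeps its previous rounding mode
    }
    fprc = static_cast<std::uint32_t>(r & 3);
    return true;
}
```

For example, `cfround_v2_sketch(fprc, 1ull << 8, 8)` rotates the register down to `1`, passes the gate, and sets `fprc` to 1. Most other values leave `fprc` untouched.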
This PR is incomplete. Currently, only the X86 and portable versions work, hardware AES is required with the JIT, and the changes are hardcoded. But it's enough to run some benchmarks.
- [x] CFROUND changes in the interpreter
- [x] CFROUND changes in X86 JIT compiler
- [ ] CFROUND changes in A64 JIT compiler
- [ ] V1/V2 selectable at runtime
- [x] New portable intrinsics
- [x] New X86 intrinsics
- [ ] New ARM64 intrinsics
- [ ] New PowerPC intrinsics
- [x] Support for soft AES in X86 JIT compiler
- [ ] Support for soft AES in A64 JIT compiler
- [ ] Update documentation
- [ ] Update tests
Tested on Ryzen 7 1700 (Zen 1) with 2 threads running on the same core:
Algorithm | Hashrate |
---|---|
RandomX | 926.4 h/s |
RandomX + CFROUND tweak | 1004.2 h/s |
RandomX v2 (CFROUND and AES tweaks) | 1003.8 h/s |
Summary for those who didn't read discussions on IRC:
- CFROUND tweak makes RandomX more efficient (8.4% hashrate increase on Zen 1, expected 5-10% hashrate increase on other AMD CPUs)
- AES tweak doubles the amount of AES computations per hash without hurting the hashrate (it uses the gap in RandomX main loop where CPU was sitting idle, waiting for scratchpad data)
- AES tweak also introduces AES in the main RandomX loop, which makes it harder for specialized hardware to get away with just a dedicated circuit for scratchpad initialization. AES must be implemented as a part of the RandomX VM and work with the RandomX VM's registers
- AES tweak also improves data entropy (makes it more random) before it's written to the scratchpad
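The contrast between the two mixing styles can be shown in portable C++. One function XORs a 128-bit F/E register pair, as v1 does. The other runs a single AES encryption round: ShiftRows, SubBytes, MixColumns, then XOR with a round key. That round is what x86 `AESENC` / `_mm_aesenc_si128` computes.

This is an illustration only. Using E as the round key and applying exactly one round are assumptions, not necessarily the PR's construction. The names `mix_xor` and `mix_aes` are hypothetical. To keep the code short, the S-box is computed from its FIPS-197 definition (GF(2^8) inverse plus affine map), whereas real soft-AES code uses precomputed tables.

```cpp
#include <array>
#include <cstdint>

// 128-bit AES state in column-major order (byte i = row i % 4, column i / 4).
using Block = std::array<std::uint8_t, 16>;

// Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1.
static std::uint8_t gmul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t p = 0;
    while (b != 0) {
        if (b & 1) p ^= a;
        a = static_cast<std::uint8_t>((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return p;
}

static std::uint8_t rotl8(std::uint8_t x, int n) {
    return static_cast<std::uint8_t>((x << n) | (x >> (8 - n)));
}

// AES S-box from its definition: multiplicative inverse, then affine transform.
std::uint8_t sbox(std::uint8_t x) {
    std::uint8_t inv = 1;
    for (int i = 0; i < 254; ++i) inv = gmul(inv, x);  // x^254 == x^-1; 0 maps to 0
    return static_cast<std::uint8_t>(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

// One AES encryption round: ShiftRows, SubBytes, MixColumns, AddRoundKey.
Block aes_round(const Block& in, const Block& key) {
    Block s{};
    for (int c = 0; c < 4; ++c)
        for (int r = 0; r < 4; ++r)
            s[4 * c + r] = sbox(in[4 * ((c + r) % 4) + r]);  // row r rotated left by r
    Block out{};
    for (int c = 0; c < 4; ++c) {
        const std::uint8_t* a = &s[4 * c];
        out[4 * c + 0] = static_cast<std::uint8_t>(gmul(a[0], 2) ^ gmul(a[1], 3) ^ a[2] ^ a[3] ^ key[4 * c + 0]);
        out[4 * c + 1] = static_cast<std::uint8_t>(a[0] ^ gmul(a[1], 2) ^ gmul(a[2], 3) ^ a[3] ^ key[4 * c + 1]);
        out[4 * c + 2] = static_cast<std::uint8_t>(a[0] ^ a[1] ^ gmul(a[2], 2) ^ gmul(a[3], 3) ^ key[4 * c + 2]);
        out[4 * c + 3] = static_cast<std::uint8_t>(gmul(a[0], 3) ^ a[1] ^ a[2] ^ gmul(a[3], 2) ^ key[4 * c + 3]);
    }
    return out;
}

// v1-style mixing of one F/E register pair: plain XOR (linear).
Block mix_xor(const Block& f, const Block& e) {
    Block r{};
    for (int i = 0; i < 16; ++i) r[i] = static_cast<std::uint8_t>(f[i] ^ e[i]);
    return r;
}

// Hypothetical v2-style mixing: one AES round over F, keyed with E.
Block mix_aes(const Block& f, const Block& e) {
    return aes_round(f, e);
}
```

XOR mixing of equal registers always cancels to zero. The AES round passes every byte through the nonlinear S-box first: for example, `mix_aes` of two all-zero blocks yields sixteen `0x63` bytes.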
RandomX V2 tests:

```sh
# Build and benchmark the current code (old / v1)
git clone https://github.com/tevador/RandomX.git
cd RandomX
mkdir build && cd build
cmake -DARCH=native ..
make
./randomx-benchmark --mine --jit --largePages --threads 2 --affinity 3 --init 16

# Pull this PR (new / v2), rebuild and benchmark again
cd ..
git pull origin pull/274/head
cd build
cmake -DARCH=native ..
make
./randomx-benchmark --mine --jit --largePages --threads 2 --affinity 3 --init 16
```
Threadripper 3970X:
Old: Performance: 1191.67 hashes per second
New: Performance: 1250.43 hashes per second

Ryzen 9 5900X:
Old: Performance: 1525.68 hashes per second
New: Performance: 1645.73 hashes per second

Ryzen 9 3900X:
Old: Performance: 1454.24 hashes per second
New: Performance: 1561.44 hashes per second

Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz:
Old: Performance: 375.845 hashes per second
New: Performance: 374.699 hashes per second

Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz:
Old: Performance: 1474.77 hashes per second
New: Performance: 1472.4 hashes per second
Ryzen 7 1700 in single thread mode: old 664.3 h/s, new 736.2 h/s.
Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (this time unthrottled):
Old: Performance: 1250.65 hashes per second
New: Performance: 1225.8 hashes per second

Single thread (`--mine --jit --largePages --threads 1 --affinity 1 --init 16`):
Old: Performance: 655.031 hashes per second
New: Performance: 641.192 hashes per second
Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz, single thread:
Old: Performance: 743.708 hashes per second
New: Performance: 739.415 hashes per second
Per @SChernykh's suggestion, I ran the tests 5 times and picked the highest result (i7-7700K):
Old: Performance: 747.852 hashes per second
New (v2): Performance: 746.367 hashes per second
I implemented software AES support in the JIT compiler. To test with software AES, the following line needs to be changed:
https://github.com/tevador/RandomX/blob/356b9ff22c81d04a37bcd75cc3b60e47db7c11cc/src/common.hpp#L125
Measured with Ryzen 7 3700X: `./randomx-benchmark --jit --verify --softAes --largePages`
Old: 15.2843 ms per hash
New: 17.117 ms per hash
(Ran 5x and took the lowest result.)
So it seems there is a 10-11% performance hit for soft AES systems when doing light verification.
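Because the light-verification benchmark reports milliseconds per hash rather than hashes per second, it helps to keep the two ways of expressing the slowdown apart. A minimal C++ sketch (the helper names are illustrative, not part of RandomX) uses the fact that hashrate is proportional to 1 / (time per hash):

```cpp
// Relative change in hashrate (hashes per second) between two measurements.
double hashrate_change(double old_hs, double new_hs) {
    return new_hs / old_hs - 1.0;
}

// Same comparison when the benchmark reports milliseconds per hash:
// hashrate is proportional to 1 / (time per hash).
double hashrate_change_from_ms(double old_ms, double new_ms) {
    return old_ms / new_ms - 1.0;
}

// Relative change in time per hash (latency), for comparison.
double time_per_hash_change(double old_ms, double new_ms) {
    return new_ms / old_ms - 1.0;
}
```

With the 3700X soft-AES numbers, `hashrate_change_from_ms(15.2843, 17.117)` is about -0.107, meaning 10.7% fewer hashes per second. By contrast, `time_per_hash_change(15.2843, 17.117)` is about +0.120, so each hash takes about 12% longer. The same helper applied to the Zen 1 table, `hashrate_change(926.4, 1004.2)`, gives about +0.084, the 8.4% quoted for the CFROUND tweak.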
Ryzen 9 7950X: `randomx-benchmark --mine --jit --largePages --threads 2 --affinity 3 --init 32`
Old: 1635 h/s
New: 1763 h/s
There is no measurable hashrate difference with and without the AES tweak.
@tevador Do you need help with aarch64? I can do it because I wrote that code originally, so I'm more familiar with it.
Yes, it would be great if you could do the changes in the ARM64 JIT. But please wait, I realized the JitCompiler interface needs to be changed because the class cannot be a template. I'm working on a solution that would not cause cascading changes to other classes and it's a bit tricky. But I think updating the ARM assembly code should be safe for you to do now.
Yes, I will only implement the CFROUND and AES changes for the A64 JIT compiler.
@tevador My WIP is here: https://github.com/SChernykh/RandomX/commits/v2 I think I found a solution for the JitCompiler problem you mentioned. I only have soft AES left to implement.
macOS ARM
v2: 445.702 hashes per second
v1: 424.601 hashes per second
@selsta can you run each test multiple times and take the highest number for v1 and v2? ARM CPUs never run at the same speed in most devices because of power saving.
I did run it multiple times. While there was some variation, v2 was always faster by around 15-20 h/s.
Hmm, that's interesting. So Apple silicon also gets a boost (but only 5%). Is it Apple M1 or M2?
M1 Pro (8 performance cores, 2 efficiency cores)
@tevador aarch64 is ready to be added: https://github.com/SChernykh/RandomX/tree/v2
@tevador I squashed my commits, you can just cherry-pick https://github.com/SChernykh/RandomX/commit/67d1340856f44130a950b47dd693c7c49907c167 into your PR.
I can't wait for the RandomX V2 :heart:
@tevador Do you plan to finish it soon? What is left to be done?
@tevador thank you for your work on the previous and this new version of RandomX!
We're working on a decentralized cloud and plan to use RandomX as a proof of CPU capacity for every core of a capacity provider. It looks like RandomX is the only existing ASIC- and GPU-resistant solution for this task. We want to launch our network in the near future and are somewhat dependent on this PR. Are there any time estimates for it? How stable is it now, and would you recommend using it, at least on x86?