feat(meq): use simd to improve performance of meq
[!WARNING] This is an experimental PR and is not meant to be merged. Its purpose is to start a conversation about using SIMD once it lands in stable Rust, or about using a different library to access arch-specific SIMD instructions. In the zkVM benchmarks, large `meq`s cost a lot of execution cycles and proving time. This enhancement should improve things there.
Uses CPU intrinsics (NEON, AVX2, AVX-512F) to improve memory-compare performance. Measured improvements are up to 40% on NEON and 20-30% with AVX2.
The AVX2 and AVX-512F paths can probably be optimised further in how they handle masking.
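
For readers unfamiliar with the approach, here is a minimal sketch of an intrinsics-based equality check. It is not the code in this PR: the function names (`meq`, `meq_avx2`, `meq_scalar`) are hypothetical, only the AVX2 path is shown, and the tail is handled with a plain slice comparison rather than a masked load. It compares 32 bytes per iteration and falls back to scalar comparison when AVX2 is not available or on non-x86_64 targets.

```rust
/// Scalar reference: plain slice equality.
fn meq_scalar(a: &[u8], b: &[u8]) -> bool {
    a == b
}

/// Hypothetical AVX2 path: compares 32 bytes per iteration.
/// Caller must ensure the CPU supports AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn meq_avx2(a: &[u8], b: &[u8]) -> bool {
    use std::arch::x86_64::{
        __m256i, _mm256_cmpeq_epi8, _mm256_loadu_si256, _mm256_movemask_epi8,
    };

    if a.len() != b.len() {
        return false;
    }
    let chunks = a.len() / 32;
    for i in 0..chunks {
        let off = i * 32;
        // SAFETY: off + 32 <= len for both slices, and loadu allows unaligned reads.
        let all_equal = unsafe {
            let va = _mm256_loadu_si256(a.as_ptr().add(off) as *const __m256i);
            let vb = _mm256_loadu_si256(b.as_ptr().add(off) as *const __m256i);
            // Each equal byte yields 0xFF; movemask collects the top bit of every byte,
            // so 32 equal bytes give a mask with all 32 bits set (-1 as i32).
            _mm256_movemask_epi8(_mm256_cmpeq_epi8(va, vb)) == -1
        };
        if !all_equal {
            return false;
        }
    }
    // Remaining (< 32) bytes: scalar comparison instead of a masked load.
    a[chunks * 32..] == b[chunks * 32..]
}

/// Dispatches to the AVX2 path when available, otherwise to the scalar path.
pub fn meq(a: &[u8], b: &[u8]) -> bool {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was checked at runtime just above.
            return unsafe { meq_avx2(a, b) };
        }
    }
    meq_scalar(a, b)
}
```

Calling `meq(&a, &b)` on two byte slices returns the same result as `a == b`. The runtime feature check keeps the binary portable, at the cost of a branch per call.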
Benchmark results, using divan as the benchmarking library:

`meq_performance_divan_plain`

| meq_performance | fastest | slowest | median | mean | samples | iters |
|---|---|---|---|---|---|---|
| 100 | 40.31 ns | 61.66 µs | 41.31 ns | 928.8 ns | 100 | 100 |
| 2000 | 50.74 ns | 51.72 ns | 51.39 ns | 51.26 ns | 100 | 12800 |
| 4000 | 80.03 ns | 112.5 ns | 81.34 ns | 82.78 ns | 100 | 6400 |
| 8000 | 126.9 ns | 133.4 ns | 129.5 ns | 129.5 ns | 100 | 3200 |
| 16000 | 220.6 ns | 1.218 µs | 228.4 ns | 237.8 ns | 100 | 1600 |
| 32000 | 413.3 ns | 582.6 ns | 423.7 ns | 424.8 ns | 100 | 1600 |


`meq_performance_divan_optimized`

| meq_performance | fastest | slowest | median | mean | samples | iters |
|---|---|---|---|---|---|---|
| 100 | 40.17 ns | 57.95 µs | 41.17 ns | 872 ns | 100 | 100 |
| 2000 | 36.28 ns | 40.19 ns | 36.93 ns | 37.37 ns | 100 | 12800 |
| 4000 | 49.3 ns | 53.86 ns | 49.62 ns | 49.68 ns | 100 | 12800 |
| 8000 | 75.34 ns | 87.06 ns | 77.94 ns | 77.41 ns | 100 | 6400 |
| 16000 | 129.3 ns | 947.1 ns | 131.9 ns | 140.1 ns | 100 | 1600 |
| 32000 | 230.9 ns | 384.5 ns | 236.1 ns | 236.7 ns | 100 | 1600 |

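
To make the comparison easier to read, the relative reduction in median time can be computed directly from the two sets of results above. The sketch below only restates the reported medians; the helper names (`reduction_pct`, `median_reductions`) are hypothetical and not part of this PR.

```rust
/// Relative reduction in time, in percent, going from `plain_ns` to `optimized_ns`.
fn reduction_pct(plain_ns: f64, optimized_ns: f64) -> f64 {
    (plain_ns - optimized_ns) / plain_ns * 100.0
}

/// (benchmark argument, plain median in ns, optimized median in ns),
/// copied from the median columns of the results above.
const MEDIANS: [(u32, f64, f64); 6] = [
    (100, 41.31, 41.17),
    (2000, 51.39, 36.93),
    (4000, 81.34, 49.62),
    (8000, 129.5, 77.94),
    (16000, 228.4, 131.9),
    (32000, 423.7, 236.1),
];

/// Median-time reduction in percent for every benchmark argument.
fn median_reductions() -> Vec<(u32, f64)> {
    MEDIANS
        .iter()
        .map(|&(arg, plain, optimized)| (arg, reduction_pct(plain, optimized)))
        .collect()
}
```

On these numbers the smallest input shows essentially no change, while the larger inputs show reductions of roughly 28% (argument 2000) up to roughly 44% (argument 32000).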
Checklist
- [ ] Breaking changes are clearly marked as such in the PR description and changelog
- [ ] New behavior is reflected in tests
- [ ] If the performance characteristics of an instruction change, update gas costs as well or make a follow-up PR for that
- [ ] The specification matches the implemented behavior (link update PR if changes are needed)
Before requesting review
- [ ] I have reviewed the code myself
- [ ] I have created follow-up issues caused by this PR and linked them here
After merging, notify other teams
[Add or remove entries as needed]
- [ ] Rust SDK
- [ ] Sway compiler
- [ ] Platform documentation (for out-of-organization contributors, the person merging the PR will do this)
- [ ] Someone else?
@rymnc What do we want to do with this PR?
I'm happy to land it when SIMD becomes stable :)
Also, it is just meant to demonstrate how we can get better performance at the VM level for memory opcodes.