Add (opt-in) vDSO support for `getrandom`?
This Go issue can be used as a reference: https://github.com/golang/go/issues/69577
I'm not familiar with this feature yet, so I'd appreciate any guidance from anyone who is.
From a quick glance at the glibc patch, it has special handling in pthread_create and fork. If rustix has its own implementation, it's not immediately clear how rustix should handle this, since it isn't in charge of thread creation and cleanup.
One option would be for rustix to provide just a minimal wrapper around the vDSO function, rather than a full-service getrandom, and require the state-management code to live elsewhere. That's relatively easy to do. I imagine something like origin would be able to do the state management, because it can assume it's in charge of all thread and process creation; however, it's not clear that anything else could.
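To make that concrete, here's a rough sketch of what such a minimal wrapper could look like. All names and signatures below are illustrative assumptions, not an actual rustix API; the only fixed point is the kernel's vDSO entry, which takes a buffer, flags, and a caller-managed opaque state area:

```rust
// Illustrative sketch only; not an actual rustix API.
use std::io;

/// The kernel's vDSO getrandom entry point has this shape:
///   ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
///                      void *opaque_state, size_t opaque_len);
type VGetrandom = unsafe extern "C" fn(
    *mut u8,                // buffer
    usize,                  // len
    u32,                    // flags
    *mut core::ffi::c_void, // opaque_state (caller-managed)
    usize,                  // opaque_len
) -> isize;

/// Thin wrapper: rustix would only translate the return value.
/// Allocating and owning `state` (per thread, reset across fork) stays
/// with the caller, e.g. origin or the application runtime.
unsafe fn vgetrandom_raw(
    func: VGetrandom,
    buf: &mut [u8],
    flags: u32,
    state: *mut core::ffi::c_void,
    state_len: usize,
) -> io::Result<usize> {
    let ret = func(buf.as_mut_ptr(), buf.len(), flags, state, state_len);
    if ret < 0 {
        Err(io::Error::from_raw_os_error(-ret as i32))
    } else {
        Ok(ret as usize)
    }
}
```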
I have a PoC implementation at https://github.com/SchrodingerZhu/useless/tree/main/crates/vdso-rng.
I had really hoped to use rustix, but the crate doesn't expose low-level access to the vDSO and auxv, so I wrote the parser on my own.
Is it possible to provide such APIs?
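For reference, the auxv lookup itself is small; most of the work is parsing the vDSO's ELF image to find `__vdso_getrandom`. A minimal sketch of the first step, assuming the `libc` crate is available (a libc-free build would read the auxv from the initial stack instead):

```rust
/// Locate the vDSO mapping via the auxiliary vector.
/// AT_SYSINFO_EHDR holds the address of the vDSO's ELF header; from
/// there one walks the dynamic symbol table to find __vdso_getrandom.
fn vdso_base() -> Option<*const u8> {
    let base = unsafe { libc::getauxval(libc::AT_SYSINFO_EHDR) };
    if base == 0 {
        None
    } else {
        Some(base as *const u8)
    }
}
```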
> I have a PoC implementation at https://github.com/SchrodingerZhu/useless/tree/main/crates/vdso-rng.
Thanks for the work. I tested the performance of vDSO getrandom using your library, with some modifications to the benchmark code for a fair comparison[^1]. Here are the results[^2] on a Ryzen 7 5700G:
getrandom 4B,    flags=0: vdso 17.2ns, syscall 295ns, user-impl-with-reseed 8.9ns
getrandom 64KiB, flags=0: vdso 85.4μs, syscall 122μs, user-impl-with-reseed 25.4μs
The vDSO is ~17x faster than the syscall for tiny buffers and ~1.4x faster for relatively large buffers. But in all cases, the vDSO is still much slower than a userspace ChaCha20 implementation, even one that auto-reseeds via the getrandom syscall. The vDSO implementation is probably "more secure", since the kernel knows the best time to reseed, and that frequency may be higher than the user implementation's default of every 64KiB. But it's hard to tell the difference just by looking at the output.
IMO, it's good for rustix to support the vDSO, but it's probably not fast enough to be called frequently for regular, non-security-critical use. For infrequent reseeding of user RNGs, switching to the vDSO should make a negligible performance difference either way (25.4μs vs 25.1μs per 64KiB).
[^1]: The Linux kernel uses 20-round ChaCha, rather than the 12-round ChaCha that rand::rngs::ThreadRng uses by default. For the benchmark, I use rand_chacha::ChaCha20Core wrapped in a ReseedingRng that reseeds via the getrandom syscall, with the default reseeding period of 64KiB.
[^2]: I double-checked the perf report, and most of the time is indeed spent in the relevant call (the vDSO call). The library does not introduce noticeable overhead.
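For anyone who wants to reproduce the user-impl baseline from footnote 1, the setup is roughly the following (a sketch against the rand 0.8 / rand_chacha 0.3 API; the exact crate versions used in the benchmark are my assumption):

```rust
use rand::rngs::adapter::ReseedingRng;
use rand::rngs::OsRng;
use rand::{RngCore, SeedableRng};
use rand_chacha::ChaCha20Core;

fn main() {
    // ChaCha20 block core, reseeded from the OS (getrandom syscall)
    // after every 64 KiB of output, matching the default threshold.
    let core = ChaCha20Core::from_rng(OsRng).expect("failed to seed from OsRng");
    let mut rng = ReseedingRng::new(core, 1 << 16, OsRng);

    let mut buf = [0u8; 64];
    rng.fill_bytes(&mut buf);
    println!("{:02x?}", &buf[..8]);
}
```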
Some detail: Userspace reseeding has problems when you do a virtual machine fork; it can be hard for the generator to detect the fork, which can result in at least a short sequence of identical outputs. Presumably, the vDSO getrandom has a way to identify such events.
(I am personally not a security person.)
> Some detail: Userspace reseeding has problems when you do a virtual machine fork; it can be hard for the generator to detect the fork, which can result in at least a short sequence of identical outputs. Presumably, the vDSO getrandom has a way to identify such events.
Yeah, vDSO getrandom is still a good substitute for the getrandom syscall, if we're not considering user RNGs[^1]. For security-critical cases like key generation, the syscall is infrequent enough and not a bottleneck compared to other cryptographic operations, leaving the vDSO as a feature that's good to have but of little benefit.
[^1]: It's said to offer "the security of a kernel CSPRNG with the speed of a userspace CSPRNG", but the benchmark disagrees.
To add my own benchmarks (Ryzen 5800X, Linux 6.15.4-200.fc42.x86_64):
$ cargo bench --bench generators
# 1kiB reads
random_bytes/chacha20 time: [464.27 ns 465.36 ns 466.72 ns]
thrpt: [2.0434 GiB/s 2.0493 GiB/s 2.0541 GiB/s]
random_bytes/small time: [167.64 ns 168.09 ns 168.66 ns]
thrpt: [5.6544 GiB/s 5.6737 GiB/s 5.6889 GiB/s]
random_bytes/os time: [1.4163 µs 1.4175 µs 1.4189 µs]
thrpt: [688.26 MiB/s 688.91 MiB/s 689.54 MiB/s]
random_bytes/vdso time: [1.4227 µs 1.4234 µs 1.4243 µs]
thrpt: [685.64 MiB/s 686.06 MiB/s 686.42 MiB/s]
random_bytes/thread time: [330.50 ns 331.10 ns 331.87 ns]
thrpt: [2.8736 GiB/s 2.8803 GiB/s 2.8855 GiB/s]
# 4B reads
random_u32/chacha20 time: [1.7969 ns 1.7996 ns 1.8024 ns]
thrpt: [2.0668 GiB/s 2.0701 GiB/s 2.0732 GiB/s]
random_u32/small time: [635.97 ps 636.82 ps 637.73 ps]
thrpt: [5.8414 GiB/s 5.8498 GiB/s 5.8576 GiB/s]
random_u32/os time: [15.710 ns 15.755 ns 15.807 ns]
thrpt: [241.33 MiB/s 242.12 MiB/s 242.81 MiB/s]
random_u32/vdso time: [14.535 ns 14.546 ns 14.558 ns]
thrpt: [262.03 MiB/s 262.25 MiB/s 262.46 MiB/s]
random_u32/thread time: [1.2514 ns 1.2532 ns 1.2556 ns]
thrpt: [2.9668 GiB/s 2.9726 GiB/s 2.9770 GiB/s]
# 8B reads
random_u64/chacha20 time: [3.0436 ns 3.0462 ns 3.0489 ns]
thrpt: [2.4437 GiB/s 2.4459 GiB/s 2.4479 GiB/s]
random_u64/small time: [647.58 ps 648.51 ps 649.48 ps]
thrpt: [11.472 GiB/s 11.489 GiB/s 11.505 GiB/s]
random_u64/os time: [22.305 ns 22.325 ns 22.347 ns]
thrpt: [341.41 MiB/s 341.74 MiB/s 342.05 MiB/s]
random_u64/vdso time: [21.976 ns 21.988 ns 22.003 ns]
thrpt: [346.74 MiB/s 346.97 MiB/s 347.16 MiB/s]
random_u64/thread time: [1.9864 ns 1.9917 ns 1.9981 ns]
thrpt: [3.7289 GiB/s 3.7408 GiB/s 3.7507 GiB/s]
The vdso_rng::LocalState handle is stored in a local variable (not thread-local storage), so I think this is the best case for the vDSO.
First observation: the vDSO is essentially the same speed as getrandom (the os rows).
Second observation: @oxalica's results are similar to my chacha20 and vdso results, except for the 4B chacha20 reads (maybe better buffering in rand).
Conclusion: there isn't anything to gain over what getrandom already does, nor can vDSO come close to a local PRNG, even the same ChaCha20. Either way, it's still fast enough for many uses.
Code is here: https://github.com/rust-random/rand/compare/vdso.
> Conclusion: there isn't anything to gain over what getrandom already does, nor can vDSO come close to a local PRNG, even the same ChaCha20.
Well, that's because the getrandom crate uses libc::getrandom, which on modern systems goes through the vDSO. :) One of rustix's major use cases is libc-free code. If you want to achieve the same with getrandom, you would need to enable the opt-in linux_raw backend, which does raw syscalls without any vDSO support.
Or you can try musl targets. The following results were obtained on Zen 5 in a musl environment. Some are actually quite interesting:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 9950X 16-Core Processor
CPU family: 26
Model: 68
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 58%
CPU max MHz: 5756.4521
CPU min MHz: 624.1940
BogoMIPS: 8584.21
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze
For the benchmarks within the vDSO repo:
throughput/rand-fill-64KiB-vgetrandom
time: [108.74 µs 108.88 µs 109.03 µs]
throughput/rand-fill-64KiB-getrandom
time: [59.722 µs 59.784 µs 59.855 µs]
throughput/rand-fill-64KiB-rand-chacha20
time: [36.134 µs 36.187 µs 36.249 µs]
rayon/parallel-fill-vgetrandom
time: [14.120 ms 14.206 ms 14.294 ms]
rayon/parallel-fill-getrandom
time: [66.943 ms 67.135 ms 67.343 ms]
rayon/parallel-fill-rand-chacha20
time: [5.3143 ms 5.3474 ms 5.3803 ms]
random_bytes/chacha20 time: [534.56 ns 537.52 ns 540.58 ns]
thrpt: [1.7642 GiB/s 1.7742 GiB/s 1.7840 GiB/s]
random_bytes/std time: [343.80 ns 346.06 ns 348.34 ns]
thrpt: [2.7377 GiB/s 2.7558 GiB/s 2.7740 GiB/s]
random_bytes/small time: [123.65 ns 124.01 ns 124.39 ns]
thrpt: [7.6667 GiB/s 7.6904 GiB/s 7.7127 GiB/s]
random_bytes/os time: [1.1112 µs 1.1182 µs 1.1274 µs]
thrpt: [866.20 MiB/s 873.34 MiB/s 878.88 MiB/s]
random_bytes/vdso time: [1.8235 µs 1.8271 µs 1.8317 µs]
thrpt: [533.15 MiB/s 534.48 MiB/s 535.54 MiB/s]
random_bytes/thread time: [344.06 ns 345.99 ns 348.05 ns]
thrpt: [2.7401 GiB/s 2.7564 GiB/s 2.7718 GiB/s]
random_u32/chacha20 time: [2.0274 ns 2.0310 ns 2.0353 ns]
thrpt: [1.8303 GiB/s 1.8342 GiB/s 1.8375 GiB/s]
random_u32/std time: [1.2778 ns 1.2801 ns 1.2829 ns]
thrpt: [2.9037 GiB/s 2.9102 GiB/s 2.9154 GiB/s]
random_u32/small time: [505.34 ps 506.13 ps 507.10 ps]
thrpt: [7.3462 GiB/s 7.3604 GiB/s 7.3719 GiB/s]
random_u32/os time: [112.75 ns 112.89 ns 113.11 ns]
thrpt: [33.727 MiB/s 33.791 MiB/s 33.834 MiB/s]
random_u32/vdso time: [14.029 ns 14.043 ns 14.061 ns]
thrpt: [271.29 MiB/s 271.64 MiB/s 271.91 MiB/s]
random_u32/thread time: [1.2779 ns 1.2800 ns 1.2829 ns]
thrpt: [2.9038 GiB/s 2.9104 GiB/s 2.9152 GiB/s]
random_u64/chacha20 time: [3.6681 ns 3.6731 ns 3.6795 ns]
thrpt: [2.0249 GiB/s 2.0284 GiB/s 2.0312 GiB/s]
random_u64/std time: [2.1804 ns 2.1833 ns 2.1875 ns]
thrpt: [3.4060 GiB/s 3.4125 GiB/s 3.4171 GiB/s]
random_u64/small time: [470.33 ps 470.83 ps 471.51 ps]
thrpt: [15.802 GiB/s 15.824 GiB/s 15.841 GiB/s]
random_u64/os time: [113.19 ns 113.40 ns 113.69 ns]
thrpt: [67.107 MiB/s 67.278 MiB/s 67.403 MiB/s]
random_u64/vdso time: [23.850 ns 23.883 ns 23.925 ns]
thrpt: [318.89 MiB/s 319.45 MiB/s 319.90 MiB/s]
random_u64/thread time: [2.1982 ns 2.2010 ns 2.2046 ns]
thrpt: [3.3796 GiB/s 3.3851 GiB/s 3.3894 GiB/s]
I think that on Zen 5, filling long byte sequences inside the vDSO may be less optimized. However, in both test suites, the vDSO shows a significant performance gain over the raw syscall for short reads.