
Add (opt-in) vDSO support for `getrandom`?

Open newpavlov opened this issue 1 year ago • 8 comments

This Go issue can be used as a reference: https://github.com/golang/go/issues/69577

newpavlov avatar Oct 16 '24 07:10 newpavlov

I'm not familiar with this feature yet, so I'd appreciate any guidance from anyone who is.

From a quick glance at the glibc patch, it has special handling in pthread_create and fork. If rustix had its own implementation, it's not immediately clear how it should handle this, since rustix isn't in charge of thread creation and cleanup.

One option would be for rustix to just have a minimal wrapper around the vDSO function, rather than providing a full-service getrandom, and require the main state-management code to live elsewhere. That's relatively easy to do. I imagine something like origin would be able to do the state management, because it can assume it's in charge of all thread and process creation; however, it's not clear that anything else could.
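
For illustration, a rough sketch of what such a minimal, caller-managed wrapper could look like. The names here are hypothetical rather than a proposed rustix API; the signature and parameter struct follow the kernel's documented vgetrandom() interface, but should be checked against the current uapi headers before relying on them.

use core::ffi::c_void;

/// Parameters reported when the vDSO function is called in "query" mode
/// (see the kernel's vgetrandom() documentation).
#[repr(C)]
pub struct VgetrandomOpaqueParams {
    pub size_of_opaque_states: u32,
    pub mmap_prot: u32,
    pub mmap_flags: u32,
    pub reserved: [u32; 13],
}

/// Signature of the `__vdso_getrandom` symbol.
pub type VgetrandomFn = unsafe extern "C" fn(
    buf: *mut c_void,
    len: usize,
    flags: u32,
    opaque_state: *mut c_void,
    opaque_len: usize,
) -> isize;

/// Thin wrapper: the caller resolves `func` from the vDSO, owns one
/// `opaque_state` area per thread, and handles fork/thread teardown;
/// rustix would only perform the call itself.
pub unsafe fn vgetrandom(
    func: VgetrandomFn,
    buf: &mut [u8],
    flags: u32,
    opaque_state: *mut c_void,
    opaque_len: usize,
) -> isize {
    func(buf.as_mut_ptr().cast(), buf.len(), flags, opaque_state, opaque_len)
}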

sunfishcode avatar Oct 16 '24 13:10 sunfishcode

I have a PoC implementation at https://github.com/SchrodingerZhu/useless/tree/main/crates/vdso-rng.

I really hoped to use rustix, but the crate does not expose low-level access to the vDSO and auxv, so I wrote the parser on my own.

Is it possible to provide such APIs?
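
For context, the auxv half of that request is small: the vDSO base address comes from a single AT_SYSINFO_EHDR lookup, and the real work is parsing the vDSO's ELF dynamic symbol table to find `__vdso_getrandom` (elided here). A minimal sketch, going through the libc crate's getauxval purely for brevity; a libc-free implementation would read /proc/self/auxv or the initial stack instead:

/// Returns the base address of the vDSO ELF image, if the kernel provided one.
fn vdso_base() -> Option<*const u8> {
    // AT_SYSINFO_EHDR points at the ELF header of the vDSO mapping.
    let base = unsafe { libc::getauxval(libc::AT_SYSINFO_EHDR) };
    if base == 0 {
        None
    } else {
        Some(base as *const u8)
    }
}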

SchrodingerZhu avatar Jul 13 '25 15:07 SchrodingerZhu

I have a PoC implementation at https://github.com/SchrodingerZhu/useless/tree/main/crates/vdso-rng.

Thanks for the work. I tested the performance of vDSO getrandom using your library, but with some benchmark code modifications for a fair comparison[^1]. Here are the results[^2] on a Ryzen 7 5700G:

getrandom 4B, flags=0:    vdso 17.2ns, syscall 295ns, user-impl-with-reseed 8.9ns
getrandom 64KiB, flags=0: vdso 85.4μs, syscall 122μs, user-impl-with-reseed 25.4μs

The vDSO is ~16x faster than the syscall for tiny buffers and ~43% faster for relatively large buffers. But in all cases, the vDSO is still much slower than a user-space ChaCha20 implementation, even one that auto-reseeds via the getrandom syscall. The vDSO implementation is probably "more secure", since it knows the best time to reseed, and that frequency may be higher than the user implementation's default of every 64KiB. But it's hard to tell the difference just by looking at the output.

IMO, it's good for rustix to support the vDSO, but it's probably still not fast enough to be worth calling frequently for regular, non-security-critical use. For the infrequent reseeding of user RNGs, switching to the vDSO also makes a negligible performance difference (25.4μs vs 25.1μs per 64KiB).

[^1]: The Linux kernel uses 20-round ChaCha, while rand::rngs::ThreadRng defaults to 12-round ChaCha. For the benchmark I used rand_chacha::ChaCha20Core wrapped in a ReseedingRng that reseeds via the getrandom syscall, with the default reseeding period of 64KiB.
[^2]: I double-checked the perf report, and most of the time is indeed spent in the relevant call (the vDSO call). The library does not introduce noticeable overhead.
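
For reference, a minimal sketch of the user-space baseline described in footnote 1, assuming the rand 0.8 / rand_chacha 0.3 APIs (paths and constructors differ in rand 0.9):

use rand::rngs::adapter::ReseedingRng;
use rand::rngs::OsRng;
use rand::{RngCore, SeedableRng};
use rand_chacha::ChaCha20Core;

fn main() {
    // 20-round ChaCha core, reseeded from the OS (getrandom syscall) after
    // every 64 KiB of output, matching ThreadRng's default threshold.
    let core = ChaCha20Core::from_entropy();
    let mut rng = ReseedingRng::new(core, 64 * 1024, OsRng);

    let mut buf = vec![0u8; 64 * 1024];
    rng.fill_bytes(&mut buf);
}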

oxalica avatar Jul 14 '25 22:07 oxalica

One detail: userspace reseeding has problems when the virtual machine is forked; it can be hard for the generator to detect the fork, which can result in at least a short run of identical outputs. Presumably, the vDSO getrandom has a way to detect such events.

(I am personally not a security person.)

SchrodingerZhu avatar Jul 14 '25 23:07 SchrodingerZhu

One detail: userspace reseeding has problems when the virtual machine is forked; it can be hard for the generator to detect the fork, which can result in at least a short run of identical outputs. Presumably, the vDSO getrandom has a way to detect such events.

Yeah, vDSO getrandom is still a good substitute for the getrandom syscall if user RNGs are not being considered[^1]. For security-critical cases like key generation, the syscall is infrequent enough that it isn't a bottleneck compared with the other cryptographic operations, which makes the vDSO a nice-to-have feature with little practical benefit.

[^1]: It is advertised as offering "the security of a kernel CSPRNG with the speed of a userspace CSPRNG", but the benchmark disagrees.

oxalica avatar Jul 15 '25 01:07 oxalica

To add my own benchmarks (Ryzen 5800X, Linux 6.15.4-200.fc42.x86_64):

$ cargo bench --bench generators

# 1kiB reads
random_bytes/chacha20   time:   [464.27 ns 465.36 ns 466.72 ns]
                        thrpt:  [2.0434 GiB/s 2.0493 GiB/s 2.0541 GiB/s]
random_bytes/small      time:   [167.64 ns 168.09 ns 168.66 ns]
                        thrpt:  [5.6544 GiB/s 5.6737 GiB/s 5.6889 GiB/s]
random_bytes/os         time:   [1.4163 µs 1.4175 µs 1.4189 µs]
                        thrpt:  [688.26 MiB/s 688.91 MiB/s 689.54 MiB/s]
random_bytes/vdso       time:   [1.4227 µs 1.4234 µs 1.4243 µs]
                        thrpt:  [685.64 MiB/s 686.06 MiB/s 686.42 MiB/s]
random_bytes/thread     time:   [330.50 ns 331.10 ns 331.87 ns]
                        thrpt:  [2.8736 GiB/s 2.8803 GiB/s 2.8855 GiB/s]

# 4B reads
random_u32/chacha20     time:   [1.7969 ns 1.7996 ns 1.8024 ns]
                        thrpt:  [2.0668 GiB/s 2.0701 GiB/s 2.0732 GiB/s]
random_u32/small        time:   [635.97 ps 636.82 ps 637.73 ps]
                        thrpt:  [5.8414 GiB/s 5.8498 GiB/s 5.8576 GiB/s]
random_u32/os           time:   [15.710 ns 15.755 ns 15.807 ns]
                        thrpt:  [241.33 MiB/s 242.12 MiB/s 242.81 MiB/s]
random_u32/vdso         time:   [14.535 ns 14.546 ns 14.558 ns]
                        thrpt:  [262.03 MiB/s 262.25 MiB/s 262.46 MiB/s]
random_u32/thread       time:   [1.2514 ns 1.2532 ns 1.2556 ns]
                        thrpt:  [2.9668 GiB/s 2.9726 GiB/s 2.9770 GiB/s]

# 8B reads
random_u64/chacha20     time:   [3.0436 ns 3.0462 ns 3.0489 ns]
                        thrpt:  [2.4437 GiB/s 2.4459 GiB/s 2.4479 GiB/s]
random_u64/small        time:   [647.58 ps 648.51 ps 649.48 ps]
                        thrpt:  [11.472 GiB/s 11.489 GiB/s 11.505 GiB/s]
random_u64/os           time:   [22.305 ns 22.325 ns 22.347 ns]
                        thrpt:  [341.41 MiB/s 341.74 MiB/s 342.05 MiB/s]
random_u64/vdso         time:   [21.976 ns 21.988 ns 22.003 ns]
                        thrpt:  [346.74 MiB/s 346.97 MiB/s 347.16 MiB/s]
random_u64/thread       time:   [1.9864 ns 1.9917 ns 1.9981 ns]
                        thrpt:  [3.7289 GiB/s 3.7408 GiB/s 3.7507 GiB/s]

The vdso_rng::LocalState handle is stored in local (not thread-local) storage, so I think this is the best case for vDSO.

First observation: the vDSO is essentially the same speed as getrandom (the os rows above).

Second observation: @oxalica's results are similar to my chacha20 and vdso results, except for the 4B chacha20 reads (maybe better buffering in rand).

Conclusion: there isn't anything to gain over what getrandom already does, nor can vDSO come close to a local PRNG, even the same ChaCha20. Either way, it's still fast enough for many uses.

Code is here: https://github.com/rust-random/rand/compare/vdso.

dhardy avatar Jul 23 '25 14:07 dhardy

Conclusion: there isn't anything to gain over what getrandom already does, nor can vDSO come close to a local PRNG, even the same ChaCha20.

Well, that's because the getrandom crate uses libc::getrandom, which on modern systems uses the vDSO. :) One of rustix's major use cases is libc-free code. If you want to achieve the same with getrandom, you would need to enable the opt-in linux_raw backend, which does raw syscalls without any vDSO support.
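
For anyone wanting to reproduce the raw-syscall path with the getrandom crate: assuming the opt-in backend mechanism of the 0.3 releases, the backend is selected at build time with a cfg flag rather than a Cargo feature, roughly

RUSTFLAGS='--cfg getrandom_backend="linux_raw"' cargo bench

(check the getrandom documentation for the exact backend names and their requirements).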

newpavlov avatar Jul 23 '25 14:07 newpavlov

Or you can try musl targets. The following results were obtained on Zen 5 in a musl environment; some of them are actually quite interesting:

Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      32
  On-line CPU(s) list:       0-31
Vendor ID:                   AuthenticAMD
  Model name:                AMD Ryzen 9 9950X 16-Core Processor
    CPU family:              26
    Model:                   68
    Thread(s) per core:      2
    Core(s) per socket:      16
    Socket(s):               1
    Stepping:                0
    Frequency boost:         enabled
    CPU(s) scaling MHz:      58%
    CPU max MHz:             5756.4521
    CPU min MHz:             624.1940
    BogoMIPS:                8584.21
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
                             fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl
                             pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic
                             cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc
                             mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2
                             smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw
                             avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16
                             clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid
                             decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni
                             vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor
                             smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze

For the benchmarks within the vDSO repo:

throughput/rand-fill-64KiB-vgetrandom
                        time:   [108.74 µs 108.88 µs 109.03 µs]
throughput/rand-fill-64KiB-getrandom
                        time:   [59.722 µs 59.784 µs 59.855 µs]
throughput/rand-fill-64KiB-rand-chacha20
                        time:   [36.134 µs 36.187 µs 36.249 µs]

rayon/parallel-fill-vgetrandom
                        time:   [14.120 ms 14.206 ms 14.294 ms]
rayon/parallel-fill-getrandom
                        time:   [66.943 ms 67.135 ms 67.343 ms]
rayon/parallel-fill-rand-chacha20
                        time:   [5.3143 ms 5.3474 ms 5.3803 ms]

random_bytes/chacha20   time:   [534.56 ns 537.52 ns 540.58 ns]
                        thrpt:  [1.7642 GiB/s 1.7742 GiB/s 1.7840 GiB/s]
random_bytes/std        time:   [343.80 ns 346.06 ns 348.34 ns]
                        thrpt:  [2.7377 GiB/s 2.7558 GiB/s 2.7740 GiB/s]
random_bytes/small      time:   [123.65 ns 124.01 ns 124.39 ns]
                        thrpt:  [7.6667 GiB/s 7.6904 GiB/s 7.7127 GiB/s]
random_bytes/os         time:   [1.1112 µs 1.1182 µs 1.1274 µs]
                        thrpt:  [866.20 MiB/s 873.34 MiB/s 878.88 MiB/s]
random_bytes/vdso       time:   [1.8235 µs 1.8271 µs 1.8317 µs]
                        thrpt:  [533.15 MiB/s 534.48 MiB/s 535.54 MiB/s]
random_bytes/thread     time:   [344.06 ns 345.99 ns 348.05 ns]
                        thrpt:  [2.7401 GiB/s 2.7564 GiB/s 2.7718 GiB/s]

random_u32/chacha20     time:   [2.0274 ns 2.0310 ns 2.0353 ns]
                        thrpt:  [1.8303 GiB/s 1.8342 GiB/s 1.8375 GiB/s]
random_u32/std          time:   [1.2778 ns 1.2801 ns 1.2829 ns]
                        thrpt:  [2.9037 GiB/s 2.9102 GiB/s 2.9154 GiB/s]
random_u32/small        time:   [505.34 ps 506.13 ps 507.10 ps]
                        thrpt:  [7.3462 GiB/s 7.3604 GiB/s 7.3719 GiB/s]
random_u32/os           time:   [112.75 ns 112.89 ns 113.11 ns]
                        thrpt:  [33.727 MiB/s 33.791 MiB/s 33.834 MiB/s]
random_u32/vdso         time:   [14.029 ns 14.043 ns 14.061 ns]
                        thrpt:  [271.29 MiB/s 271.64 MiB/s 271.91 MiB/s]
random_u32/thread       time:   [1.2779 ns 1.2800 ns 1.2829 ns]
                        thrpt:  [2.9038 GiB/s 2.9104 GiB/s 2.9152 GiB/s]

random_u64/chacha20     time:   [3.6681 ns 3.6731 ns 3.6795 ns]
                        thrpt:  [2.0249 GiB/s 2.0284 GiB/s 2.0312 GiB/s]
random_u64/std          time:   [2.1804 ns 2.1833 ns 2.1875 ns]
                        thrpt:  [3.4060 GiB/s 3.4125 GiB/s 3.4171 GiB/s]
random_u64/small        time:   [470.33 ps 470.83 ps 471.51 ps]
                        thrpt:  [15.802 GiB/s 15.824 GiB/s 15.841 GiB/s]
random_u64/os           time:   [113.19 ns 113.40 ns 113.69 ns]
                        thrpt:  [67.107 MiB/s 67.278 MiB/s 67.403 MiB/s]
random_u64/vdso         time:   [23.850 ns 23.883 ns 23.925 ns]
                        thrpt:  [318.89 MiB/s 319.45 MiB/s 319.90 MiB/s]
random_u64/thread       time:   [2.1982 ns 2.2010 ns 2.2046 ns]
                        thrpt:  [3.3796 GiB/s 3.3851 GiB/s 3.3894 GiB/s]

I think that on Zen 5, filling long byte sequences inside the vDSO may be less optimized. However, in both test suites, the vDSO shows a significant performance gain over the raw syscall for short reads.

SchrodingerZhu avatar Jul 23 '25 16:07 SchrodingerZhu