s2n-tls Replace memcmp to s2n_constant_time

Replace memcmp to s2n_constant_time_equals

Open boquan-fang opened this issue 6 months ago • 2 comments

Resolved issues:

Solving issue #3062

Description of changes:

Change most of memcmp usages to s2n_constant_time_equals.

There are two parts that weren't change, which are in s2n_cipher_suits.c (line 1110) and s2n_config.c (line 323). Both of those parts use the return integer from memcmp which s2n_constant_time_equals can't do.

Regression Tests Results

We recently added new regression tests that lets us know the instruction count differences over some common codepaths.

--------------------------------------------------------------------------------
-- Summary
--------------------------------------------------------------------------------
Ir____________ 

1,707 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
-- File:function summary
--------------------------------------------------------------------------------
  Ir____________________  file:function

< 2,190 (128.3%, 128.3%)  /home/ubuntu/workspace/s2n-tls/bindings/rust/s2n-tls-sys/lib/utils/s2n_safety.c:s2n_constant_time_equals

<  -384 (-22.5%, 105.8%)  ./string/../sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S:__memcmp_avx2_movbe

<   -62  (-3.6%, 102.2%)  /rust/deps/hashbrown-0.14.5/src/raw/mod.rs:
    -58  (-3.4%)            hashbrown::map::HashMap<K,V,S,A>::insert
     -4  (-0.2%)            core::iter::adapters::try_process

<   -34  (-2.0%, 100.2%)  /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/../../stdarch/crates/core_arch/src/x86/sse2.rs:
    -30  (-1.8%)            hashbrown::map::HashMap<K,V,S,A>::insert
     -2  (-0.1%)            hashbrown::raw::RawTable<T,A>::reserve_rehash
     -2  (-0.1%)            core::iter::adapters::try_process

This shows us that there is about a 1,707 instruction count penalty for shifting to s2n_constant_time_equals. The main differences are the additional 2,190 instruction from s2n_constant_time_equals and the 384 instruction count reduction from memcmp. Relative to the 79,292,794 instructions for a complete RSA handshake, this is a 0.00215 % regression in handshake performance.

Note: The hashbrown instruction count differences are due to randomness between runs.

Handshake Benchmarks

We also have benchmarks covering handshake and throughput performance. Comparing these benchmarks between this PR and mainline we see the following differences.


handshake-rsa2048/s2n-tls
                        time:   [1.0697 ms 1.0702 ms 1.0707 ms]
                        change: [-0.8251% +0.1691% +1.2108%] (p = 0.78 > 0.05)
                        No change in performance detected.

handshake-rsa3072/s2n-tls
                        time:   [2.9331 ms 2.9364 ms 2.9396 ms]
                        change: [-0.1189% +0.0360% +0.1889%] (p = 0.65 > 0.05)
                        No change in performance detected.
                        
handshake-rsa4096/s2n-tls
                        time:   [6.2232 ms 6.2324 ms 6.2420 ms]
                        change: [-0.1699% +0.0467% +0.2659%] (p = 0.68 > 0.05)
                        No change in performance detected.

handshake-ecdsa384/s2n-tls
                        time:   [1.1160 ms 1.1162 ms 1.1164 ms]
                        change: [-0.1847% +0.0291% +0.2434%] (p = 0.81 > 0.05)
                        No change in performance detected.

handshake-ecdsa256/s2n-tls
                        time:   [398.52 µs 398.62 µs 398.74 µs]
                        change: [+0.3657% +0.7706% +1.1642%] (p = 0.00 < 0.05)
                        Change within noise threshold.

throughput-AES_128_GCM_SHA256/s2n-tls
                        time:   [108.28 µs 108.70 µs 109.14 µs]
                        thrpt:  [873.83 MiB/s 877.37 MiB/s 880.71 MiB/s]
                 change:
                        time:   [+0.8096% +1.3883% +1.9335%] (p = 0.00 < 0.05)
                        thrpt:  [-1.8968% -1.3693% -0.8031%]
                        Change within noise threshold.

throughput-AES_256_GCM_SHA384/s2n-tls
                        time:   [118.10 µs 118.55 µs 119.01 µs]
                        thrpt:  [801.35 MiB/s 804.47 MiB/s 807.49 MiB/s]
                 change:
                        time:   [-1.3449% -0.9600% -0.5646%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5678% +0.9693% +1.3632%]

As expected, we do not see any significant changes. While there are some slight differences reported, these are below the accuracy threshold of the benchmarks.

s2n_constant_time_equals vs memcmp Benchmarks

We ran criterion benchmarks comparing the performance of memcmp against s2n_constant_time_equals.

small data, not equal: how long does it take to compare an unequal blob of 255 bytes?
small data, equal: how long does it take to compare an equal blob of 255 bytes?

These cases give numbers for the common case of comparing small pieces of data.

many small blobs, not equal: how long does it take to compare 16 kB of 255 byte blobs.

This is a representation of a pathological case possible in places like s2n_protocol_preferences.c. Generally there would be only a small number of protocols, but it is possible for a client to send up to 16 kB of preferences, where each protocol has a maximum size of 255 bytes. This case gives us an understanding of the performance impact in this worst case scenario.

# 386 ps vs 130,280 ps
low-level-comparison - small data, not equal/memcmp
                        time:   [386.08 ps 386.10 ps 386.12 ps]

low-level-comparison - small data, not equal/s2n_constant_time_equals
                        time:   [130.28 ns 130.28 ns 130.29 ns]


# 386 ps vs 130,300 ps
low-level-comparison - small data, equal/memcmp
                        time:   [386.07 ps 386.09 ps 386.11 ps]

low-level-comparison - small data, equal/s2n_constant_time_equals_c
                        time:   [130.29 ns 130.30 ns 130.31 ns]


# 519 ns vs 30,593 ns
low-level-comparison - many small blobs, not equal/memcmp
                        time:   [519.23 ns 519.37 ns 519.50 ns]

low-level-comparison - many small blobs, not equal/s2n_constant_time_equals
                        time:   [30.592 µs 30.593 µs 30.595 µs]

We see that while s2n_constant_time_equals is significantly slower than memcmp, it remains a very fast function, and its cost is small relative to the cost of an entire handshake (120ns / 1_000_000ns => .0000012% ).

For pathological cases (comparing an entire 16kb extension) the s2n_const_time_equals would represent a noticeable portion of the handshake (30us / 1_000us => 3%). Therefore we avoid the usage of s2n_constant_time_equals in scenarios where largeish data (> 1kB) would be compared.

Call-outs:

Needs to make a note to those two parts that aren't changed.

Testing:

This change is a refactor change. The code ran and passed all test cases via cmake.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Aug 15 '24 17:08 boquan-fang

s2n-tls s2n-tls copied to clipboard

Replace memcmp to s2n_constant_time_equals

Resolved issues:

Description of changes:

Regression Tests Results

Handshake Benchmarks

s2n_constant_time_equals vs memcmp Benchmarks

Call-outs:

Testing:

s2n-tls
s2n-tls copied to clipboard