kani icon indicating copy to clipboard operation
kani copied to clipboard

Out of memory with Kani 0.46.0 on an s2n-quic harness

Open zhassan-aws opened this issue 1 year ago • 11 comments

Kani v0.46.0 runs out of memory on the inet::checksum::tests::differential harness in s2n-quic.

This is an example run that runs out of memory:

https://github.com/aws/s2n-quic/actions/runs/7892365794/job/21538690964

This is an example passing run with Kani 0.45.0:

https://github.com/aws/s2n-quic/actions/runs/7837153634/job/21386277828

zhassan-aws avatar Feb 15 '24 17:02 zhassan-aws

I traced this back to this upstream change:

https://github.com/rust-lang/rust/pull/118578

Replacing all the split_at calls in https://github.com/aws/s2n-quic/blob/cf0b40ec213ba9437c8c4dfac43cca593b15e62f/quic/s2n-quic-core/src/inet/checksum.rs#L408 with split_at_unchecked (which requires using unsafe and adding #![feature(slice_split_at_unchecked)]), verification succeeds.

The upstream change effectively added an extra layer of Option<(&[T], &[T])> that split_at needs to unwrap (which it does through a match). This extra layer seems to result in a significant increase in the symex run time/unwinding. The CBMC stats with Kani 0.46.0 are:

Runtime Symex: 231.01s
size of program expression: 4084816 steps
slicing removed 3624753 assignments
Generated 94180 VCC(s), 36701 remaining after simplification
Runtime Postprocess Equation: 9.21824s
Passing problem to propositional reduction
converting SSA
Runtime Convert SSA: 23.0261s

whereas for a passing run with Kani 0.45.0, the stats are:

Runtime Symex: 11.9535s
size of program expression: 348664 steps
slicing removed 274660 assignments
Generated 20558 VCC(s), 6312 remaining after simplification
Runtime Postprocess Equation: 0.413516s
Passing problem to propositional reduction
converting SSA
Runtime Convert SSA: 1.28765s
Running propositional reduction
Post-processing
Runtime Post-process: 0.145403s

zhassan-aws avatar Feb 15 '24 17:02 zhassan-aws

@tautschnig can you take a look?

zhassan-aws avatar Feb 15 '24 17:02 zhassan-aws

Replacing all the split_at calls in https://github.com/aws/s2n-quic/blob/cf0b40ec213ba9437c8c4dfac43cca593b15e62f/quic/s2n-quic-core/src/inet/checksum.rs#L408 with split_at_unchecked (which requires using unsafe and adding #![feature(slice_split_at_unchecked)]), verification succeeds.

Actually, just replacing the split_at call in write_sized_generic with split_at_unchecked is sufficient to make verification succeed. This is the complete diff:

diff --git a/quic/s2n-quic-core/src/inet/checksum.rs b/quic/s2n-quic-core/src/inet/checksum.rs
index 551c88b9..9b0843ae 100644
--- a/quic/s2n-quic-core/src/inet/checksum.rs
+++ b/quic/s2n-quic-core/src/inet/checksum.rs
@@ -58,7 +58,7 @@ fn write_sized_generic<'a, const MAX_LEN: usize, const CHUNK_LEN: usize>(
     //# }
 
     while bytes.len() >= MAX_LEN {
-        let (chunks, remaining) = bytes.split_at(MAX_LEN);
+        let (chunks, remaining) = unsafe { bytes.split_at_unchecked(MAX_LEN) };
 
         bytes = remaining;
 
@@ -404,12 +404,12 @@ mod tests {
     // Reduce the length to 4 for Kani until
     // https://github.com/model-checking/kani/issues/3030 is fixed
     #[cfg(any(kani, miri))]
-    const LEN: usize = if cfg!(kani) { 4 } else { 32 };
+    const LEN: usize = if cfg!(kani) { 16 } else { 32 };
 
     /// * Compares the implementation to a port of the C code defined in the RFC
     /// * Ensures partial writes are correctly handled, even if they're not at a 16 bit boundary
     #[test]
-    #[cfg_attr(kani, kani::proof, kani::unwind(9), kani::solver(cadical))]
+    #[cfg_attr(kani, kani::proof, kani::unwind(17), kani::solver(cadical))]
     fn differential() {
         #[cfg(any(kani, miri))]
         type Bytes = crate::testing::InlineVec<u8, LEN>;
diff --git a/quic/s2n-quic-core/src/lib.rs b/quic/s2n-quic-core/src/lib.rs
index 4ec0e3cf..65971201 100644
--- a/quic/s2n-quic-core/src/lib.rs
+++ b/quic/s2n-quic-core/src/lib.rs
@@ -3,6 +3,7 @@
 
 #![cfg_attr(not(any(test, feature = "std")), no_std)]
 
+#![feature(slice_split_at_unchecked)]
 #[cfg(feature = "alloc")]
 extern crate alloc;

zhassan-aws avatar Feb 16 '24 01:02 zhassan-aws

What's weird is the line above the split_at checks to make sure that it's in bounds.

while bytes.len() >= MAX_LEN {

I guess I don't know how much optimizations we do in MIR or GOTO but I would expect the split_at to be equivalent to split_at_unchecked with that check. Even with --opt-level=1, all of the panics are removed with the LLVM backend: https://godbolt.org/z/rjcqn67Kc.

camshaft avatar Feb 16 '24 02:02 camshaft

It seems the main problem is indeed the bounds check performed by split_at, which in turn results in a number of further operations that are conditional on that. What we seemingly would need, and this is in line with what @camshaft said, is an interval abstract domain running alongside the analysis. We want to do this for loops anyway, so we might as well extend the use to also cover cases like this one.

tautschnig avatar Feb 16 '24 13:02 tautschnig

FYI, I’ve no idea what Kani is or what it does, but if you’re happy with nightly features, the proper way to do loop seen in the diff is:

    while let Some((chunks, remaining)) = bytes.split_at(MAX_LEN) {

Or if MAX_LEN is a constant you can use split_first_chunk:

    while let Some((chunks, remaining)) = bytes.split_first_chunk::<{ MAX_LEN }>() {

mina86 avatar Feb 20 '24 11:02 mina86

Replacing all the split_at calls in https://github.com/aws/s2n-quic/blob/cf0b40ec213ba9437c8c4dfac43cca593b15e62f/quic/s2n-quic-core/src/inet/checksum.rs#L408 with split_at_unchecked (which requires using unsafe and adding #![feature(slice_split_at_unchecked)]), verification succeeds.

Actually, just replacing the split_at call in write_sized_generic with split_at_unchecked is sufficient to make verification succeed. This is the complete diff:

diff --git a/quic/s2n-quic-core/src/inet/checksum.rs b/quic/s2n-quic-core/src/inet/checksum.rs
index 551c88b9..9b0843ae 100644
--- a/quic/s2n-quic-core/src/inet/checksum.rs
+++ b/quic/s2n-quic-core/src/inet/checksum.rs
@@ -58,7 +58,7 @@ fn write_sized_generic<'a, const MAX_LEN: usize, const CHUNK_LEN: usize>(
     //# }
 
     while bytes.len() >= MAX_LEN {
-        let (chunks, remaining) = bytes.split_at(MAX_LEN);
+        let (chunks, remaining) = unsafe { bytes.split_at_unchecked(MAX_LEN) };
 
         bytes = remaining;
 
@@ -404,12 +404,12 @@ mod tests {
     // Reduce the length to 4 for Kani until
     // https://github.com/model-checking/kani/issues/3030 is fixed
     #[cfg(any(kani, miri))]
-    const LEN: usize = if cfg!(kani) { 4 } else { 32 };
+    const LEN: usize = if cfg!(kani) { 16 } else { 32 };
 
     /// * Compares the implementation to a port of the C code defined in the RFC
     /// * Ensures partial writes are correctly handled, even if they're not at a 16 bit boundary
     #[test]
-    #[cfg_attr(kani, kani::proof, kani::unwind(9), kani::solver(cadical))]
+    #[cfg_attr(kani, kani::proof, kani::unwind(17), kani::solver(cadical))]
     fn differential() {
         #[cfg(any(kani, miri))]
         type Bytes = crate::testing::InlineVec<u8, LEN>;
diff --git a/quic/s2n-quic-core/src/lib.rs b/quic/s2n-quic-core/src/lib.rs
index 4ec0e3cf..65971201 100644
--- a/quic/s2n-quic-core/src/lib.rs
+++ b/quic/s2n-quic-core/src/lib.rs
@@ -3,6 +3,7 @@
 
 #![cfg_attr(not(any(test, feature = "std")), no_std)]
 
+#![feature(slice_split_at_unchecked)]
 #[cfg(feature = "alloc")]
 extern crate alloc;

Can we stub this for now? Also, why wasn't this caught in our regression?

celinval avatar Feb 20 '24 19:02 celinval

Can we stub this for now?

We worked around the issue in https://github.com/aws/s2n-quic/pull/2128. However, I wanted to capture this issue and see if we could return to the original length.

why wasn't this caught in our regression?

From my understanding, there are only a handful of s2n-quic harnesses running in the Kani CI. This harness was added ~8 months ago and isn't captured in that list. Even if the Kani CI was running all of the harnesses in s2n-quic, the submodule was quite old and didn't have this harness.

camshaft avatar Feb 20 '24 19:02 camshaft

Can we stub this for now?

That should be doable.

Also, why wasn't this caught in our regression?

We hadn't updated the s2n-quic submodule for some time, so this harness wasn't included in our regressions. We've now updated it.

zhassan-aws avatar Feb 20 '24 19:02 zhassan-aws

This might be happening again:

  • https://github.com/aws/s2n-quic/actions/runs/8742616782/job/23991319945?pr=2184
  • https://github.com/aws/s2n-quic/actions/runs/8742616782/job/23993046067?pr=2184#step:3:36574

WesleyRosenblum avatar Apr 18 '24 19:04 WesleyRosenblum

This might be happening again:

* https://github.com/aws/s2n-quic/actions/runs/8742616782/job/23991319945?pr=2184

* https://github.com/aws/s2n-quic/actions/runs/8742616782/job/23993046067?pr=2184#step:3:36574

This wouldn't be too surprising to me: in https://github.com/model-checking/kani/pull/3117#issuecomment-2036616715 I logged that we appeared to be very narrowly staying within the limits of available memory for this test.

tautschnig avatar Apr 18 '24 21:04 tautschnig