kani icon indicating copy to clipboard operation
kani copied to clipboard

hangs on string.repeat() ?

Open matthiaskrgr opened this issue 2 years ago • 16 comments

I tried this code:

#[kani::proof]  fn main() {
	
	let x = String::new().repeat(1);
	let z = x.chars().nth(1).unwrap();
}

using the following command line invocation:

RUSTC_WRAPPER="" cargo kani --harness  main --tests

with Kani version: 0.22.0

I expected to see this happen: explanation It looks like there was some kind of hang..? kani/cmbc was crunching on that for more than 5 minutes until I cancelled it. If I remove the repeat(1) , it finishes within a second :thinking:

matthiaskrgr avatar Feb 25 '23 00:02 matthiaskrgr

Hi @matthiaskrgr. Kani currently performs poorly on programs involving strings (e.g. see #2001 and #2015). The issue here seems to be a combination of using repeat and chars because this program that involves repeat verifies quickly:

#[kani::proof]
fn main() {
    let x = String::new().repeat(1);
    assert!(x.is_empty());
}

SUMMARY:
 ** 0 of 322 failed (9 unreachable)

VERIFICATION:- SUCCESSFUL
Verification Time: 0.68734515s

The chars method in particular deals with the UTF-8 encoding, which seems difficult to reason about. If it's possible to express properties in term of bytes, Kani may perform better. For instance, this program is verified in ~1 second:

#[kani::proof]
fn main() {
    let x = String::from("abc").repeat(5);
    let bytes = x.as_bytes();
    assert_eq!(bytes.len(), 15);
}

SUMMARY:
 ** 0 of 322 failed (3 unreachable)

VERIFICATION:- SUCCESSFUL
Verification Time: 1.2358496s

zhassan-aws avatar Feb 25 '23 00:02 zhassan-aws

CC @tautschnig. This doesn't seem to get to the SAT solver stage after running for 20 minutes.

zhassan-aws avatar Feb 25 '23 00:02 zhassan-aws

This is the CBMC call stack after running for an hour:

#270 0x0000556fa14997f3 in irept::compare(irept const&) const ()
#271 0x0000556fa14997f3 in irept::compare(irept const&) const ()
#272 0x0000556fa14997f3 in irept::compare(irept const&) const ()
#273 0x0000556fa14997f3 in irept::compare(irept const&) const ()
#274 0x0000556fa14997f3 in irept::compare(irept const&) const ()
#275 0x0000556fa14997f3 in irept::compare(irept const&) const ()
#276 0x0000556fa149a08d in irept::operator<(irept const&) const ()
#277 0x0000556fa0bad77e in std::_Rb_tree<exprt, exprt, std::_Identity<exprt>, std::less<exprt>, std::allocator<exprt> >::_M_get_insert_unique_pos(exprt const&) ()
#278 0x0000556fa1172a9a in arrayst::update_index_map(unsigned long) ()
#279 0x0000556fa1172c6c in arrayst::update_index_map(bool) ()
#280 0x0000556fa117512f in arrayst::add_array_constraints() ()
#281 0x0000556fa11c7b0f in bv_pointerst::finish_eager_conversion() ()
#282 0x0000556fa11e84dd in prop_conv_solvert::dec_solve() ()
#283 0x0000556fa0cb707c in run_property_decider(incremental_goto_checkert::resultt&, std::map<dstringt, property_infot, std::less<dstringt>, std::allocator<std::pair<dstringt const, property_infot> > >&, goto_symex_property_decidert&, ui_message_handlert&, std::chrono::duration<double, std::ratio<1l, 1l> >, bool) ()
#284 0x0000556fa0cc84f9 in multi_path_symex_checkert::operator()(std::map<dstringt, property_infot, std::less<dstringt>, std::allocator<std::pair<dstringt const, property_infot> > >&) ()
#285 0x0000556fa0b6f299 in all_properties_verifier_with_trace_storaget<multi_path_symex_checkert>::operator()() ()
#286 0x0000556fa0b6b71e in cbmc_parse_optionst::doit() ()
#287 0x0000556fa0b62f9f in parse_options_baset::main() ()
#288 0x0000556fa0b4ebe9 in main ()

zhassan-aws avatar Feb 25 '23 01:02 zhassan-aws

See #2302: this is stuck in applying the array theory. Work-in-progress on the CBMC side.

tautschnig avatar Aug 21 '23 14:08 tautschnig

@tautschnig are the arrays that the array theory is applied to within the String library? Just trying to understand the problem a bit more.

zhassan-aws avatar Aug 21 '23 16:08 zhassan-aws

@tautschnig - what are the array theory issues?

rahulku avatar Sep 22 '23 19:09 rahulku

@tautschnig are the arrays that the array theory is applied to within the String library? Just trying to understand the problem a bit more. @tautschnig - what are the array theory issues?

The problem with CBMC's array theory is that it does an eager encoding of read-over-write lemmas for all pairs of read/writes that are over the same base array, so it has a quadratic blowup in the worst case where all read/write pairs are over the same base array. CBMC's algorithm builds a weak equivalence graph (cf. https://arxiv.org/pdf/1405.6939) for all arrays used in the program and uses the graph to drive lemma instantiation.

Possible optimizations would be:

  • extend symex with interval propagation and prune (read, write) pairs where the range of read indices is disjoint from the range of written indices;
  • make the lemma instantiation incremental, i.e. don't generate all lemmas ahead of time but only when needed to block a spurious counter example (which @tautschnig has been experimenting with);
  • use cargo kani --enable-unstable --cbmc-args --bitwuzla for cases like this one

remi-delmas-3000 avatar Mar 31 '25 17:03 remi-delmas-3000

Thanks @remi-delmas-3000. It's still unclear to me which array is causing a quadratic blowup. The strings in the examples above are either empty or consist of 3-4 bytes. Also, there is no non-determinism involved. My interpretation of a blowup is that as we increase the size of the string, the expression size is going to grow quadratically. But here, there is no size that we're trying to increase in this example. Is my interpretation incorrect?

zhassan-aws avatar Mar 31 '25 17:03 zhassan-aws

I was just giving general knowledge about CBMC's array theory, running Kani 0.59.0 on the first example I get a result in less than a second

SUMMARY:
 ** 1 of 748 failed (31 unreachable)
Failed Checks: called `Option::unwrap()` on a `None` value
 File: ".rustup/toolchains/nightly-2025-03-02-x86_64-apple-darwin/lib/rustlib/src/rust/library/core/src/option.rs", line 2015, in std::option::unwrap_failed

VERIFICATION:- FAILED
Verification Time: 0.6307878s

remi-delmas-3000 avatar Mar 31 '25 20:03 remi-delmas-3000

This one still blows up with the internal array theory solver but passes using --enable-unstable --cbmc-args --bitwuzla:

#[kani::proof]
fn proof_repeat() {
    let s = String::from("a").repeat(1);
    assert_eq!(s.chars().nth(0).unwrap(), 'a');
}
  • String::from triggers a dynamic allocation and a copy from the raw string literal to the heap allocated string,
  • repeat creates one more heap allocated copy, and moves contents in a somewhat complicated way
    #[rustc_allow_incoherent_impl]
    #[cfg(not(no_global_oom_handling))]
    #[stable(feature = "repeat_generic_slice", since = "1.40.0")]
    pub fn repeat(&self, n: usize) -> Vec<T>
    where
        T: Copy,
    {
        if n == 0 {
            return Vec::new();
        }

        // If `n` is larger than zero, it can be split as
        // `n = 2^expn + rem (2^expn > rem, expn >= 0, rem >= 0)`.
        // `2^expn` is the number represented by the leftmost '1' bit of `n`,
        // and `rem` is the remaining part of `n`.

        // Using `Vec` to access `set_len()`.
        let capacity = self.len().checked_mul(n).expect("capacity overflow");
        let mut buf = Vec::with_capacity(capacity);

        // `2^expn` repetition is done by doubling `buf` `expn`-times.
        buf.extend(self);
        {
            let mut m = n >> 1;
            // If `m > 0`, there are remaining bits up to the leftmost '1'.
            while m > 0 {
                // `buf.extend(buf)`:
                unsafe {
                    ptr::copy_nonoverlapping::<T>(
                        buf.as_ptr(),
                        (buf.as_mut_ptr()).add(buf.len()),
                        buf.len(),
                    );
                    // `buf` has capacity of `self.len() * n`.
                    let buf_len = buf.len();
                    buf.set_len(buf_len * 2);
                }

                m >>= 1;
            }
        }

        // `rem` (`= n - 2^expn`) repetition is done by copying
        // first `rem` repetitions from `buf` itself.
        let rem_len = capacity - buf.len(); // `self.len() * rem`
        if rem_len > 0 {
            // `buf.extend(buf[0 .. rem_len])`:
            unsafe {
                // This is non-overlapping since `2^expn > rem`.
                ptr::copy_nonoverlapping::<T>(
                    buf.as_ptr(),
                    (buf.as_mut_ptr()).add(buf.len()),
                    rem_len,
                );
                // `buf.len() + rem_len` equals to `buf.capacity()` (`= self.len() * n`).
                buf.set_len(capacity);
            }
        }
        buf
    }
  • the chars() method creates an iterator and the nth(i) method pulls characters one by one by parsing UTF codepoints one by one in a loop.
fn next(&mut self) -> Option<char> {
    // SAFETY: `str` invariant says `self.iter` is a valid UTF-8 string and
    // the resulting `ch` is a valid Unicode Scalar Value.
    unsafe { next_code_point(&mut self.iter).map(|ch| char::from_u32_unchecked(ch)) }
}
pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
    // Decode UTF-8
    let x = *bytes.next()?;
    if x < 128 {
        return Some(x as u32);
    }

    // Multibyte case follows
    // Decode from a byte combination out of: [[[x y] z] w]
    // NOTE: Performance is sensitive to the exact formulation here
    let init = utf8_first_byte(x, 2);
    // SAFETY: `bytes` produces an UTF-8-like string,
    // so the iterator must produce a value here.
    let y = unsafe { *bytes.next().unwrap_unchecked() };
    let mut ch = utf8_acc_cont_byte(init, y);
    if x >= 0xE0 {
        // [[x y z] w] case
        // 5th bit in 0xE0 .. 0xEF is always clear, so `init` is still valid
        // SAFETY: `bytes` produces an UTF-8-like string,
        // so the iterator must produce a value here.
        let z = unsafe { *bytes.next().unwrap_unchecked() };
        let y_z = utf8_acc_cont_byte((y & CONT_MASK) as u32, z);
        ch = init << 12 | y_z;
        if x >= 0xF0 {
            // [x y z w] case
            // use only the lower 3 bits of `init`
            // SAFETY: `bytes` produces an UTF-8-like string,
            // so the iterator must produce a value here.
            let w = unsafe { *bytes.next().unwrap_unchecked() };
            ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);
        }
    }

    Some(ch)
}

So there's at least couple of dynamically allocated arrays involved that get copied over using memcopy, which can yield a lot of elementwise array constraints.

It looks like everything is pretty much constant, so it could be that constant propagation stops working for some unfortunate reason at some intermediate point.

remi-delmas-3000 avatar Mar 31 '25 21:03 remi-delmas-3000

I see. Thanks for digging into this. So, if I understand correctly, this is likely not an array-theory issue, but rather a combination of complex code and the failure of constant propagation to simplify the expressions?

zhassan-aws avatar Apr 01 '25 00:04 zhassan-aws

There are improvements in CBMC's simplifier that I will create PRs for, but it also seems that

let offset_bytes = ptr1.addr().wrapping_sub(ptr2.addr()) as isize;

from Kani's model of ptr_offset_from creates an expression that we would have to write a rather peculiar simplification rule for. I'd like to review this on the Kani side and rewrite the above using pointer_offset expressions.

tautschnig avatar Apr 16 '25 12:04 tautschnig

There are improvements in CBMC's simplifier that I will create PRs for, but it also seems that

let offset_bytes = ptr1.addr().wrapping_sub(ptr2.addr()) as isize;

from Kani's model of ptr_offset_from creates an expression that we would have to write a rather peculiar simplification rule for. I'd like to review this on the Kani side and rewrite the above using pointer_offset expressions.

The fix for the above is

--- a/library/kani_core/src/models.rs
+++ b/library/kani_core/src/models.rs
@@ -78,7 +78,7 @@ pub unsafe fn ptr_offset_from<T>(ptr1: *const T, ptr2: *const T) -> isize {
                         "Offset result and original pointer should point to the same allocation",
                     );
                     // The offset must fit in isize since this represents the same allocation.
-                    let offset_bytes = ptr1.addr().wrapping_sub(ptr2.addr()) as isize;
+                    let offset_bytes = kani::mem::pointer_offset(ptr1).wrapping_sub(kani::mem::pointer_offset(ptr2)) as isize;
                     let t_size = size_of::<T>() as isize;
                     kani::safety_check(
                         offset_bytes % t_size == 0,

but I don't know why, ever since #3323, we'd require -Z ghost-state to use kani::mem::pointer_offset.

tautschnig avatar Apr 22 '25 12:04 tautschnig

How is ptr.addr() codegen'ed? Perhaps we can implement addr using mem::pointer_offset?

zhassan-aws avatar Apr 23 '25 03:04 zhassan-aws

#3323 adds a feature gate for pointer_object and pointer_offset for library/kani/src/mem.rs but the same functions library/kani_core/src/mem.rs are just pub(crate).

Could we make them pub(crate) as well in library/kani/src/mem.rs ?

remi-delmas-3000 avatar Apr 23 '25 13:04 remi-delmas-3000

Yes

zhassan-aws avatar Apr 23 '25 15:04 zhassan-aws