ring icon indicating copy to clipboard operation
ring copied to clipboard

'rsa::keypair::tests::test_rsakeypair_private_exponentiate' panicked at 'attempt to subtract with overflow', aarch64-pc-windows-msvc

Open briansmith opened this issue 7 months ago • 10 comments

commit 934ee03e67fb74322af9a805c2dd19893e461d4a, CI job "test (aarch64-pc-windows-msvc, --release, 1.66.0)" failed on the main branch.

failures:

---- rsa::keypair::tests::test_rsakeypair_private_exponentiate stdout ----
thread 'rsa::keypair::tests::test_rsakeypair_private_exponentiate' panicked at 'attempt to subtract with overflow', src\arithmetic\bigint\modulus.rs:155:38


failures:
    rsa::keypair::tests::test_rsakeypair_private_exponentiate

test result: FAILED. 92 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.69s

error: test failed, to rerun pass `-p ring --lib`
Error: Process completed with exit code 1.

briansmith avatar May 21 '25 22:05 briansmith

job-logs.txt

briansmith avatar May 21 '25 22:05 briansmith

This seems to happen consistently in CI. Not that it is Rust 1.66 but not Stable. Wouldn't be surprised if our RSA assembly code is violating ABI or similar. Need to find somebody to debug it.

briansmith avatar May 21 '25 22:05 briansmith

The panic points to this line failing: https://github.com/briansmith/ring/blob/934ee03e67fb74322af9a805c2dd19893e461d4a/src/arithmetic/bigint/modulus.rs#L155

This hasn't changed for a very long time, so I'm guessing it is either a codegen bug or longstanding UB in our code. We should debug it and find out what the values of r and lg_m are.

briansmith avatar May 21 '25 22:05 briansmith

A workaround, avoiding running the tests on aarch64-pc-windows-msvc 1.66 release, was put into CI in PR #2538.

briansmith avatar May 23 '25 16:05 briansmith

So far, this failure has only occurred in --release mode, and only for Rust 1.66 (not stable). And it is intermittent; on most CI runs the test fails, but on some runs the job passes. Also the only failure I've seen is on this specific test.

briansmith avatar May 23 '25 16:05 briansmith

This is easily reproducible using Rust 1.66 on the latest Windows 11 ARM64 using the latest Visual Studio using on my MacBook in Parallels:

while cargo +1.66 test --lib --release test_rsakeypair_private_exponentiate -- --nocapture ; do :; done

I did a bisect after export RUSTFLAGS="-C overflow-checks=on" (We turned on that flag in the Cargo release profile recently in 9a5d0a74614be3d00d74045691cd01baa8417b9c).

After 2a430d1982c7681777e79ee0086ee170f290d25f ("Remove lto=true in Cargo.toml profiles."), we get the "attempt to subtract with overflow" error. But before that commit, at 2a430d1982c7681777e79ee0086ee170f290d25f^ and earlier, the test fails with a different error:

thread 'rsa::keypair::tests::test_rsakeypair_private_exponentiate' panicked at 'called `Result::unwrap()` on an `Err` value: Unspecified', src\rsa\keypair.rs:679:85

The bisect narrowed it down to

commit bdce4fc0e7068e70e9172bda7dcaa63a1145ca94 (HEAD)
Author: Brian Smith <[email protected]>
Date:   Mon Feb 10 15:10:22 2025 -0800

    arithmetic: Reverse private exponent words during key loading.

    Instead of scanning the private exponent limbs backwards during
    exponentiation, reverse the limbs during key loading and then scan
    forward.

Indeed, if I revert that commit on the tip of the main branch, the problem fails to reproduce.

Next, I reset to the current tip of main (reverting the revert) and then tried to reproduce with Rust 1.65 and Rust 1.64. This requires tweaking Cargo.toml:

- rust-version = "1.66.0"
+ rust-version = "1.64.0"

Then, tweaking the STR accordingly, the STR reproduce the issue using 1.65:

while cargo +1.65 test --lib --release test_rsakeypair_private_exponent
iate -- --nocapture ; do :; done

But, the issue does NOT reproduce with 1.64:

while cargo +1.64 test --lib --release test_rsakeypair_private_exponent
iate -- --nocapture ; do :; done

And the issue does NOT reproduce with 1.67 either:

while cargo +1.67 test --lib --release test_rsakeypair_private_exponentiate -- --nocapture ; do :; done

I noticed that Rust 1.65 upgraded LLVM to LLVM 15 according to https://doc.rust-lang.org/beta/releases.html#version-1650-2022-11-03. However, Rust didn't upgrade to LLVM 16 until Rust 1.70 according to https://doc.rust-lang.org/beta/releases.html#version-1700-2023-06-01.

Unfortunately, simple steps to debug the issue are thwarted as, for example, making certain relevant functions #[inline(never)] or adding logging will cause the issue to stop reproducing. So the issue would need to be debugged in a debugger and/or by comparing disassemblies.

briansmith avatar May 30 '25 16:05 briansmith

I removed this from the 0.17.15 release milestone since it isn't a regression from 0.17.14.

briansmith avatar May 30 '25 16:05 briansmith

I removed this from the 0.17.15 release milestone since it isn't a regression from 0.17.14.

If I read the log correctly, the first release with the following commit was 0.17.9 and the last release without it was 0.17.8. However, I doubt that this commit is to blame.

commit bdce4fc0e7068e70e9172bda7dcaa63a1145ca94 (HEAD)
Author: Brian Smith <[email protected]>
Date:   Mon Feb 10 15:10:22 2025 -0800

    arithmetic: Reverse private exponent words during key loading.

    Instead of scanning the private exponent limbs backwards during
    exponentiation, reverse the limbs during key loading and then scan
    forward.

briansmith avatar May 30 '25 16:05 briansmith

Also, to clarify for others, until very recently, aarch64-pc-windows was BUILT in CI (on an x86-64 Windows runner) but the tests were not run. Only recently did we start running tests in CI for aarch64-pc-windows-msvc.

Previously, I was manually running the tests on an aarch64-pc-windows-msvc machine pre-release for the MSRV and latest stable toolchains when I found it prudent to do so. But previously MSRV was lower than 1.65 so we wouldn't have noticed it.

briansmith avatar May 30 '25 16:05 briansmith

A workaround, avoiding running the tests on aarch64-pc-windows-msvc 1.66 release, was put into CI in PR #2538.

PR #2550 undoes this workaround, replacing it with a minimally-scoped one that just skips this one test.

briansmith avatar May 30 '25 17:05 briansmith