miri
miri copied to clipboard
Miri builds fail nondeterministically
On https://github.com/google/zerocopy/pull/506, this job succeeded while this job failed. Both jobs are running the same code on the same commit, and ran within a few minutes of one another using a pinned version of the Rust toolchain (nightly-2023-10-15). Here's the error output from the failing job:
Error output
Preparing a sysroot for Miri (target: aarch64-unknown-linux-gnu)... done
Compiling libc v0.2.149
Compiling cfg-if v1.0.0
Compiling rustversion v1.0.14
Compiling ppv-lite86 v0.2.17
Compiling either v1.9.0
Compiling itertools v0.11.0
Compiling getrandom v0.2.10
Compiling rand_core v0.6.4
Compiling rand_chacha v0.3.1
Compiling rand v0.8.5
Compiling testutil v0.0.0 (/home/runner/work/zerocopy/zerocopy/testutil)
error[E0463]: can't find crate for `cargo_metadata`
--> testutil/src/lib.rs:5:5
|
5 | use cargo_metadata::MetadataCommand;
| ^^^^^^^^^^^^^^ can't find crate
error[E0463]: can't find crate for `rustc_version`
--> testutil/src/lib.rs:6:5
|
6 | use rustc_version::{Channel, Version};
| ^^^^^^^^^^^^^ can't find crate
|
= help: maybe you need to install the missing components with: `rustup component add rust-src rustc-dev llvm-tools-preview`
error[E0463]: can't find crate for `time`
--> testutil/src/lib.rs:99:24
|
99 | let desc = time::macros::format_description!("nightly-[year]-[month]-[day]");
| ^^^^ can't find crate
error: cannot determine resolution for the macro `time::macros::format_description`
--> testutil/src/lib.rs:99:24
|
99 | let desc = time::macros::format_description!("nightly-[year]-[month]-[day]");
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: import resolution is stuck, try simplifying macro imports
error[E0463]: can't find crate for `rustc_version`
--> testutil/src/lib.rs:70:[23](https://github.com/google/zerocopy/actions/runs/6537994945/job/17753051531#step:11:24)
|
70 | let current = rustc_version::version_meta()?;
| ^^^^^^^^^^^^^ can't find crate
|
= help: maybe you need to install the missing components with: `rustup component add rust-src rustc-dev llvm-tools-preview`
error[E0463]: can't find crate for `time`
--> testutil/src/lib.rs:100:[24](https://github.com/google/zerocopy/actions/runs/6537994945/job/17753051531#step:11:25)
|
100 | let date = time::Date::parse(&pinned_versions.nightly, &desc).map_err(|_| {
| ^^^^ can't find crate
error[E0463]: can't find crate for `time`
--> testutil/src/lib.rs:103:[35](https://github.com/google/zerocopy/actions/runs/6537994945/job/17753051531#step:11:36)
|
103 | let adjusted = date - time::Duration::DAY;
| ^^^^ can't find crate
For more information about this error, try `rustc --explain E0463`.
error: could not compile `testutil` (lib) due to 7 previous errors
Updating from nightly-2023-10-15 to nightly-2023-10-16 seems to have fixed the problem. I presume that means that it's already fixed on master
.
I do not think we did anything that would fix this problem.
Looks like you're right. It may have been a statistical fluke that a few jobs succeeded in a row. It appears to be failing again with nightly-2023-10-16.
Interestingly, there are actually 5 jobs that seem to be perfectly correlated - in every run of jobs, they either all fail or all succeed. In the job I linked to in the previous paragraph, you can scroll down in the sidebar - they're clumped together as a single run of 5 red X marks:
What those 5 jobs share in common is that they're all using the aarch64-unknown-linux-gnu
target.
Maybe this has to do with some sort of HashMap where iteration order matters? ui_test had some non-determinism related to that. But then I wouldn't expect this 5-way correlation. And we are not using HashMap in cargo-miri, so this would have to be something in cargo changing?
How exactly are you invoking Miri, and are any relevant environment variables set?
This job contains the relevant information. Here's what's run:
# Run under both the stacked borrows model (default) and under the tree
# borrows model to ensure we're compliant with both.
for EXTRA_FLAGS in "" "-Zmiri-tree-borrows"; do
MIRIFLAGS="$MIRIFLAGS $EXTRA_FLAGS" ./cargo.sh +nightly \
miri test \
--package zerocopy \
--target aarch64-unknown-linux-gnu \
--no-default-features
done
The source for cargo.sh
is here. The +nightly
argument instructs it to extract from Cargo.toml the version of the nightly toolchain that we have pinned in CI (nightly-2023-10-16 in this job) and then invoke normal cargo
using that version.
And here's the metadata, including env vars:
env:
CARGO_TERM_COLOR: always
RUSTFLAGS: -Dwarnings -Zrandomize-layout
RUSTDOCFLAGS: -Dwarnings
ZC_NIGHTLY_RUSTFLAGS: -Zrandomize-layout
ZC_NIGHTLY_MIRIFLAGS: -Zmiri-symbolic-alignment-check -Zmiri-strict-provenance -Zmiri-backtrace=full
ZC_TOOLCHAIN: nightly-2023-10-16
MIRIFLAGS: -Zmiri-symbolic-alignment-check -Zmiri-strict-provenance -Zmiri-backtrace=full
CARGO_INCREMENTAL: 0
CACHE_ON_FAILURE: false
At a glance, this looks to me like a cargo/rustc issue but then I don't know how to explain only seeing this with Miri.
I think this is somehow related to build caching, because we jump straight from preparing the sysroot to building test_util
, no mention of the dependencies. But then we can't find them. So it seems like the system that determines what to build thinks dependencies are already compiled, but then the system that loads the compiled crates thinks they need to be compiled. Is it possible to download the contents of the CI caches that are in use here?
Is it possible to download the contents of the CI caches that are in use here?
I doubt it, but it might help to look at the job that generates the cache. We also use Swatinem-rust-cache (see the "Rust Cache" step in the failing job). I'll be honest that I'm not sure how that caching interacts with our manual pre-populated caching. @Swatinem might know more.
Okay, thanks. One thing that could help is to add -v
to the cargo flags, then we should see a bit more of what happens.
Is there a .cargo/config.toml
anywhere that would affect the build? We have one known caching issue (https://github.com/rust-lang/miri/issues/2540) but the symptoms look wrong and that requires a non-default target directory.
But yeah the log looks really strange... it's not even trying to build time
and the other missing crates! cargo-miri doesn't do any dependency handling, so this means cargo somehow decides it doesn't need to build the crate. We do funny stuff with the final binary but for dependency building we mostly let cargo do its thing.
My best guess right now is that it has to do with us replacing .rlib
by .rmeta
in --extern
flags -- so if the rlib exists and cargo thinks that's enough then it might not build the crate but Miri expect the .rmeta
file instead. And when the .rmeta file is missing that looks exactly like the error you are seeing. However, miri uses its own target directory and there never should be any .rlib
files in there without corresponding .rmeta
.
The output of tree target/miri -h -P "*.rlib" -P "*.rmeta"
before the build could be interesting.
Looking at that cache action, this here is suspicious:
Before being persisted, the cache is cleaned of:
- Any build artifacts with an mtime older than one week.
Potentially this can lead to having the .rmeta file deleted while the .rlib stubs that cargo-miri creates to make cargo happy stick around. Though we only create the .rlib file when rustc is actually invoked so it should only be a few milliseconds younger than the .rmeta file...
But anyway, a cache action deleting individual files from the cache like that is definitely fragile and I would advice against it. We assume we are in control of our cache directory.
Looking at that cache action, this here is suspicious:
Before being persisted, the cache is cleaned of:
- Any build artifacts with an mtime older than one week.
Potentially this can lead to having the .rmeta file deleted while the .rlib stubs that cargo-miri creates to make cargo happy stick around. Though we only create the .rlib file when rustc is actually invoked so it should only be a few milliseconds younger than the .rmeta file...
But anyway, a cache action deleting individual files from the cache like that is definitely fragile and I would advice against it. We assume we are in control of our cache directory.
@Swatinem, I saw you reacted to this with 👍; do you know if there's something I can do to disable this behavior, are you just saying that you agree that this may be the culprit, that you think rust-cache should change its default behavior, or something else? Just curious if you have any insights since you know rust-cache better than any of us 🙂
Also, I've re-run the job, passing -vvv
to cargo miri
. Here's the result.
I should also call attention to the fact that this job only fails for the aarch64-unknown-linux-gnu
target. I'm surprised that build-related behavior differs by target, especially since we run Miri for a number of other non-host targets, including arm-unknown-linux-gnueabi
. Maybe someone else who's more familiar with the build aspects of Miri, Cargo, or rustc might know what's significant about aarch64.
yeah, that is very strange indeed. We've had target-specific cargo build issues with ui_test and it was due to target-specific dependencies. Whether that's the same here, no idea.
Also, I've re-run the job, passing -vvv to cargo miri. Here's the result.
Sorry, but cargo miri
only recognizes -v -v
, not -vv
:joy:
Also, I've re-run the job, passing -vvv to cargo miri. Here's the result.
Sorry, but
cargo miri
only recognizes-v -v
, not-vv
😂
Hahaha take two: https://github.com/google/zerocopy/actions/runs/6549769669/job/17787505483?pr=516
Ah yes, my 👍🏻 was related to the comment that the cache cleanup is very naive and is just based on the filename pattern and mtime. And it also relies on cargo correctly rebuilding things that are missing.
So it is possible that only one of the artifacts is cleaned, and then cargo does not correctly rebuild the other one.
You can enable debug output of the action and then it should show you directly which files it deletes.
Hm yeah that looks pretty much standard I would say. As suspected cargo doesn't even call us on time
or the other missing crates.
Without a way to reproduce, I'm not sure what to say besides -- try disabling the caching for the Miri job (or cargo clean
before the Miri job); Miri only does check-builds for all target crates so they should be fairly fast anyway. (The slowest part is usually building the sysroot and it looks like you are not caching that.)
The output of tree target/miri -h -P "*.rlib" -P "*.rmeta"
would help confirm the suspicion that somehow the rlib files are preserved and the rmeta files are lost, but it wouldn't help explain how that situation comes about.
Ah gotcha! Here's a run of the job with debugging enabled: https://github.com/google/zerocopy/actions/runs/6549173205/job/17792628437?pr=516
Ah gotcha! Here's a run of the job with debugging enabled: https://github.com/google/zerocopy/actions/runs/6549173205/job/17792628437?pr=516
We probably need that for a successful run, so that we can see what it puts in the cache -- which then presumably poisons it for the next run? is that theory compatible with observations?
Ah no we can see what it extracts in the "Rust Cache" job. That actually looks reasonable
target/miri/aarch64-unknown-linux-gnu/debug/deps/libtime-3900e297511e97bd.rmeta
target/miri/aarch64-unknown-linux-gnu/debug/deps/libtime-3900e297511e97bd.rlib
I added a workaround: https://github.com/google/zerocopy/pull/519
Still no clue why this is only happening on aarch64, but the workaround seems to do the trick 🤷♂️
Good to know that clearing the cache helps.
But yeah this is still very strange...