
Miri builds fail nondeterministically

Open joshlf opened this issue 1 year ago • 23 comments

On https://github.com/google/zerocopy/pull/506, this job succeeded while this job failed. Both jobs ran the same code on the same commit, within a few minutes of one another, using a pinned version of the Rust toolchain (nightly-2023-10-15). Here's the error output from the failing job:

Error output
Preparing a sysroot for Miri (target: aarch64-unknown-linux-gnu)... done
   Compiling libc v0.2.149
   Compiling cfg-if v1.0.0
   Compiling rustversion v1.0.14
   Compiling ppv-lite86 v0.2.17
   Compiling either v1.9.0
   Compiling itertools v0.11.0
   Compiling getrandom v0.2.10
   Compiling rand_core v0.6.4
   Compiling rand_chacha v0.3.1
   Compiling rand v0.8.5
   Compiling testutil v0.0.0 (/home/runner/work/zerocopy/zerocopy/testutil)
error[E0463]: can't find crate for `cargo_metadata`
 --> testutil/src/lib.rs:5:5
  |
5 | use cargo_metadata::MetadataCommand;
  |     ^^^^^^^^^^^^^^ can't find crate

error[E0463]: can't find crate for `rustc_version`
 --> testutil/src/lib.rs:6:5
  |
6 | use rustc_version::{Channel, Version};
  |     ^^^^^^^^^^^^^ can't find crate
  |
  = help: maybe you need to install the missing components with: `rustup component add rust-src rustc-dev llvm-tools-preview`

error[E0463]: can't find crate for `time`
  --> testutil/src/lib.rs:99:24
   |
99 |             let desc = time::macros::format_description!("nightly-[year]-[month]-[day]");
   |                        ^^^^ can't find crate

error: cannot determine resolution for the macro `time::macros::format_description`
  --> testutil/src/lib.rs:99:24
   |
99 |             let desc = time::macros::format_description!("nightly-[year]-[month]-[day]");
   |                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: import resolution is stuck, try simplifying macro imports

error[E0463]: can't find crate for `rustc_version`
  --> testutil/src/lib.rs:70:23
   |
70 |         let current = rustc_version::version_meta()?;
   |                       ^^^^^^^^^^^^^ can't find crate
   |
   = help: maybe you need to install the missing components with: `rustup component add rust-src rustc-dev llvm-tools-preview`

error[E0463]: can't find crate for `time`
   --> testutil/src/lib.rs:100:24
    |
100 |             let date = time::Date::parse(&pinned_versions.nightly, &desc).map_err(|_| {
    |                        ^^^^ can't find crate

error[E0463]: can't find crate for `time`
   --> testutil/src/lib.rs:103:35
    |
103 |             let adjusted = date - time::Duration::DAY;
    |                                   ^^^^ can't find crate

For more information about this error, try `rustc --explain E0463`.
error: could not compile `testutil` (lib) due to 7 previous errors

joshlf avatar Oct 16 '23 18:10 joshlf

Updating from nightly-2023-10-15 to nightly-2023-10-16 seems to have fixed the problem. I presume that means that it's already fixed on master.

joshlf avatar Oct 16 '23 19:10 joshlf

I do not think we did anything that would fix this problem.

saethlin avatar Oct 16 '23 19:10 saethlin

Looks like you're right. It may have been a statistical fluke that a few jobs succeeded in a row. It appears to be failing again with nightly-2023-10-16.

Interestingly, there are actually 5 jobs that seem to be perfectly correlated: in every run, they either all fail or all succeed. In the job I linked to in the previous paragraph, you can scroll down in the sidebar and see them clumped together as a single run of 5 red X marks:

[Screenshot 2023-10-16 at 12:32 PM: sidebar showing the clump of 5 failing (red X) jobs]

joshlf avatar Oct 16 '23 19:10 joshlf

What those 5 jobs have in common is that they all use the aarch64-unknown-linux-gnu target.

joshlf avatar Oct 16 '23 19:10 joshlf

Maybe this has to do with some sort of HashMap where iteration order matters? ui_test had some non-determinism related to that. But then I wouldn't expect this 5-way correlation. And we are not using HashMap in cargo-miri, so this would have to be something in cargo changing?

How exactly are you invoking Miri, and are any relevant environment variables set?

RalfJung avatar Oct 16 '23 22:10 RalfJung

This job contains the relevant information. Here's what's run:

  # Run under both the stacked borrows model (default) and under the tree 
  # borrows model to ensure we're compliant with both.
  for EXTRA_FLAGS in "" "-Zmiri-tree-borrows"; do
    MIRIFLAGS="$MIRIFLAGS $EXTRA_FLAGS" ./cargo.sh +nightly \
      miri test \
      --package zerocopy \
      --target aarch64-unknown-linux-gnu \
      --no-default-features
  done

The source for cargo.sh is here. The +nightly argument instructs it to extract from Cargo.toml the version of the nightly toolchain that we have pinned in CI (nightly-2023-10-16 in this job) and then invoke normal cargo using that version.
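
Roughly, the +nightly handling amounts to something like this (a simplified sketch, not the real script; the grep is just my assumption about how the pinned toolchain could be pulled out of Cargo.toml):

  #!/usr/bin/env bash
  # Simplified sketch of cargo.sh's `+nightly` handling (not the real script).
  set -euo pipefail
  if [[ "${1:-}" == "+nightly" ]]; then
    shift
    # Assumption: the pinned toolchain string (e.g. nightly-2023-10-16) can be
    # grepped straight out of Cargo.toml.
    toolchain="$(grep -o 'nightly-[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}' Cargo.toml | head -n1)"
    exec cargo "+${toolchain}" "$@"
  fi
  exec cargo "$@"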

And here's the metadata, including env vars:

 env:
    CARGO_TERM_COLOR: always
    RUSTFLAGS: -Dwarnings -Zrandomize-layout
    RUSTDOCFLAGS: -Dwarnings
    ZC_NIGHTLY_RUSTFLAGS: -Zrandomize-layout
    ZC_NIGHTLY_MIRIFLAGS: -Zmiri-symbolic-alignment-check -Zmiri-strict-provenance -Zmiri-backtrace=full
    ZC_TOOLCHAIN: nightly-2023-10-16
    MIRIFLAGS:  -Zmiri-symbolic-alignment-check -Zmiri-strict-provenance -Zmiri-backtrace=full
    CARGO_INCREMENTAL: 0
    CACHE_ON_FAILURE: false

joshlf avatar Oct 16 '23 22:10 joshlf

At a glance, this looks to me like a cargo/rustc issue, but then I don't know how to explain why we only see it with Miri.

I think this is somehow related to build caching, because we jump straight from preparing the sysroot to building testutil, with no mention of the missing dependencies, and yet they can't be found. So it seems like the system that determines what to build thinks the dependencies are already compiled, while the system that loads the compiled crates can't find them. Is it possible to download the contents of the CI caches that are in use here?
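
In the meantime, a debugging step that dumps the state of Miri's target directory right before the build would let us diff a failing run against a passing one. A sketch, assuming the default target/miri layout that cargo-miri uses:

  # List every compiled artifact with mtime and size so runs can be compared.
  find target/miri -type f \( -name '*.rlib' -o -name '*.rmeta' \) \
    -printf '%TY-%Tm-%Td %TH:%TM  %10s  %p\n' | sort -k4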

saethlin avatar Oct 16 '23 22:10 saethlin

Is it possible to download the contents of the CI caches that are in use here?

I doubt it, but it might help to look at the job that generates the cache. We also use Swatinem/rust-cache (see the "Rust Cache" step in the failing job). To be honest, I'm not sure how that caching interacts with our manual pre-populated caching. @Swatinem might know more.

joshlf avatar Oct 16 '23 22:10 joshlf

Okay, thanks. One thing that could help is to add -v to the cargo flags; then we should see a bit more of what happens.
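
For example, on the invocation quoted above that would be (everything else unchanged):

  MIRIFLAGS="$MIRIFLAGS $EXTRA_FLAGS" ./cargo.sh +nightly \
    miri test -v \
    --package zerocopy \
    --target aarch64-unknown-linux-gnu \
    --no-default-features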

Is there a .cargo/config.toml anywhere that would affect the build? We have one known caching issue (https://github.com/rust-lang/miri/issues/2540) but the symptoms look wrong and that requires a non-default target directory.

RalfJung avatar Oct 17 '23 05:10 RalfJung

But yeah the log looks really strange... it's not even trying to build time and the other missing crates! cargo-miri doesn't do any dependency handling, so this means cargo somehow decides it doesn't need to build the crate. We do funny stuff with the final binary but for dependency building we mostly let cargo do its thing.

My best guess right now is that it has to do with us replacing .rlib with .rmeta in --extern flags -- so if the rlib exists and cargo thinks that's enough, it might not build the crate, but Miri expects the .rmeta file instead. And when the .rmeta file is missing, that looks exactly like the error you are seeing. However, Miri uses its own target directory, and there should never be any .rlib files in there without a corresponding .rmeta.

The output of tree target/miri -h -P "*.rlib" -P "*.rmeta" before the build could be interesting.
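
If tree isn't installed on the runner, a small loop that flags the suspected inconsistent state directly would also do (untested sketch; adjust the path if the target dir differs):

  # Report any .rlib in Miri's target dir that has no sibling .rmeta.
  find target/miri -name '*.rlib' | while read -r rlib; do
    [ -f "${rlib%.rlib}.rmeta" ] || echo "missing rmeta for: $rlib"
  done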

RalfJung avatar Oct 17 '23 05:10 RalfJung

Looking at that cache action, this here is suspicious:

Before being persisted, the cache is cleaned of:

  • Any build artifacts with an mtime older than one week.

Potentially this can lead to having the .rmeta file deleted while the .rlib stubs that cargo-miri creates to make cargo happy stick around. Though we only create the .rlib file when rustc is actually invoked so it should only be a few milliseconds younger than the .rmeta file...
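
As a rough shell model of that cleanup rule (the action is implemented quite differently; this is only to illustrate the hazard):

  # Roughly what a one-week mtime cutoff amounts to; add -delete to actually
  # remove matches. If this ever catches libfoo-<hash>.rmeta but not the
  # matching .rlib stub, cargo still considers the crate fresh while the
  # build can no longer load it.
  find target/miri -type f -mtime +7 -print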

But anyway, a cache action deleting individual files from the cache like that is definitely fragile and I would advise against it. We assume we are in control of our cache directory.

RalfJung avatar Oct 17 '23 06:10 RalfJung

Looking at that cache action, this here is suspicious: before being persisted, the cache is cleaned of any build artifacts with an mtime older than one week. But anyway, a cache action deleting individual files from the cache like that is definitely fragile and I would advise against it.

@Swatinem, I saw you reacted to this with 👍; do you know if there's something I can do to disable this behavior? Or are you just saying that you agree this may be the culprit, that you think rust-cache should change its default behavior, or something else? Just curious whether you have any insights, since you know rust-cache better than any of us 🙂

Also, I've re-run the job, passing -vvv to cargo miri. Here's the result.

joshlf avatar Oct 17 '23 15:10 joshlf

I should also call attention to the fact that this job only fails for the aarch64-unknown-linux-gnu target. I'm surprised that build-related behavior differs by target, especially since we run Miri for a number of other non-host targets, including arm-unknown-linux-gnueabi. Maybe someone else who's more familiar with the build aspects of Miri, Cargo, or rustc might know what's significant about aarch64.

joshlf avatar Oct 17 '23 15:10 joshlf

yeah, that is very strange indeed. We've had target-specific cargo build issues with ui_test and it was due to target-specific dependencies. Whether that's the same here, no idea.

RalfJung avatar Oct 17 '23 15:10 RalfJung

Also, I've re-run the job, passing -vvv to cargo miri. Here's the result.

Sorry, but cargo miri only recognizes -v -v, not -vv 😂

RalfJung avatar Oct 17 '23 15:10 RalfJung

Also, I've re-run the job, passing -vvv to cargo miri. Here's the result.

Sorry, but cargo miri only recognizes -v -v, not -vv 😂

Hahaha take two: https://github.com/google/zerocopy/actions/runs/6549769669/job/17787505483?pr=516

joshlf avatar Oct 17 '23 16:10 joshlf

Ah yes, my 👍🏻 was related to the comment that the cache cleanup is very naive and is just based on the filename pattern and mtime. And it also relies on cargo correctly rebuilding things that are missing.

So it is possible that only one of the artifacts is cleaned, and then cargo does not correctly rebuild the other one.

You can enable debug output of the action and then it should show you directly which files it deletes.
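
Assuming the action logs its cleanup through the runner's standard step-debug channel, one way to enable that (given the gh CLI and admin access on the repo) is:

  # Turn on step-debug logging for subsequent workflow runs; the
  # "Enable debug logging" checkbox on a manual re-run achieves the same.
  gh secret set ACTIONS_STEP_DEBUG --body "true"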

Swatinem avatar Oct 17 '23 18:10 Swatinem

Hm, yeah, that looks pretty much standard, I would say. As suspected, cargo doesn't even call us for time or the other missing crates.

Without a way to reproduce, I'm not sure what to say besides: try disabling the caching for the Miri job (or running cargo clean before the Miri job); Miri only does check-builds for all target crates, so they should be fairly fast anyway. (The slowest part is usually building the sysroot, and it looks like you are not caching that.)
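
Concretely, something like this before the Miri invocation should be enough (sketch; target/miri is the separate target directory cargo-miri uses):

  # Wipe Miri's build products so a restored cache cannot leave the deps in an
  # inconsistent (rlib-without-rmeta) state.
  rm -rf target/miri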

The output of tree target/miri -h -P "*.rlib" -P "*.rmeta" would help confirm the suspicion that somehow the rlib files are preserved and the rmeta files are lost, but it wouldn't help explain how that situation comes about.

RalfJung avatar Oct 17 '23 18:10 RalfJung

Ah gotcha! Here's a run of the job with debugging enabled: https://github.com/google/zerocopy/actions/runs/6549173205/job/17792628437?pr=516

joshlf avatar Oct 17 '23 18:10 joshlf

Ah gotcha! Here's a run of the job with debugging enabled: https://github.com/google/zerocopy/actions/runs/6549173205/job/17792628437?pr=516

We probably need that for a successful run, so that we can see what it puts in the cache, which then presumably poisons it for the next run? Is that theory compatible with the observations?

RalfJung avatar Oct 17 '23 18:10 RalfJung

Ah no, we can see what it extracts in the "Rust Cache" step. That actually looks reasonable:

target/miri/aarch64-unknown-linux-gnu/debug/deps/libtime-3900e297511e97bd.rmeta
target/miri/aarch64-unknown-linux-gnu/debug/deps/libtime-3900e297511e97bd.rlib

RalfJung avatar Oct 17 '23 19:10 RalfJung

I added a workaround: https://github.com/google/zerocopy/pull/519

Still no clue why this is only happening on aarch64, but the workaround seems to do the trick 🤷‍♂️

joshlf avatar Oct 18 '23 17:10 joshlf

Good to know that clearing the cache helps.

But yeah this is still very strange...

RalfJung avatar Oct 18 '23 20:10 RalfJung