tarpaulin stopped working for veloren


Hi xd009642,

Veloren stopped working with tarpaulin a few weeks ago, and I am investigating what caused this. I am currently debugging tarpaulin and wonder if I have found a bug.

Situation: when you compile Veloren (steps to reproduce below) you'll see that the compilation never finishes; it just halts. I attached gdb to my local cargo-tarpaulin, waited for it to get stuck, and saw:

(gdb) bt
#0  0x00007ffff7a2feaa in wait4 () from /usr/lib/libc.so.6
#1  0x000055555639b270 in std::sys::unix::process::process_inner::Process::wait::{{closure}} () at src/libstd/sys/unix/process/process_unix.rs:438
#2  std::sys::unix::cvt_r () at src/libstd/sys/unix/mod.rs:152
#3  std::sys::unix::process::process_inner::Process::wait () at src/libstd/sys/unix/process/process_unix.rs:438
#4  std::process::Child::wait () at src/libstd/process.rs:1394
#5  0x00005555559a80ff in cargo_tarpaulin::cargo::run_cargo (metadata=0x7fffffff4e28, manifest=..., config=0x55555681c9f0, ty=..., result=0x7fffffff4df0) at src/cargo.rs:160
#6  0x00005555559a6eb9 in cargo_tarpaulin::cargo::get_tests (config=0x55555681c9f0) at src/cargo.rs:116
#7  0x00005555558b83ab in cargo_tarpaulin::launch_tarpaulin (config=0x55555681c9f0, logger=0x7fffffff6cd8) at src/lib.rs:144
#8  0x00005555558b50e9 in cargo_tarpaulin::trace (configs=...) at src/lib.rs:65
#9  0x00005555558b6593 in cargo_tarpaulin::run (configs=...) at src/lib.rs:100
#10 0x00005555556a5125 in cargo_tarpaulin::main () at src/main.rs:194

Frame #5 suggests it's stuck in src/cargo.rs:160. Looking at the code (https://github.com/xd009642/tarpaulin/blob/develop/src/cargo.rs#L160), my guess is that tarpaulin just invokes the system cargo here and then waits on it. So maybe the problem is not with tarpaulin but with cargo?

Okay, I want to investigate further and analyse what cargo is doing, so I grab the PID of the child and attach gdb to it:

Sep 29 14:30:05.716 TRACE cargo_tarpaulin::cargo: Running command "cargo" "test" "--no-run" "--message-format" "json" "--manifest-path" "/mnt/nfs/marcel/Entw/Rust/veloren/Cargo.toml" "--tests" "-vvv" "--target-dir" "/mnt/games/cargo-build/"
[Detaching after vfork from child process 167525]

gdb backtrace of the second process, the cargo process:

^C
Thread 1 "cargo" received signal SIGINT, Interrupt.
0x00007ffff7e99f9f in write () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7e99f9f in write () from /usr/lib/libc.so.6
#1  0x0000555555fc9e6d in std::sys::unix::fd::FileDesc::write () at library/std/src/sys/unix/fd.rs:139
#2  <std::sys::unix::stdio::Stdout as std::io::Write>::write () at library/std/src/sys/unix/stdio.rs:38
#3  <std::io::stdio::StdoutRaw as std::io::Write>::write () at library/std/src/io/stdio.rs:117
#4  <std::io::buffered::BufWriter<W> as std::io::Write>::write () at library/std/src/io/buffered.rs:758
#5  <std::io::buffered::LineWriterShim<W> as std::io::Write>::write () at library/std/src/io/buffered.rs:1003
#6  <std::io::buffered::LineWriter<W> as std::io::Write>::write () at library/std/src/io/buffered.rs:1396
#7  <std::io::stdio::StdoutLock as std::io::Write>::write () at library/std/src/io/stdio.rs:642
#8  0x0000555555fc97c7 in <&std::io::stdio::Stdout as std::io::Write>::write () at library/std/src/io/stdio.rs:616
#9  <std::io::stdio::Stdout as std::io::Write>::write () at library/std/src/io/stdio.rs:590
#10 0x000055555584137a in <termcolor::LossyStandardStream<W> as std::io::Write>::write ()
#11 0x0000555555811dc5 in std::io::Write::write_all ()
#12 0x0000555555714b81 in <std::io::Write::write_fmt::Adaptor<T> as core::fmt::Write>::write_str ()
#13 0x0000555555fff13c in core::fmt::write () at library/core/src/fmt/mod.rs:1080
#14 0x0000555555812105 in std::io::Write::write_fmt ()
#15 0x000055555594a466 in cargo::core::compiler::job_queue::DrainState::drain_the_queue ()
#16 0x00005555558510b3 in std::panic::catch_unwind ()
#17 0x00005555557c9197 in crossbeam_utils::thread::scope ()
#18 0x0000555555948195 in cargo::core::compiler::job_queue::JobQueue::execute ()
#19 0x0000555555794c7f in cargo::core::compiler::context::Context::compile ()
#20 0x0000555555726bdc in cargo::ops::cargo_compile::compile_ws ()
#21 0x000055555572693e in cargo::ops::cargo_compile::compile ()
#22 0x0000555555addd22 in cargo::ops::cargo_test::compile_tests ()
#23 0x0000555555adaadf in cargo::ops::cargo_test::run_tests ()
#24 0x00005555556be3fd in cargo::commands::test::exec ()
#25 0x00005555556da8e2 in cargo::cli::main ()
#26 0x0000555555677008 in cargo::main ()
#27 0x00005555556ca223 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#28 0x00005555556ca489 in std::rt::lang_start::{{closure}} ()
#29 0x0000555555fd93e0 in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once () at /rustc/623fb90b5a1f324e0ec44085116bf858cef19a00/library/core/src/ops/function.rs:259
#30 std::panicking::try::do_call () at library/std/src/panicking.rs:381
#31 std::panicking::try () at library/std/src/panicking.rs:345
#32 std::panic::catch_unwind () at library/std/src/panic.rs:382
#33 std::rt::lang_start_internal () at library/std/src/rt.rs:51
#34 0x0000555555679b32 in main ()

From this I can see that the cargo process started by tarpaulin is blocked on a write, which sounds really strange to me. I checked my hard drive, and it is not writing to disk. Also, frame #8 shows stdio, which leads me to my hypothesis: running the cargo command manually finishes, but running it from tarpaulin doesn't, which seems super weird to me.

My guess: tarpaulin starts cargo and waits for it to exit. cargo wants to write something into a pipe back to tarpaulin, but the pipe is full, so cargo blocks until the pipe is drained. tarpaulin doesn't read from the pipe; it only waits for the process to finish... voilà, we have a deadlock. This would only occur on big projects (like Veloren) and could be solved by tarpaulin not just waiting, but also reading and storing the stdio output somewhere.

Do you think that could be a realistic root cause?
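
To make the hypothesized deadlock concrete, here is a minimal sketch of the failure mode and one way to avoid it by draining the child's stdout instead of only calling wait() (hypothetical code, not tarpaulin's actual implementation):

use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    let mut child = Command::new("cargo")
        .args(["test", "--no-run", "--message-format", "json"])
        .stdout(Stdio::piped())
        .spawn()?;
    // Deadlock-prone: calling child.wait() here without ever reading
    // child.stdout can hang once cargo has written enough JSON to fill the
    // pipe's kernel buffer (commonly 64 KiB on Linux): cargo blocks in
    // write(2) while the parent blocks in wait4(2).
    //
    // Safer: drain stdout before (or while) waiting, so cargo can always
    // make progress.
    let stdout = child.stdout.take().expect("stdout was piped");
    let mut lines = Vec::new();
    for line in BufReader::new(stdout).lines() {
        lines.push(line?);
    }
    let status = child.wait()?; // the pipe is already drained, so this returns
    println!("cargo exited with {status}, captured {} lines", lines.len());
    Ok(())
}

std::process::Command::output() does this draining for you; for very large outputs, reading stdout and stderr on separate threads avoids the analogous stderr deadlock.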

xMAC94x avatar Sep 29 '20 12:09 xMAC94x

That does sound like a potential issue. I'd have to double check the docs, but given that it's returning with an error it should be okay to not wait and just let the child be dropped; it should detach on drop, and then any pipes will close and we won't be left with a zombie (I think).

xd009642 avatar Sep 29 '20 12:09 xd009642

Oh sorry, I forgot to add the steps to reproduce. So it does not fail; it just generates a HUGE output. That's why the child will never be dropped.

So here's how to reproduce.

  1. Grab Veloren from https://gitlab.com/veloren/veloren and check out the xMAC94x/fix-tarpaulin branch. Save it in /tmp/veloren.
  2. Run

cargo test --no-run --message-format json --manifest-path /tmp/veloren/Cargo.toml --tests -vvv

     It works.
  3. Run tarpaulin. It will hang forever.

xMAC94x avatar Sep 29 '20 13:09 xMAC94x

So I'll try to look at this properly later in the week or at the weekend. But my suggestion was more along the lines of deleting the call to wait at https://github.com/xd009642/tarpaulin/blob/develop/src/cargo.rs#L160 and wondering whether that could solve all the issues simply through what happens when a Command is dropped 🤔

xd009642 avatar Sep 29 '20 20:09 xd009642

@xd009642 I created a minimal reproduction scenario (just include vek with platform_intrinsics): https://github.com/xMAC94x/testtarpaulin. It shows additional logs with the PR I just created.

xMAC94x avatar Oct 02 '20 17:10 xMAC94x

Well, the issue is -C link-dead-code not working with vek. I wonder if this is an issue that should be raised against the compiler :thinking: As for the hang, I'm going to work on a solution. In the meantime, for Veloren I guess the easiest solution would be to make platform_intrinsics an optional feature (it can still be a default feature) and then run tarpaulin with that feature disabled.

xd009642 avatar Oct 04 '20 13:10 xd009642

Sounds reasonable. But do you know how to change vek = { version = "0.12.0", features = ["platform_intrinsics", "serde"] } so that serde is always required, but platform_intrinsics depends on a feature?
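
For illustration, this is roughly the manifest shape the question is asking for: a sketch of a crate-level feature that forwards to the dependency's feature. Whether that works out for platform_intrinsics here is what the thread goes on to discuss.

[dependencies]
# serde support is always wanted, so it stays in the dependency's feature list
vek = { version = "0.12.0", features = ["serde"] }

[features]
# hypothetical crate-level feature that forwards to vek's platform_intrinsics
platform_intrinsics = ["vek/platform_intrinsics"]

Running tarpaulin with that feature left disabled would then skip the problematic vek feature.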

Btw, the MR I created traces the information; it might make sense for other people facing similar issues to get more information than just an endless wait.

xMAC94x avatar Oct 04 '20 14:10 xMAC94x

I don't know how to solve that feature issue; a quick Google search pointed me towards an RFC, so it might not be possible. Also, the fix I'm working on changes the bit of code your PR touches and might make it unneeded; I was just going to run some tests first :eyes:

xd009642 avatar Oct 04 '20 14:10 xd009642

I've pushed a change to cargo_hang. I need to figure out a bit more, since the error logs I added for debugging didn't appear as I'd expect :thinking: but if you want, try it locally and confirm it "fixes" the hang on Veloren. Also, I raised this issue: https://github.com/rust-lang/rust/issues/77529

xd009642 avatar Oct 04 '20 15:10 xd009642

@xd009642 Just tested it: the cargo_hang branch fixes the endless hang, but the error still exists. Thanks for opening the issue :)

xMAC94x avatar Oct 05 '20 21:10 xMAC94x

Cool, I'll merge cargo_hang in then and keep this issue open to remind me to keep checking up on the dead-code issue :+1:. EDIT: cargo_hang is now merged into develop; I'll probably delete the feature branch in a few days.

xd009642 avatar Oct 05 '20 21:10 xd009642

Thanks for fixing this! Any recommendations on the best way to get this into a repo (maybe a release)? (e.g. https://github.com/rust-bio/rust-bio/issues/374 is potentially caused by this)

tshauck avatar Oct 13 '20 03:10 tshauck

I can always do a release this week. Alternatively, there are tarpaulin Docker images built against the develop branch, or cargo can install from a git URL and branch name. If I have some time tonight I'll try to look at your repo and find out which features cause the issue.
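
For example, installing straight from the development branch would look something like this (assuming the develop branch and the cargo-tarpaulin package name):

cargo install --git https://github.com/xd009642/tarpaulin --branch develop cargo-tarpaulin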

xd009642 avatar Oct 13 '20 08:10 xd009642

Cool, thanks. It's probably not a big deal to go one of those routes if you can't do a release in the interim. The package uses the Rust GitHub Action to install tarpaulin, so being able to keep going that way would be nice, but as you mention, there are several workarounds.

tshauck avatar Oct 13 '20 19:10 tshauck

It seems like we no longer have the timeouts over at rust-bio. But we now have a compilation error in the tests that only occurs when using tarpaulin, not when using regular cargo test. So it could be that this error simply triggers before any timeout.

dlaehnemann avatar Oct 19 '20 20:10 dlaehnemann

@dlaehnemann IIRC tarpaulin was fixed so that it no longer deadlocks on this error (for us it was a deadlock, not a timeout). We have a similar error in our pipeline now; I think the rustlang issue posted above is the reason for this error.

xMAC94x avatar Oct 19 '20 22:10 xMAC94x

@xMAC94x I just released a new version of tarpaulin; could you let me know if there are any issues with it and Veloren? :smile:

xd009642 avatar Jun 28 '21 18:06 xd009642

Closing this issue as the llvm coverage backend is now released as an alternative and there haven't been any updates here in over a year (as always feel free to reopen if it's been closed in haste)

xd009642 avatar Oct 13 '22 17:10 xd009642

> Closing this issue as the llvm coverage backend is now released as an alternative and there haven't been any updates here in over a year (as always feel free to reopen if it's been closed in haste)

Is -C link-dead-code not necessary or less important with the llvm coverage backend? Asking since it seems to be implied but I haven't found any information on this.

Imberflur avatar Feb 02 '24 21:02 Imberflur