cvise stops intermittently

Open avikivity opened this issue 1 year ago • 7 comments

Occasionally I see

[1]+  Stopped                 cvise --clang-delta-std=c++20 --print-diff ./check.sh sstable_datafile_test.cc

And I have to restart the job with fg. This is of course problematic for unattended runs.

The interestingness test runs gdb in batch mode (gdb also runs the program). Perhaps gdb signals interfere with cvise?

avikivity avatar May 25 '23 09:05 avikivity

A workaround is to send SIGCONT in a loop from some shell script, but I'd like to understand and fix it.
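A minimal sketch of such a SIGCONT loop, written in Python rather than shell so that it can also log when each stop happened. It is Linux-only, since it reads the process state from /proc. The helper names `is_stopped` and `keep_running` are made up for this illustration and are not part of cvise.

```python
import os
import signal
import time


def is_stopped(pid: int) -> bool:
    """Return True if /proc/<pid>/stat reports state 'T' (stopped by a signal)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The state letter is the first field after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0] == "T"


def keep_running(pid: int, interval: float = 5.0) -> None:
    """Poll pid, send SIGCONT whenever it is found stopped, return once it is gone."""
    while True:
        try:
            if is_stopped(pid):
                # The timestamp can later be matched against the cvise log,
                # e.g. to see whether stops coincide with pass changes.
                print(f"{time.ctime()}: pid {pid} stopped, sending SIGCONT", flush=True)
                os.kill(pid, signal.SIGCONT)
        except (FileNotFoundError, ProcessLookupError):
            return  # the process has exited
        time.sleep(interval)
```

Calling `keep_running(<pid of cvise>)` from a second terminal keeps the reduction moving. It only papers over the stop and does not explain it.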

avikivity avatar May 25 '23 09:05 avikivity

Hmm, I haven't seen this behavior during my cvise use. So you're saying that the master process (cvise ...) got moved to the background and was therefore stopped? All the interestingness tests are run in a separate sub-process (using the Pebble library), and they should not interact with the master process at all. Can you get a Python backtrace of the master process when it gets moved to the background? Does it really happen only if the interestingness test uses gdb?

marxin avatar May 26 '23 09:05 marxin

I don't know if it was moved into the background, or if something else happened.

I don't use cvise very often (but when I do, it's for multi-day reductions), so I can't tell if it's related to gdb or not. It seems likely since gdb plays with signals.

I don't know how to generate a Python backtrace (and if it's stopped, I'm sure I won't get one).
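One way to get a Python backtrace from a running process is the standard-library `faulthandler` module. The sketch below assumes the snippet can be executed early in the process to be inspected (for example by editing the cvise entry script locally); nothing in this thread says cvise does this itself. `install_traceback_dump` is a hypothetical helper name.

```python
import faulthandler
import signal
import sys


def install_traceback_dump(out=sys.stderr):
    """Dump the Python stack of every thread to `out` when SIGUSR1 arrives."""
    # `out` must be a real file (it needs a file descriptor) and must stay
    # open for as long as the handler is registered.
    faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)
```

After that, `kill -USR1 <pid>` prints the stacks. A stopped process does not act on SIGUSR1 until it is continued, so the signal stays pending and the dump appears once SIGCONT is delivered.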

I guess a workaround is to package the interestingness test into a container; this should isolate any signal leakage. Still, it would be nice if cvise protected itself from this.
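A lighter-weight form of isolation than a container, sketched here as an assumption and not as a confirmed fix: run the interestingness test in its own session, with stdin taken from /dev/null, so that it has no controlling terminal and cannot change the terminal's foreground process group. `run_detached` is a hypothetical wrapper name, and whether this prevents the stops described above is untested.

```python
import subprocess
import sys


def run_detached(cmd: list[str]) -> int:
    """Run cmd in a new session with stdin from /dev/null; return its exit status."""
    proc = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        start_new_session=True,  # the child calls setsid() before exec
    )
    return proc.returncode
```

A wrapper script passed to cvise in place of `./check.sh` could end with `sys.exit(0 if run_detached(["./check.sh"]) == 0 else 1)`, which keeps the usual convention that exit status 0 means "interesting".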

avikivity avatar May 28 '23 14:05 avikivity

> I don't know if it was moved into the background, or if something else happened.

Well, the described behavior seems pretty unusual.

> I don't use cvise very often (but when I do, it's for multi-day reductions), so I can't tell if it's related to gdb or not. It seems likely since gdb plays with signals.

Anyway, can you please attach a reproducer that I can run locally to try to reproduce it?

> I guess a workaround is to package the interestingness test into a container, this should isolate any signals leakage. Still, it would be nice if cvise protected itself from this.

Well, that sounds like a solution, but C-Vise should not behave the way you described ;)

marxin avatar Jun 02 '23 18:06 marxin

https://github.com/avikivity/scylladb/commits/bug-13730-investigation

Steps to reproduce:

  1. clone the repo into a Fedora 38 installation (or anything with clang 16 + all the dependencies)
  2. run ./cvise.sh
  3. wait for long, long hours

There's a container image with all the dependencies: docker.io/scylladb/scylla-toolchain:fedora-38-20230517. However, I did not try reproducing within the container, only on my Fedora 38 host. Note that you'll need to run the container as --privileged, since ptrace isn't available otherwise.

The problem reproduces rarely. I have a feeling it happens when the pass changes, but it hasn't happened enough times, and usually I wasn't looking when it did.

avikivity avatar Jun 04 '23 16:06 avikivity

Thanks for the reproducer. Note that I'm changing jobs right now, and I will get to it about a month from now, when I'll have a reasonably powerful machine to reproduce it on. Hope that's fine?

marxin avatar Jun 05 '23 13:06 marxin

All right, so I've changed jobs and got a reasonably fast desktop machine.

Looking at your reproducer: can you please create a container (the provided one, docker.io/scylladb/scylla-toolchain:fedora-38-20230517, seems to be unavailable), add your git branch to it, and provide me a link? Thanks! Note that it seems one needs to have built things like /home/avi/scylla/build/release/seastar/libseastar.a (and probably others) in order to link the sstable_datafile_test_g binary.

marxin avatar Jul 07 '23 15:07 marxin