
Add -showDebugAddress to enable debugging hanging parallel tests


The experience when parallel tests hang currently sucks. (See #323 and #206)

I had some time recently to try out a few ideas and landed on this PR that I'd like feedback on.

The basic premise is that when you run ginkgo -p -showDebugAddress, Ginkgo will emit a debug HTTP address that serves a JSON payload representing:

  • the current running nodes
  • the current running test on each node
  • any stdout output emitted by that test so far
  • any GinkgoWriter writes emitted by that test so far
  • when the test started running

The hope is that this information can be useful to help debug the tests. The JSON output is a bit noisy (it's a serialization of some of the reporter types).

This would ultimately benefit from some sort of user-friendly client (e.g. ginkgo debug <DEBUG ADDRESS> could emit some pretty printed output; or one could imagine a simple web client to render the output nicely). There are also alternative ways to do this (e.g. send a signal to ginkgo to emit this JSON to stdout instead of hitting an HTTP endpoint).
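For illustration only, a client along those lines could be as small as fetching the payload and re-indenting it. This sketch GETs whatever address the user passes in and makes no assumptions about the reporter types' field names; nothing here is part of the PR.

```go
// Hypothetical pretty-printing client for the debug address (illustrative only).
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
)

func main() {
	// os.Args[1] would be the debug address Ginkgo emits.
	resp, err := http.Get("http://" + os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	raw, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Re-indent the raw JSON so the noisy reporter serialization is at least readable.
	var pretty bytes.Buffer
	if err := json.Indent(&pretty, raw, "", "  "); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(pretty.String())
}
```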

Would appreciate any and all feedback and also contributions to improve the user experience.

onsi avatar Apr 25 '18 22:04 onsi

If this information is available on the coordinating node, why wouldn't a combination of running ginkgo with a new flag (sorry, couldn't come up with a good name) and Ctrl-C be sufficient? If you interrupt ginkgo in this new mode, it could print the following for any child node that is still running (see the sketch after the list):

  1. send it SIGABRT to capture the stack trace in stderr
  2. wait for the node to exit
  3. print the current test that is running on that node and when it started
  4. GinkgoWriter output
  5. Standard out output
  6. Standard err output (which should include the stack trace dump)
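A rough sketch of what steps 1 and 2 might boil down to on the coordinating side; the *exec.Cmd handle and function are illustrative, not Ginkgo's actual internals:

```go
import (
	"os/exec"
	"syscall"
)

// abortAndWait signals a still-running child node with SIGABRT and waits for
// it to exit. The Go runtime exits with a dump of all goroutine stacks when it
// receives SIGABRT, so the node's captured stderr (step 6) would contain the
// stack trace.
func abortAndWait(node *exec.Cmd) error {
	if err := node.Process.Signal(syscall.SIGABRT); err != nil {
		return err
	}
	return node.Wait() // a non-nil error is expected: the node died from a signal
}
```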

I would like to point out that the stacktrace is very important to me when debugging hanging tests.

jvshahid avatar Apr 26 '18 16:04 jvshahid

I was imagining you might want to get visibility into the test without it hanging, but could explore some of these options.

onsi avatar Apr 26 '18 16:04 onsi

Actually - I've received an alternative suggestion that would be easier to implement and reason about. Instead, what if:

ginkgo -p -debug (for example)

caused each node to stream its output to a file on disk. Then the user could examine those files to understand what each node is doing and whether it is stuck.

In this mode a signal could be interpreted by the coordinating process as a prompt to have the nodes emit their stack trace - perhaps to the same file.
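As a concrete (and entirely hypothetical) illustration of the signal-to-stack-trace part, assuming SIGUSR1 as the forwarded signal and a plain file handle for the node's debug file:

```go
import (
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

// dumpStacksOnSignal appends the node's goroutine stacks to its debug file
// whenever the forwarded signal arrives. SIGUSR1 and the file handle are
// illustrative choices, not Ginkgo's actual mechanism.
func dumpStacksOnSignal(debugFile *os.File) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			buf := make([]byte, 1<<20)
			n := runtime.Stack(buf, true) // true: capture every goroutine
			debugFile.Write(buf[:n])
		}
	}()
}
```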

Thoughts?

onsi avatar Apr 26 '18 16:04 onsi

A bit more on the 'files on disk' suggestion:

  • Initially it could be controlled by an environment variable, e.g. GINKGO_DEBUG_DIR
  • it would be good to be able to specify the directory, as you might want to put the output somewhere it can be captured (for instance, when running in Concourse)
  • the style of the files could be similar to UNIX syslog, i.e. a high-resolution timestamp followed by human-readable text (a sketch of such a writer follows this list)
  • I don't think there's any need to do clever log rotation etc... for the first implementation
  • Another (even simpler) implementation would be to have each worker log to STDOUT. This could be controlled by a GINKGO_DEBUG environment variable, or with multiple uses of the -v flag - a bit like ssh which allows things like ssh -v -v -v.
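A sketch of the syslog-style idea from the list above: wrap whatever writer the node logs to so each chunk gets a high-resolution timestamp prefix. This is illustrative only and assumes line-buffered writes; a real version would have to split chunks on newlines.

```go
import (
	"io"
	"time"
)

// timestampWriter prefixes each write with an RFC 3339 timestamp before
// forwarding it to the underlying writer. It does not handle a single Write
// carrying multiple lines.
type timestampWriter struct{ out io.Writer }

func (w timestampWriter) Write(p []byte) (int, error) {
	stamp := []byte(time.Now().Format(time.RFC3339Nano) + " ")
	if _, err := w.out.Write(append(stamp, p...)); err != nil {
		return 0, err
	}
	return len(p), nil
}
```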

blgm avatar Apr 26 '18 16:04 blgm

I think I lean towards a flag over an env-var as it is ginkgo's primary interaction model. Can you share why environment variables might be preferred?

onsi avatar Apr 26 '18 16:04 onsi

My only concern with the files on disk is that it could be harder to get at those files if the tests are running on Concourse, for example. Collecting the information and printing it to stdout could help with debugging hanging tests in CI, provided Concourse sends a SIGTERM and allows enough time for ginkgo to print the information to stdout.

jvshahid avatar Apr 26 '18 17:04 jvshahid

Files on disk are easy to look at when debugging failures in CI with hijack, etc., and easy to follow live locally with tail -f.

stdout isn't suitable, both because of redaction concerns in CI, and because of difficulties watching output from multiple nodes in a single stream when watching locally.

I prefer writing to a file as a solution over both an API and stdout.

anEXPer avatar Apr 26 '18 17:04 anEXPer

@onsi, I don’t have a strong view on whether it’s a flag or an environment variable. I would lean towards an environment variable if it’s a “pre-release” feature for which I want feedback from specific users and which might change radically. A command line flag is more appropriate for something that’s stable and ready for use by anyone.

blgm avatar Apr 27 '18 12:04 blgm

I've pushed a file-based approach to the parallel-debug-file branch. Would love folks to take a look and send feedback. @williammartin, I might need your help vendoring a couple of dependencies that had to be pulled in, if we choose to go down this road.

With this approach

ginkgo -p -debug

will emit output from each node to files named ginkgo-node-N.log in the test's directory.

onsi avatar May 06 '18 20:05 onsi

@onsi, after what must seem like a geological age, here are some very quick thoughts on the parallel-debug-file branch:

  • If you run ginkgo recursively, you get a lot of files which are a bit tricky to clean up. You could add them to .gitignore, obviously. I just wonder whether a better user experience would be to put them somewhere else. But that could be fixed in a later iteration.
  • It would be nice if the lines had ISO 8601 timestamps at the start, a bit like a traditional UNIX syslog. That's particularly useful when looking at potential hangs as you have some idea about when the last line was written.
  • I wouldn't worry too much about vendoring just yet. The branch works in Travis. Vendoring in Go is very much in flux, and there are different solutions for different versions of Go.

blgm avatar May 29 '18 21:05 blgm

thanks for taking a look @blgm

I'd love to find an interface that works well. I think the file approach has merit and agree that ending up with lots of files all over the place isn't great. Any thoughts on an alternative design?

As for timestamps: unfortunately, given the syscall hackery involved in rerouting stdout, I don't think I can inject them.
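For context, the "syscall hackery" is roughly this kind of thing (a platform-specific sketch, not necessarily what Ginkgo does): once fd 1 is pointed at the log file, the test's writes go straight to the file at the syscall level, so no Go-level io.Writer ever sees the bytes and there is nowhere to inject a timestamp.

```go
import (
	"os"
	"syscall"
)

// redirectStdout points the process's stdout file descriptor at the debug
// file. After this, writes to stdout bypass any Go code entirely, which is
// why a timestamp can't be injected per line. (Dup2 is not available on every
// platform; some need Dup3. Illustrative only.)
func redirectStdout(f *os.File) error {
	return syscall.Dup2(int(f.Fd()), int(os.Stdout.Fd()))
}
```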

onsi avatar May 30 '18 00:05 onsi

Perhaps creating them all in the same directory with a naming scheme like ginkgo-node-U-N.log, where U is some kind of ID? But that would make it harder to correlate files with the code they relate to.

I'd be in favour of merging this as is, with a caveat that it's an experimental feature subject to change. Most users would not add the --debug flag so would not see any difference. But users who observe occasional Ginkgo hangs would have a tool to help them out. We could then get feedback from these users, and base any future changes on that feedback.

blgm avatar May 30 '18 16:05 blgm

Just out of curiosity, what were the reasons to not follow the suggestion from @vito in https://github.com/onsi/ginkgo/issues/323 ?

I am not super familiar with the way Ginkgo handles parallelism, but each node could trap SIGQUIT and output the stack dump to a known file. Maybe we could also create a new process group with the coordinating node and the nodes running the test, so that a user could kill them all with one command and get the stack dumps.

The advantage of this approach is that there's no need to start Ginkgo with special flags. A disadvantage is that maybe the stack dump is harder to read than logs from the nodes.
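Purely as a sketch of the process-group idea (nothing here reflects Ginkgo's actual process setup): child processes started with os/exec inherit the parent's process group by default, so if the coordinator is a group leader, one kill to the negative group ID reaches everything at once, and each Go process exits with a goroutine dump on SIGQUIT.

```go
import (
	"os"
	"syscall"
)

// quitEveryone sends SIGQUIT to the whole process group: the coordinating
// node and every child node that inherited its group. Each Go process then
// exits with a full goroutine dump. Illustrative only.
func quitEveryone() error {
	pgid, err := syscall.Getpgid(os.Getpid())
	if err != nil {
		return err
	}
	return syscall.Kill(-pgid, syscall.SIGQUIT) // negative PID targets the group
}
```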

nodo avatar May 31 '18 08:05 nodo

Oh, I had missed the second comment on this thread.

I was imagining you might want to get visibility into the test without it hanging, but could explore some of these options.

^ Fair enough, but it assumes users are starting ginkgo with the --debug flag, which is a bit strange... unless they are having issues :)

nodo avatar May 31 '18 08:05 nodo

Good point @nodo

We should update the docs here: https://onsi.github.io/ginkgo/#ginkgo-and-continuous-integration to recommend that folks add the debug flag in CI environments.

onsi avatar Jul 10 '18 18:07 onsi

I've merged the file solution via #499. I'll leave this open to remind us to update the docs.

blgm avatar Jul 10 '18 22:07 blgm

I'll leave this open to remind us to update the docs.

Is it working, @blgm? :)

bruce-ricard avatar Apr 15 '19 19:04 bruce-ricard