ginkgo
Add -showDebugAddress to enable debugging hanging parallel tests
The experience when parallel tests hang currently sucks. (See #323 and #206)
I had some time recently to try out a few ideas and landed on this PR that I'd like feedback on.
The basic premise is that when you now run `ginkgo -p -showDebugAddress`, Ginkgo will emit a debug HTTP address that delivers a JSON payload representing:
- the current running nodes
- the current running test on each node
- any stdout output emitted by that test so far
- any `GinkgoWriter` writes emitted by that test so far
- when the test started running
The hope is that this information can be useful in debugging the tests. The JSON output is a bit noisy (it's a serialization of some of the reporter types).
This would ultimately benefit from some sort of user-friendly client (e.g. `ginkgo debug <DEBUG ADDRESS>` could emit some pretty-printed output; or one could imagine a simple web client to render the output nicely). There are also alternative ways to do this (e.g. send a signal to `ginkgo` to emit this JSON to stdout instead of hitting an HTTP endpoint).
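As a rough illustration of what such a user-friendly client might look like, here is a minimal sketch that fetches the JSON payload from the debug address and re-indents it for reading. The address format and payload shape are assumptions based on the description above; none of this is part of ginkgo itself.

```go
// Hypothetical client for the -showDebugAddress endpoint.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// prettyJSON re-indents a raw JSON payload for human reading.
func prettyJSON(raw []byte) (string, error) {
	var out bytes.Buffer
	if err := json.Indent(&out, raw, "", "  "); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// e.g. go run debugclient.go http://127.0.0.1:51234/debug
	resp, err := http.Get(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	pretty, err := prettyJSON(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(pretty)
}
```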
Would appreciate any and all feedback and also contributions to improve the user experience.
If this information is available on the coordinating node, why wouldn't a combination of running ginkgo with a new flag (sorry, couldn't come up with a good name) and Ctrl-C be sufficient? If you interrupt ginkgo in this new mode, it could do the following for any child node that is still running:
- send it SIGABRT to capture the stack trace on stderr
- wait for the node to exit
- print the current test that is running on that node and when it started
- print the `GinkgoWriter` output
- print the standard out output
- print the standard err output (which should include the stack trace dump)
I would like to point out that the stacktrace is very important to me when debugging hanging tests.
I was imagining you might want to get visibility into the test without it hanging, but could explore some of these options.
Actually - I've received an alternative suggestion that would be easier to implement and reason about. Instead, what if `ginkgo -p -debug` (for example) caused each node to stream its output to a file on disk? Then the user could examine those files to understand what each node is doing and whether it is stuck.
In this mode a signal could be interpreted by the coordinating process as a prompt to have the nodes emit their stack trace - perhaps to the same file.
Thoughts?
A bit more on the 'files on disk' suggestion:
- Initially it could be controlled by an environment variable, e.g. `GINKGO_DEBUG_DIR`
- it would be good to specify the directory as you might want to put the output somewhere that it can be captured (for instance if running in Concourse)
- the style of files could be similar to UNIX syslog - i.e. a high resolution timestamp followed by human-readable text
- I don't think there's any need to do clever log rotation etc... for the first implementation
- Another (even simpler) implementation would be to have each worker log to stdout. This could be controlled by a `GINKGO_DEBUG` environment variable, or with multiple uses of the `-v` flag - a bit like `ssh`, which allows things like `ssh -v -v -v`.
I think I lean towards a flag over an env var, as flags are `ginkgo`'s primary interaction model. Can you share why environment variables might be preferred?
My only concern with files on disk is that it could be harder to retrieve those files if the tests are running on Concourse, for example. Collecting the information and printing it to stdout could help debug hanging tests in CI, provided Concourse sends a SIGTERM and allows enough time for ginkgo to print the information to stdout.
Files on disk are easy to look at when debugging failures in CI with hijack, etc., and easy to follow live locally with `tail -f`.

stdout isn't suitable, both because of redaction concerns in CI and because of the difficulty of watching output from multiple nodes in a single stream locally.
I prefer writing to a file as a solution over both an API and stdout.
@onsi, I don't have a strong view on whether it's a flag or an environment variable. I would lean towards an environment variable if it's a "pre-release" feature for which I want feedback from specific users and which might change radically. A command line flag is more appropriate for something that's stable and ready for use by anyone.
I've pushed a file-based approach to the `parallel-debug-file` branch. Would love folks to take a look and send feedback. @williammartin I might need your help vendoring a couple of dependencies that got pulled in, if we choose to go down this road.
With this approach, `ginkgo -p -debug` will emit output from each node to files named `ginkgo-node-N.log` in the test's directory.
@onsi, after what must seem like a geological age, here are some very quick thoughts on the `parallel-debug-file` branch:
- If you run ginkgo recursively, you get a lot of files which are a bit tricky to clean up. You could add them to `.gitignore`, obviously. I just wonder whether a better user experience would be to put them somewhere else. But that could be fixed in a later iteration.
- It would be nice if the lines had ISO 8601 timestamps at the start, a bit like a traditional UNIX syslog. That's particularly useful when looking at potential hangs, as you have some idea of when the last line was written.
- I wouldn't worry too much about vendoring just yet. The branch works in Travis. Vendoring in Go is very much in flux, and there are different solutions for different versions of Go.
thanks for taking a look @blgm
I'd love to find an interface that works well. I think the file approach has merit and agree that ending up with lots of files all over the place isn't great. Any thoughts on an alternative design?
As for timestamps: unfortunately, given the syscall hackery involved in rerouting stdout, I don't think I can inject timestamps.
Perhaps creating them all in the same directory with a naming scheme like `ginkgo-node-U-N.log`, where `U` is some kind of ID? But that would make it harder to correlate files with the code they relate to.
I'd be in favour of merging this as is, with a caveat that it's an experimental feature subject to change. Most users would not add the `--debug` flag, so would not see any difference. But users who observe occasional Ginkgo hangs would have a tool to help them out. We could then get feedback from these users, and base any future changes on that feedback.
Just out of curiosity, what were the reasons to not follow the suggestion from @vito in https://github.com/onsi/ginkgo/issues/323 ?
I am not super familiar with the way Ginkgo handles parallelism, but each node could trap SIGQUIT and output the stack dump to a known file. Maybe we could also create a new process group with the coordinating node and the nodes running the test, so that a user could kill them all with one command and get the stack dumps.
The advantage of this approach is that there's no need to start Ginkgo with special flags. A disadvantage is that maybe the stack dump is harder to read than logs from the nodes.
Oh, I missed the second comment on this thread.
> I was imagining you might want to get visibility into the test without it hanging, but could explore some of these options.

Fair enough, but it assumes users are starting ginkgo with the `--debug` flag. Which is a bit strange... unless they are having issues :)
Good point @nodo
We should update the docs here: https://onsi.github.io/ginkgo/#ginkgo-and-continuous-integration to recommend that folks add the debug flag in CI environments.
I've merged the file solution via #499. I'll leave this open to remind us to update the docs.
> I'll leave this open to remind us to update the docs.
Is it working @blgm ? :)