truffleruby
truffleruby copied to clipboard
Timeout while running timeout tests
👋 Hi! I recently looked at the CI run on the upstream repo. The job truffleruby-head on ubuntu-latest took a while (>26 minutes) to do anything in the step that executes the tests. https://github.com/ruby/timeout/actions/runs/10595331126/job/29360878189
The step is...
bundle exec rake test
Got ideas on what's going on? The macos-latest seem to be done in 5s.
(If you wish to shunt this Issue comment to its own something, Discussion, Issue, comment thread in the ruby/timeout repo, please shunt away.)
Originally posted by @olleolleolle in https://github.com/oracle/truffleruby/issues/3426#issuecomment-2315055110 Ref: https://github.com/ruby/timeout/pull/50#issuecomment-2314984707
Not sure, no, there is definitely something hanging there. The only output we see is:
2024-08-28T10:58:52.7217665Z ##[group]Run bundle exec rake test
2024-08-28T10:58:52.7218348Z [36;1mbundle exec rake test[0m
2024-08-28T10:58:52.7393963Z shell: /usr/bin/bash -e {0}
2024-08-28T10:58:52.7394516Z ##[endgroup]
2024-08-28T12:47:11.7829453Z ##[error]The operation was canceled.
2024-08-28T12:47:11.7912328Z Post job cleanup.
i.e. no output at all. Unfortunately GitHub Actions when they are cancelled or timeout seem to use SIGKILL so there not even an opportunity to flush any IO buffers (though on TruffleRuby stdout & stderr are unbuffered so at least that shouldn't be a problem), print a stacktrace, or anything like that.
It seems somewhat transient yet frequent, looking at https://github.com/ruby/timeout/actions BTW it also hapenned on the 24.0.2 release in https://github.com/ruby/timeout/actions/runs/10594875253/job/29359440681
I ran it 10 times locally on Linux and it succeeded each time and took between 5 and 10 seconds. So it seems hard to reproduce.
I think the next step would be to find or create a "soft timeout utility" for GitHub Actions. I.e. some program that runs a given shell command with a given timeout and when the timeout is reached sends SIGQUIT (which shows Java stacktraces on TruffleRuby), waits a couple seconds, then send SIGALRM (which shows Ruby stacktraces on TruffleRuby), wait a couple seconds and then send SIGKILL. It could be just a small Ruby script or in any other language. FWIW mspec already does something like that in https://github.com/ruby/mspec/blob/5354a405e327bb144f00157b913aa544c3b77e4b/lib/mspec/runner/actions/timeout.rb#L110-L144
(Maybe this could also be available as a command-line option of TruffleRuby for convenience, although that would very likely need to create an extra Ruby Thread internally to handle this which doesn't seem ideal.)