
The runner doesn't finish when a task fails

Open · mrc0mmand opened this issue 3 years ago · 9 comments

Describe the bug

In one of our systemd repositories, I came across a strange issue: if a step fails, the runner correctly reports the step as failed, but then remains stuck in the "running" state for several hours until it's killed by the global watchdog. Cancelling the job manually in this state is also strange: hitting "Cancel workflow" once doesn't work, and it has to be hit several times before the job is actually killed.

This feels like it's the same issue as https://github.com/actions/runner/issues/700 (and given the last comment from @rubencodes it might still be unresolved).

To Reproduce

Steps to reproduce the behavior: I still can't pinpoint exactly what causes this issue, but I can reproduce it reliably in this PR: https://github.com/redhat-plumbers/systemd-rhel8/pull/302, where the problematic step looks like sudo -E script.sh -> docker exec ... ninja -C build test.
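
For anyone trying to reproduce this, here is a minimal sketch of the failing step's shape as described above; the container name and paths are illustrative placeholders, not taken from the linked PR:

#!/usr/bin/env bash
# Illustrative wrapper in the shape of the failing step: the workflow
# invokes this script via `sudo -E`, and the script runs the test suite
# inside a container via `docker exec`.
set -eu

# Hypothetical container name; the real PR uses its own setup.
CONTAINER="systemd-build"

# `ninja -C build test` runs the meson test suite. When a test fails,
# ninja exits non-zero, so this script exits non-zero and the step is
# reported as failed -- yet the runner then hangs in "running".
docker exec "$CONTAINER" ninja -C build test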

Expected behavior

The job should report a failure immediately when a step fails (or within a reasonable time frame).

Runner Version and Platform

2.294.0

OS of the machine running the runner? OSX/Windows/Linux/...

Ubuntu 22.04

What's not working?

The job reports it failed, e.g.:

Summary of Failures:
375/377 fuzz-varlink_oss-fuzz-14688_address                                        FAIL             0.01s   exit status 127
376/377 fuzz-varlink_oss-fuzz-14708_address                                        FAIL             0.01s   exit status 127
Ok:                 365 
Expected Fail:      0   
Fail:               2   
Unexpected Pass:    0   
Skipped:            10  
Timeout:            0   
Full log written to /build/build/meson-logs/testlog.txt
FAILED: meson-test 
/usr/bin/meson test --no-rebuild --print-errorlogs
ninja: build stopped: subcommand failed.
Error: Process completed with exit code 1.

but then remains stuck for several hours in "running" state until it's eventually cancelled by the global watchdog:

[screenshot: the job shown stuck in the "running" state in the Actions UI]

The job itself: https://github.com/redhat-plumbers/systemd-rhel8/actions/runs/2690828471/attempts/2

A bit of further debugging shows that the information about the failed process correctly "bubbles up" the tree all the way to the runner, which then gets stuck:

 Summary of Failures:
154/372 test-bpf                                                                   FAIL             0.03s   killed by signal 6 SIGABRT
Ok:                 362 
Expected Fail:      0   
Fail:               1   
Unexpected Pass:    0   
Skipped:            9   
Timeout:            0   
Full log written to /build/build/meson-logs/testlog.txt
FAILED: meson-test 
/usr/bin/meson test --no-rebuild --print-errorlogs
ninja: build stopped: subcommand failed.
+ at_exit
+ echo 'Hello from at_exit()'
+ pstree -Aapust 3388
Hello from at_exit()
systemd,1
  `-provisioner,668 --agentdirectory /home/runner/runners --settings /opt/runner/provisioner/.settings
      `-Runner.Listener,2035,runner run
          `-Runner.Worker,2057 spawnclient 112 115
              `-bash,3386 -e /home/runner/work/_temp/73c812fa-ef6d-4cb8-a92a-db5c79d10865.sh
                  `-sudo,3387,root -E .github/workflows/unit_tests.sh RUN_GCC
                      `-unit_tests.sh,3388 .github/workflows/unit_tests.sh RUN_GCC
                          `-pstree,13946 -Aapust 3388
+ exit 1
Error: Process completed with exit code 1.
<here the runner stops responding>
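
For reference, the at_exit output in the trace above can be produced with a shell trap along these lines; this is a sketch reconstructed from the trace, not the actual unit_tests.sh:

#!/usr/bin/env bash
# Sketch of the debugging trap whose output is visible in the trace above.
# On exit it prints the process tree rooted at this script's PID, so the
# runner's child processes can be inspected right before the script
# returns its failure code.
set -eux

at_exit() {
    echo 'Hello from at_exit()'
    # pstree flags: -A ASCII lines, -a show arguments, -p show PIDs,
    # -u show uid transitions, -s show parent processes, -t full thread names
    pstree -Aapust $$
}

trap at_exit EXIT

# ... the actual build/test commands (e.g. `ninja -C build test`) go here ...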

mrc0mmand avatar Jul 20 '22 16:07 mrc0mmand

Hey @mrc0mmand,

Thank you for reporting it and providing follow-up information. I have triaged this issue and added it to the board. :relaxed:

nikola-jokic avatar Aug 02 '22 09:08 nikola-jokic

Same problem here: https://github.com/whp98/telegram-bot-api-build/actions/runs/2985261238

whp98 avatar Sep 03 '22 18:09 whp98

@nikola-jokic, for what it's worth, my team is experiencing this issue with our self-hosted runners.

Below is a link to, and a screenshot of, our most recent problematic job run, which still shows as "pending" despite the job having completed successfully 20 hours earlier.

  • job run - https://github.com/rapidsai/cucim/actions/runs/4107749806/jobs/7087812760

[screenshot: the completed job still shown as "pending"]

ajschmidt8 avatar Feb 07 '23 17:02 ajschmidt8

This issue is intermittent, but we are facing it as well: the step (subtask) remains in the pending state forever, even when the job has completed steps that are downstream of it. Is this issue being tracked? @nikola-jokic

mmadhur-cops avatar Apr 03 '23 08:04 mmadhur-cops

Any updates? @nikola-jokic

mmadhur-cops avatar Jun 08 '23 07:06 mmadhur-cops

Is this issue being tracked by anyone? @nikola-jokic

mmadhur-cops avatar Sep 05 '23 08:09 mmadhur-cops

I am experiencing the same issue: the job keeps running even after completion. Any update on this?

tepatelcmc avatar Sep 28 '23 06:09 tepatelcmc

We are facing the same issue fairly consistently.

JamesMBartlett avatar Jan 19 '24 18:01 JamesMBartlett

It's been a long time. We are facing the same issue! Any update?

vikranth-t avatar Jul 09 '24 16:07 vikranth-t