
Inscrutable "Actions workflow run is stale" error

Open blast-hardcheese opened this issue 4 years ago • 18 comments

I'm getting a lot of sporadic failures in reporting, possibly due to the number of parallel builds that are attempting to submit coverage reports.

My project is configured to build the core tests, which takes about four minutes, and then build over twenty integration test suites, each of which takes five or more minutes. It seems as though we may be dancing right on the edge of some sort of limit, possibly due to my naive understanding of the after_n_builds option.
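For context, `after_n_builds` tells Codecov how many uploads to wait for before computing status and notifying, and is set in `codecov.yml`. A minimal sketch, assuming a 22-upload matrix like the one described (the count is illustrative and must match your actual number of uploads per commit):

```yaml
# codecov.yml -- the count must equal the number of uploads per commit
codecov:
  notify:
    after_n_builds: 22   # wait for all 22 coverage uploads before notifying
comment:
  after_n_builds: 22     # likewise, post the PR comment only once all uploads arrive
```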

Unfortunately, Googling anything about {'detail': ErrorDetail(string='Actions workflow run is stale', code='not_found')} turns up nothing, so hopefully people will at least find this issue from now on.

Could you explain how to increase the timeout Codecov uses while waiting for coverage segments, or, if that's not the cause, how to resolve this error?

Thank you for your assistance, as well as for an excellent product!


==> Uploading reports
    url: https://codecov.io
    query: branch=update%2Fjackson-core-2.12.1&commit=5e16535e81483a6a07612ba10cfe32c328469103&build=598338763&build_url=http%3A%2F%2Fgithub.com%2Ftwilio%2Fguardrail%2Factions%2Fruns%2F598338763&name=&tag=&slug=twilio%2Fguardrail&service=github-actions&flags=&pr=927&job=CI&cmd_args=n,F,Q,Z,f
->  Pinging Codecov
https://codecov.io/upload/v4?package=github-action-20210129-7c25fce&token=secret&branch=update%2Fjackson-core-2.12.1&commit=5e16535e81483a6a07612ba10cfe32c328469103&build=598338763&build_url=http%3A%2F%2Fgithub.com%2Ftwilio%2Fguardrail%2Factions%2Fruns%2F598338763&name=&tag=&slug=twilio%2Fguardrail&service=github-actions&flags=&pr=927&job=CI&cmd_args=n,F,Q,Z,f
{'detail': ErrorDetail(string='Actions workflow run is stale', code='not_found')}
404
==> Uploading to Codecov
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  182k  100    81  100  182k    400   904k --:--:-- --:--:-- --:--:--  904k
    {'detail': ErrorDetail(string='Actions workflow run is stale', code='not_found')}
Error: Codecov failed with the following error: The process '/usr/bin/bash' failed with exit code 1

blast-hardcheese avatar Feb 25 '21 05:02 blast-hardcheese

Hi @blast-hardcheese, we are working to understand the issue here, but I think for now as a workaround, you can supply the Codecov upload token. Do you have a GitHub Actions CI link that we can take a look at btw?

thomasrockhu avatar Feb 26 '21 15:02 thomasrockhu
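The token workaround mentioned above would look roughly like this in a workflow step (a sketch; the secret name `CODECOV_TOKEN` is an assumption, and the action was at v1 at the time):

```yaml
# One step per job that uploads coverage
- uses: codecov/codecov-action@v1
  with:
    token: ${{ secrets.CODECOV_TOKEN }}   # repository secret holding the Codecov upload token
```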

Do you have a GitHub Actions CI link that we can take a look at btw?

Sure -- you can take a look at many of the recent failures on https://github.com/guardrail-dev/guardrail/ , one example is https://github.com/guardrail-dev/guardrail/pull/1000/checks?check_run_id=1976440163 .

I've just been re-running all the checks and usually the subsequent run is successful.

blast-hardcheese avatar Feb 26 '21 18:02 blast-hardcheese

Additionally, I've moved this repo out from where it was previously hosted, https://github.com/twilio/guardrail/ , within the past 24 hours -- that may impact your investigation. If you need more samples from after the repo was moved over, I can submit them as they come in -- library upgrade PRs are the most likely to trigger this, due to the rate of submission.

blast-hardcheese avatar Feb 26 '21 19:02 blast-hardcheese

More recent example after moving the repo to a new org and re-authorizing: https://github.com/guardrail-dev/guardrail/pull/1004/checks?sha=ff99a5dfa20d69e2f8519ca7d6569f5a6ebb63a8

blast-hardcheese avatar Feb 27 '21 17:02 blast-hardcheese

@blast-hardcheese, unless I'm missing something, I couldn't find the above error in that latest link. Apologies if it's really blatant and I missed it, but would you mind sharing the name of the job that failed?

thomasrockhu avatar Mar 02 '21 00:03 thomasrockhu

@thomasrockhu Ack! I didn't realize that re-running the workflow erased the failure, I thought links were stable.

I was able to reproduce the error on an already merged PR, so this should not change:

https://github.com/guardrail-dev/guardrail/pull/1004/checks?check_run_id=2009728188

Sorry about that!

blast-hardcheese avatar Mar 02 '21 05:03 blast-hardcheese

I don't know if this is related, but if this is a race condition, it very well may be: we're also experiencing the exact opposite problem. All after_n_builds runs (22 of them) report successfully and asynchronously to codecov.io for a PR, but the callback never fires, so we never get a response for the required Codecov build phase.

A normal run looks like this: [screenshot of a normal run]

In this example, it was just hung like this (I've since merged the PR, but you can still see that Codecov is not in the reported checks for that PR, meaning the callback never fired): [screenshot of the hung checks]

blast-hardcheese avatar Mar 02 '21 21:03 blast-hardcheese

@blast-hardcheese, I think I resolved most of the "Actions workflow run is stale" errors. Let me know if that's not the case.

As for the most recent example, it didn't fire because we had only received 16 builds (and not 22). It's a little challenging to see which build didn't upload properly, do you happen to know the names of the jobs?

thomasrockhu avatar Mar 03 '21 14:03 thomasrockhu

In that particular example, it looks like some/all of the Scala 15 builds didn't run tests or try to upload to Codecov

thomasrockhu avatar Mar 03 '21 14:03 thomasrockhu

Hi! Let me know if I should open a new issue for this, but we're having an identical problem. We're planning on reducing the size of our testing matrix in the near future, will this alleviate the problem? Otherwise if you could take a look that'd be great! Thanks :)

Yoshanuikabundi avatar Mar 11 '21 07:03 Yoshanuikabundi

In that particular example, it looks like some/all of the Scala 15 builds didn't run tests or try to upload to Codecov

You're completely correct. I didn't realize that I had excluded some coverage uploads while also using after_n_builds -- sorry for confusing the issue here.

I haven't seen the Actions workflow run is stale error in more than a week at this point, so may I ask what you did on your end? Is this something I could have done via the Codecov UI somehow, and is there a possibility of it resurfacing? I've noticed some other 👍s on the initial issue, so presumably others are running into this as well.

blast-hardcheese avatar Mar 16 '21 05:03 blast-hardcheese

(Also, thank you again for all your help here!)

blast-hardcheese avatar Mar 16 '21 05:03 blast-hardcheese

FWIW I have been running into this as recently as yesterday in my project too - https://github.com/laurynas-biveinis/unodb/runs/2109776375?check_suite_focus=true

In my case there are two flag-separated configurations, which get uploaded in parallel. Perhaps they should be serialized?

laurynas-biveinis avatar Mar 16 '21 06:03 laurynas-biveinis
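Serializing the two flag-separated uploads within a single job would amount to running the action twice in sequence, since steps in one job run one after another (flag names hypothetical):

```yaml
- uses: codecov/codecov-action@v1
  with:
    flags: config-a   # hypothetical flag name for the first configuration
- uses: codecov/codecov-action@v1
  with:
    flags: config-b   # second upload starts only after the first step completes
```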

@laurynas-biveinis I'm looking into making a patch for this. We should hopefully have that particular edge case fixed this week.

thomasrockhu avatar Mar 21 '21 19:03 thomasrockhu

I was having this problem and I found adding the Codecov token as a GitHub Actions secret helped. However, I'm now getting this error on every merge to my main branch, after the jobs for the same commit on its feature branch (pre-merge) succeed.

briansmith avatar Mar 23 '21 07:03 briansmith

Here's my log of the failure: https://github.com/briansmith/ring/runs/2172862556?check_suite_focus=true

briansmith avatar Mar 23 '21 07:03 briansmith

I was having this problem and I found adding the Codecov token as a GitHub Actions secret helped.

Unfortunately, for any GitHub organization with a wider community this creates the potential for an access token to leak, so we at Nextcloud dropped our Codecov tokens from the action, since the README says they are not required for public repositories.

Our current mitigation is to report coverage only for a few CI runs, though that can potentially lower the reported coverage as some paths are only triggered by certain tests in our matrix.

ChristophWurst avatar Mar 23 '21 07:03 ChristophWurst

I was having this problem and I found adding the Codecov token as a GitHub Actions secret helped. However, I'm now getting this error on every merge to my main branch, after the jobs for the same commit on its feature branch (pre-merge) succeeds.

I was mistaken. Although I did start the process of adding a Codecov token as a secret within my GitHub Actions workflow, I never got around to hooking it up to my use of this action, so it was never used. Thus it had no effect. It seems like Codecov must have addressed the issue here on its end.

In issue #300 I suggest a different solution that doesn't require using a Codecov access token: move the uploading of coverage out of the jobs that collect it, into a single dedicated job. If only one job submits coverage data to Codecov, you avoid the timeout issue described above, AFAICT, and you can also properly minimize permissions on the GitHub token. Each job that collects coverage would upload its coverage data as an artifact; the submitting job would then download those artifacts and use "needs:" to tell GitHub Actions about the dependency between the jobs.

briansmith avatar Apr 29 '21 22:04 briansmith
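The artifact-based pattern briansmith describes can be sketched as follows (job names, suite names, file names, and action versions are all illustrative assumptions):

```yaml
jobs:
  test:
    strategy:
      matrix:
        suite: [core, integration-a, integration-b]   # hypothetical suite names
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./run-tests.sh "${{ matrix.suite }}"     # assumed to write coverage.xml
      - uses: actions/upload-artifact@v2
        with:
          name: coverage-${{ matrix.suite }}
          path: coverage.xml

  upload-coverage:
    needs: test                 # waits for every matrix job to finish
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v2   # with no name, downloads all artifacts
      - uses: codecov/codecov-action@v1      # single upload, so no race between parallel jobs
```

With a single uploader, the test jobs no longer need any Codecov-related permissions or secrets, which is the permission-minimization point made above.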

Closing as this no longer seems to be an issue.

thomasrockhu-codecov avatar Feb 28 '23 15:02 thomasrockhu-codecov