
CI timeouts

Open maflcko opened this issue 1 year ago • 21 comments

Looks like some CI tasks are timing out, starting a few days ago.

It would be good to investigate and fix the problem.

A quick glance at the ccache stats may indicate that ccache stopped working for some tasks.

However, CI should also pass with a clean ccache, so the issue may be a different one, or there may be more than one issue.

maflcko avatar Sep 09 '24 06:09 maflcko

A quick glance at the ccache stats may indicate that ccache stopped working for some tasks.

Yea, seems like our ccache efficiency has tanked post-CMake. Is it possible this is somehow a stagnant cache or a related issue (although, given it's been 2 weeks since the switch, I'd think caches would have cycled)?

ccache hits for some CI jobs of the last merge into master (https://github.com/bitcoin/bitcoin/runs/29789973072):

  • 32-bit CentOS: Hits: 1 / 715 (0.14%)
  • Win64 Cross: Hits: 1 / 713 (0.14%)
  • 32-bit multiprocess: Hits: 76 / 747 (10.17%)
  • macOS native: Hits: 330 / 717 (46.03%)
  • *SAN job: Hits: 347 / 720 (48.19%)
  • Tidy job: Hits: 572 / 720 (79.44%)
  • Previous releases: Hits: 589 / 724 (81.35%)

However, CI should also pass with a clean ccache, so the issue may be a different one, or there may be more than one issue.

Right, and given that the Previous releases task seems to be timing out even with 80% ccache hits, there could be another issue (unless it was already on the borderline).

fanquake avatar Sep 09 '24 09:09 fanquake

@maflcko

Can you clear all ccache caches so that we can observe the CI jobs under these conditions?

hebasto avatar Sep 09 '24 10:09 hebasto

A quick glance at the ccache stats may indicate that ccache stopped working for some tasks.

Yea, seems like our ccache efficiency has tanked post-CMake.

I believe that https://cirrus-ci.com/build/5697885973512192 is more representative, as the change is docs-only.

As for Win64 Cross, the statistics show Hits: 120 / 713 (16.83%), which also looks questionable.

hebasto avatar Sep 09 '24 11:09 hebasto

The CI machines may work on pull requests, which may (in)validate the cache, so ccache stats may generally not be 100% reliable.

It is possible to trigger a run with a clean cache volume locally (by deleting all volumes), or on CI, by creating a new namespace for the volumes:

diff --git a/ci/test/02_run_container.sh b/ci/test/02_run_container.sh
index afd447c347..53c0b4d8bc 100755
--- a/ci/test/02_run_container.sh
+++ b/ci/test/02_run_container.sh
@@ -25,15 +25,15 @@ if [ -z "$DANGER_RUN_CI_ON_HOST" ]; then
       --tag="${CONTAINER_NAME}" \
       "${BASE_READ_ONLY_DIR}"
 
-  docker volume create "${CONTAINER_NAME}_ccache" || true
-  docker volume create "${CONTAINER_NAME}_depends" || true
-  docker volume create "${CONTAINER_NAME}_depends_sources" || true
-  docker volume create "${CONTAINER_NAME}_previous_releases" || true
-
-  CI_CCACHE_MOUNT="type=volume,src=${CONTAINER_NAME}_ccache,dst=$CCACHE_DIR"
-  CI_DEPENDS_MOUNT="type=volume,src=${CONTAINER_NAME}_depends,dst=$DEPENDS_DIR/built"
-  CI_DEPENDS_SOURCES_MOUNT="type=volume,src=${CONTAINER_NAME}_depends_sources,dst=$DEPENDS_DIR/sources"
-  CI_PREVIOUS_RELEASES_MOUNT="type=volume,src=${CONTAINER_NAME}_previous_releases,dst=$PREVIOUS_RELEASES_DIR"
+  docker volume create "new_${CONTAINER_NAME}_ccache" || true
+  docker volume create "new_${CONTAINER_NAME}_depends" || true
+  docker volume create "new_${CONTAINER_NAME}_depends_sources" || true
+  docker volume create "new_${CONTAINER_NAME}_previous_releases" || true
+
+  CI_CCACHE_MOUNT="type=volume,src=new_${CONTAINER_NAME}_ccache,dst=$CCACHE_DIR"
+  CI_DEPENDS_MOUNT="type=volume,src=new_${CONTAINER_NAME}_depends,dst=$DEPENDS_DIR/built"
+  CI_DEPENDS_SOURCES_MOUNT="type=volume,src=new_${CONTAINER_NAME}_depends_sources,dst=$DEPENDS_DIR/sources"
+  CI_PREVIOUS_RELEASES_MOUNT="type=volume,src=new_${CONTAINER_NAME}_previous_releases,dst=$PREVIOUS_RELEASES_DIR"
 
   if [ "$DANGER_CI_ON_HOST_CACHE_FOLDERS" ]; then
     # ensure the directories exist
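
For the local variant, a minimal sketch of deleting the volumes (volume names assumed to follow the CONTAINER_NAME scheme used in the script above):

# Remove the cached volumes for one CI task so the next run starts cold
# (CONTAINER_NAME must match the task's env file, e.g. ci_native_asan; assumed)
docker volume rm \
  "${CONTAINER_NAME}_ccache" \
  "${CONTAINER_NAME}_depends" \
  "${CONTAINER_NAME}_depends_sources" \
  "${CONTAINER_NAME}_previous_releases"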

maflcko avatar Sep 09 '24 11:09 maflcko

I believe that https://cirrus-ci.com/build/5697885973512192 is more representative, as the change is docs-only.

Possibly, but this still looks all over the place:

  • 32-bit CentOS: Hits: 0 / 715 (0.00%)
  • arm: Hits: 3 / 713 (0.42%)
  • Win64 cross: Hits: 120 / 713 (16.83%)
  • mac cross: Hits: 144 / 726 (19.83%)
  • TSAN: Hits: 165 / 724 (22.79%)
  • No wallet, kernel: Hits: 307 / 695 (44.17%)
  • MSAN: Hits: 290 / 645 (44.96%)
  • Fuzzer: Hits: 281 / 447 (62.86%)

The CI machines may work on pull requests, which may (in)validate the cache, so ccache stats may generally not be 100% reliable.

Do we have any scope to increase the ccache size, to reduce the likelihood of this?

fanquake avatar Sep 09 '24 11:09 fanquake

Can we not bump the ccache size for the (Cirrus) workflows that we use self-hosted runners for? Currently these values are pretty low IMO, 100-400MB.

These ccaches, IIUC, are volumes mounted from the runner host itself and don't need to be restored from e.g. a GitHub Actions cache, so they do not need to be constrained in the same way as is best for those workflows.

Even at 1GB per cache, this would be 12GB in total, which should be fine on an 80 or 160GB disk?
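
For illustration, a bump might look something like this, assuming the per-task limit is set via an environment variable in the ci/test/00_setup_env_*.sh files (the exact variable name and current values are assumptions):

# Illustrative only: raise the ccache limit for a self-hosted task
export CCACHE_MAXSIZE=1G    # CCACHE_MAXSIZE is ccache's own size knob
ccache -M 1G                # same limit, written into the cache's own config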

edit: ah I see fanquake got here first...

willcl-ark avatar Sep 09 '24 11:09 willcl-ark

About the cache issue, a better solution may be to set CCACHE_READONLY in pull requests.
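
A minimal sketch of that idea (the PR-detection variable and where this would be wired into the CI scripts are assumptions, not taken from the repo):

# Hypothetical: only merge/master runs update the shared cache volume
if [ -n "${CIRRUS_PR:-}" ]; then
  export CCACHE_READONLY=1    # ccache still serves hits, but stores nothing new
fi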

However, the CI timeout issue seems to be of a different cause, because the runtime for the unit tests shouldn't fluctuate by 100%: https://cirrus-ci.com/task/4803595281891328?logs=ci#L8618 vs https://cirrus-ci.com/task/4981166594326528?logs=ci#L9169

maflcko avatar Sep 09 '24 11:09 maflcko

Thought this might just be constrained to that job, but:

  • ASAN (currently running PR): Total Test time (real) = 1068.87 sec
  • ASAN (master finished 23 minutes ago): Total Test time (real) = 498.22 sec

fanquake avatar Sep 09 '24 15:09 fanquake

ASAN is running on GHA infra, so this suggests that the issue may not be infra-related, but rather related to the code, config, or build files in this repo?

maflcko avatar Sep 09 '24 16:09 maflcko

Yea. It also doesn't seem to be a difference between PRs and merge commits, given that the current merge run is slow while a simultaneously running PR is fast:

  • ASAN (most recent merge commit): Total Test time (real) = 1086.64 sec
  • ASAN (current PR run): Total Test time (real) = 510.99 sec

fanquake avatar Sep 10 '24 11:09 fanquake

It also doesn't look CMake-related, because the latest 28.x build also times out: https://github.com/bitcoin/bitcoin/runs/29755145359

maflcko avatar Sep 10 '24 11:09 maflcko

One contributing factor to the timeouts may be a recently introduced bug in the exclude parsing (https://github.com/bitcoin/bitcoin/pull/30244#issuecomment-2344009687), which affects one specific CI task. However, this still does not explain why suddenly all(?) CI tasks in all branches (at least master and 28.x) are affected.

maflcko avatar Sep 11 '24 15:09 maflcko

macOS on GHA also seems to time out: https://github.com/bitcoin/bitcoin/actions/runs/10828462406/job/30043878794

maflcko avatar Sep 12 '24 14:09 maflcko

I opened https://github.com/bitcoin/bitcoin/pull/30879 to re-bucket the tests that are taking a few minutes longer on CI than they are bucketed for. It's not a big difference, but it may help a little.

willcl-ark avatar Sep 12 '24 14:09 willcl-ark

Yeah, I think we keep finding small mistakes, but there must still be a large underlying problem. In all the years, I've never seen the CI go from working fine to a large chunk of tasks timing out after 2(!) hours for no apparent reason, even across completely separate infrastructures (GHA, Cirrus workers).

maflcko avatar Sep 12 '24 14:09 maflcko

I recreated a CI timeout and the machine's CPU was idle, flat at 0%, for most of the 2 hours, but I didn't look into where it spent so much time "doing nothing".

maflcko avatar Sep 12 '24 18:09 maflcko

I recreated a CI timeout and the machine's CPU was idle, flat at 0%, for most of the 2 hours, but I didn't look into where it spent so much time "doing nothing".

I have noticed this in the Cirrus GUI before; it seems to have always been like this though, IIRC:

e.g. https://cirrus-ci.com/task/6481793086390272: [screenshot of the task's CPU usage graph]

or a random pre-CMake run: https://cirrus-ci.com/task/6010695337115648

Perhaps these shared machines are hitting other IO limits? There doesn't seem to be much other explanation for it. Out of interest, what configuration did you use to re-create the CI resources? I'd like to be able to do the same.

For comparison (with the 0%), this is what I see when running the unit and functional tests locally (not a clean system, but it gives a rough idea):

[screenshot of local CPU usage while running the unit and functional tests]

The 20-75s region is waiting for these to complete:

133/134 Test #5: noverify_tests ....................... Passed 36.41 sec
134/134 Test #6: tests ................................ Passed 76.79 sec

And the drop-off towards the end is the longer-running functional tests completing:

Remaining jobs: [feature_assumeutxo.py, p2p_node_network_limited.py --v1transport, p2p_node_network_limited.py --v2transport, p2p_permissions.py, feature_config_args.py, p2p_handshake.py, p2p_handshake.py --v2transport, p2p_ibd_txrelay.py, p2p_seednode.py]

I will try a local CI run soon and see what that looks like, and if there's any difference.
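
For reference, a single CI task can be run locally in a container roughly like this (see ci/README.md for the authoritative invocation; the env file chosen here is just an example):

# Run one CI task locally in a clean environment (sketch; check ci/README.md)
env -i HOME="$HOME" PATH="$PATH" USER="$USER" bash -c \
  'FILE_ENV="./ci/test/00_setup_env_native_asan.sh" ./ci/test_run_all.sh'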

willcl-ark avatar Sep 13 '24 09:09 willcl-ark

The 20-75s region is waiting for these to complete:

133/134 Test #5: noverify_tests ....................... Passed 36.41 sec
134/134 Test #6: tests ................................ Passed 76.79 sec

This part is addressed in https://github.com/bitcoin-core/secp256k1/pull/1581.

hebasto avatar Sep 13 '24 09:09 hebasto

I have noticed this in the Cirrus GUI before; it seems to have always been like this though, IIRC:

Yeah, I think the graphs there are not accurate and a bit misleading. It may be best to ignore them. I used a CPX51 Hetzner box with 16 vCPUs, which should be enough to run any CI task with a completely empty cache. I just looked at the CPU/IO usage in the Cloud Console graph.

Perhaps these shared machines are hitting other IO limits? There doesn't seem to be much other explanation for it. Out of interest, what configuration did you use to re-create the CI resources? I'd like to be able to do the same.

It is certainly possible that this could be an IO limit, but the other graphs were also at 0%, so the most plausible explanation would be a network IO limit, not something local. I haven't been able to reproduce it since then, but I keep trying.
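
For anyone trying the same, the box itself can be watched instead of relying on the provider graphs; a sketch using standard tools (package availability assumed):

# In a second shell on the CI box while the task runs:
vmstat 5        # CPU usage and IO-wait ('wa') summary every 5 seconds
iostat -xz 5    # per-device IO utilisation; needs the sysstat package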

I could reproduce the macOS 2h timeout issue on GHA yesterday. However, today I also fail to reproduce it.

maflcko avatar Sep 13 '24 09:09 maflcko

network IO

This reminds me that Cirrus logs are randomly disappearing. So maybe there is an issue in the Cirrus backend that streams logs, and the streaming process dies or turns into a zombie process?

maflcko avatar Sep 13 '24 10:09 maflcko

Some examples:

  • https://cirrus-ci.com/task/5353088029032448 has missing logs (404). However, this one is not a timeout.
  • https://cirrus-ci.com/task/5260390286753792 is a timeout, but has logs
  • https://cirrus-ci.com/task/4532788332855296 is a timeout, and has missing logs

Edit: https://www.github.com/cirruslabs/cirrus-ci-docs/issues/1294

maflcko avatar Sep 13 '24 10:09 maflcko

I could reproduce the macOS 2h timeout issue on GHA yesterday. However, today I also fail to reproduce it.

Actually, it looks like the GHA macOS CI timeouts remain today, for example: https://github.com/bitcoin/bitcoin/actions/runs/10833854937/job/30163805645?pr=30866 or https://github.com/bitcoin/bitcoin/actions/runs/10829942238/job/30163952659?pr=30856

I am not really familiar with GHA, nor with macOS, so it would be good if someone else checked:

  • Does the macOS 13 GHA task succeed at all with an empty cache?
  • Does the M1 (macOS 14) GHA runner give a better performance?

Ref: https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/choosing-the-runner-for-a-job#standard-github-hosted-runners-for-public-repositories

maflcko avatar Sep 15 '24 14:09 maflcko

I am not really familiar with GHA, nor with macOS, so it would be good if someone else checked:

  • Does the macOS 13 GHA task succeed at all with an empty cache?

https://github.com/hebasto/bitcoin/actions/runs/10880757045 (caches have been cleaned for all jobs)
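
(For reference, one way to clean them from the command line is the gh CLI's built-in cache commands, available in recent gh versions; the repo name below is just the fork linked above:)

# List and delete all GitHub Actions caches for a fork (sketch)
gh cache list --repo hebasto/bitcoin
gh cache delete --all --repo hebasto/bitcoin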

hebasto avatar Sep 16 '24 09:09 hebasto

I am not really familiar with GHA, nor with macOS, so it would be good if someone else checked:

  • Does the M1 (macOS 14) GHA runner give a better performance?

https://github.com/hebasto/bitcoin/actions/runs/10881199478/job/30189639531

hebasto avatar Sep 16 '24 09:09 hebasto

Thanks. So it looks like even with an empty cache, the macOS 13 GHA task can complete in less than 50 minutes, way less than the 2h timeout. Also, the macOS 14 GHA task may complete even faster, in less than 40 minutes.

Not sure if this helps to debug this issue. Maybe https://github.com/bitcoin/bitcoin/pull/30866 and https://github.com/bitcoin/bitcoin/pull/30856 are based on an old(er) commit, which may be what GitHub uses to run the CI. So I guess rerunning them via the web interface won't help, and a rebase could be considered.

(Unrelated to this issue, one could consider moving to the M1 runners, given that they may speed up the build. Is the removal of sudo xcode-select --switch /Applications/Xcode_15.0.app required? Personally I find it useful to have the minimum supported Xcode version documented and checked by CI.)

maflcko avatar Sep 16 '24 10:09 maflcko

I did a macOS 13 GHA run (https://github.com/maflcko/bitcoin-core-with-ci/actions/runs/10886978505/job/30208236022) and it failed in feature_block.py on ~master with: 2024-09-16T15:55:29.8666110Z node0 stderr Error: A fatal internal error occurred, see debug.log for details: Corrupt block found indicating potential hardware failure.

So possibly GitHub could be facing hardware issues on macOS, leading to other issues down the line?

maflcko avatar Sep 17 '24 07:09 maflcko

(Unrelated to this issue, one could consider moving to the M1 runners, given that they may speed up the build. Is the removal of sudo xcode-select --switch /Applications/Xcode_15.0.app required? Personally I find it useful to have the minimum supported Xcode version documented and checked by CI.)

Proposed alternative in https://github.com/bitcoin/bitcoin/pull/30913

maflcko avatar Sep 17 '24 08:09 maflcko

I am seeing a 2h timeout on running clang --version, even before any CI-specific code is run: https://github.com/maflcko/bitcoin-core-with-ci/actions/runs/10897268811/job/30238399156

So I fail to see how this could be anything other than a GitHub hardware, software, or data corruption bug.

I don't think this can be fixed on our side, apart from removing the task.

maflcko avatar Sep 17 '24 08:09 maflcko

Closing for now. Commit a95e742b69207d7541afc6903b3fd72131f4d6de (x86_64-apple-darwin22.6.0 -> arm64-apple-darwin23.6.0) may have switched the task to more recent hardware, which may be less likely to be corrupt.

In any case, if someone is still seeing a CI timeout (that is, a CI failure caused by the 2-hour timeout), or a CI task that takes longer than half of the timeout (1h), it would be good to mention it either in this issue or in a new one.

The 2-hour CI timeout is picked so that, even with an empty ccache, the CI is able to pass easily. It only exists to catch real timeouts (network timeout, zombie process, infinite sleep, infinite loop, ...). Any task taking longer than half the overall timeout should be tracked down and fixed.

maflcko avatar Sep 18 '24 07:09 maflcko

Looks like this is still a problem. The macOS 14 task was able to finish in less than 30 minutes, even without a ccache at all.

Now it is stuck again for more than 90 minutes: https://github.com/bitcoin/bitcoin/actions/runs/10956545011/job/30422825062?pr=30935#step:7:3249

maflcko avatar Sep 20 '24 10:09 maflcko