
Cache restoration intermittently hangs at +90%

tyrannosaurus-becks opened this issue · 5 comments

We are intermittently seeing our cache restoration hang once the download is over 90% complete. It can hang for over 10 minutes, so we've set a 10-minute timeout to terminate the affected jobs. When our developers retry the workflow run without any change, the run often succeeds.
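
For reference, here is a minimal sketch of one way such a timeout can be wired up with GitHub Actions' built-in timeout-minutes setting; the job name, step inputs, and exact placement are illustrative rather than our real workflow:

jobs:
  precompile-assets:                 # illustrative job name
    runs-on: ubuntu-latest
    timeout-minutes: 10              # cancel the whole job if it runs longer than 10 minutes
    steps:
      - uses: actions/checkout@v3
      - name: Precompiled Assets Cache
        uses: actions/cache@v3
        timeout-minutes: 10          # alternatively, bound just the cache-restore step
        with:
          path: tmp/**                          # illustrative path
          key: precompiled-assets-example-key   # illustrative key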

Sample logs:

2022-07-25T20:24:42.4105298Z Received 20971520 of 532098493 (3.9%), 20.0 MBs/sec
2022-07-25T20:24:43.4104976Z Received 125829120 of 532098493 (23.6%), 60.0 MBs/sec
2022-07-25T20:24:44.4105956Z Received 243269632 of 532098493 (45.7%), 77.3 MBs/sec
2022-07-25T20:24:45.4128635Z Received 356515840 of 532098493 (67.0%), 84.9 MBs/sec
2022-07-25T20:24:46.4134718Z Received 427819008 of 532098493 (80.4%), 81.6 MBs/sec
2022-07-25T20:24:47.4133829Z Received 465567744 of 532098493 (87.5%), 74.0 MBs/sec
2022-07-25T20:24:48.4138417Z Received 507510784 of 532098493 (95.4%), 69.1 MBs/sec
2022-07-25T20:24:49.4148580Z Received 511126973 of 532098493 (96.1%), 60.9 MBs/sec
2022-07-25T20:24:50.4167174Z Received 511126973 of 532098493 (96.1%), 54.1 MBs/sec
2022-07-25T20:24:51.4178143Z Received 511126973 of 532098493 (96.1%), 48.7 MBs/sec
2022-07-25T20:24:52.4191064Z Received 511126973 of 532098493 (96.1%), 44.3 MBs/sec
2022-07-25T20:24:53.4205470Z Received 511126973 of 532098493 (96.1%), 40.6 MBs/sec
2022-07-25T20:24:54.4218795Z Received 511126973 of 532098493 (96.1%), 37.5 MBs/sec
2022-07-25T20:24:55.4232178Z Received 511126973 of 532098493 (96.1%), 34.8 MBs/sec
2022-07-25T20:24:56.4246845Z Received 511126973 of 532098493 (96.1%), 32.5 MBs/sec
< log lines showing 96.1% repeat, with the MBs/sec figure steadily dropping, until the job is terminated as follows >
2022-07-25T20:58:16.9998830Z Received 511126973 of 532098493 (96.1%), 0.2 MBs/sec
2022-07-25T20:58:17.0660707Z ##[error]The operation was canceled.

Here is the code that triggers this cache restoration. It runs inside a matrix job that fans out into 20 concurrent jobs, so it's possible that resource contention is a factor (a rough sketch of that matrix fan-out follows the step below):

      - name: Precompiled Assets Cache
        uses: actions/cache@v3
        id: precompiled-assets-cache
        with:
          path: |
            vendor/assets/components/**
            public/assets/**
            public/packs/**
            tmp/**
          key: precompiled-assets-${{ env.CACHE_VERSION }}-${{ github.ref }}-${{ github.sha }}
          restore-keys: |
            precompiled-assets-${{ env.CACHE_VERSION }}-${{ github.ref }}-
            precompiled-assets-${{ env.CACHE_VERSION }}-refs/heads/main-
            precompiled-assets-${{ env.CACHE_VERSION }}-
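
For context, the matrix fan-out looks roughly like this; the matrix dimension and its values below are purely illustrative (only the fact that about 20 jobs run concurrently is our real setup), and the cache step's inputs are the ones shown above:

jobs:
  precompile:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Illustrative dimension: 20 entries fan out into 20 concurrent jobs,
        # each restoring the same cache entry at the same time.
        shard: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
    steps:
      - uses: actions/checkout@v3
      - name: Precompiled Assets Cache
        uses: actions/cache@v3
        with:
          path: tmp/**            # plus the other paths shown in the step above
          key: precompiled-assets-${{ env.CACHE_VERSION }}-${{ github.ref }}-${{ github.sha }}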

Here is the code that originally cached the data (it is the same actions/cache step; the save happens in that step's post-job phase):

      - name: Precompiled Assets Cache
        uses: actions/cache@v3
        id: precompiled-assets-cache
        with:
          path: |
            vendor/assets/components/**
            public/assets/**
            public/packs/**
            tmp/**
          key: precompiled-assets-${{ env.CACHE_VERSION }}-${{ github.ref }}-${{ github.sha }}
          restore-keys: |
            precompiled-assets-${{ env.CACHE_VERSION }}-${{ github.ref }}-
            precompiled-assets-${{ env.CACHE_VERSION }}-refs/heads/main-
            precompiled-assets-${{ env.CACHE_VERSION }}-

Please let me know if we're doing anything incorrectly. I'd love to learn whether this is happening because of the way we're using the feature, or whether we're perhaps hitting a limit.

Thanks, maintainers, for all that you do! ❤️

tyrannosaurus-becks avatar Jul 28 '22 18:07 tyrannosaurus-becks

This is an issue in the Azure Storage SDK, and we are following up with them.

bishal-pdMSFT avatar Jul 29 '22 09:07 bishal-pdMSFT

Thanks for the update! I will follow that ticket.

tyrannosaurus-becks avatar Jul 29 '22 15:07 tyrannosaurus-becks

Perhaps as a band-aid, could #824 be considered?

asottile-sentry avatar Jul 29 '22 18:07 asottile-sentry

Hi there, 👋🏽

We have released a new actions/cache version, v3.0.7, which no longer leaves a stuck runner hanging for 6 hours; instead, it aborts the download and exits the cache restore step if the download doesn't complete within an hour. Users referencing actions/cache@v3 don't need to change anything and should see this behaviour in their action runs starting today.

This has been done as an interim fix, as we are still waiting for the Azure team to release a fix on their side: https://github.com/Azure/azure-sdk-for-js/issues/22321
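
If anyone prefers to pin an exact release rather than rely on the floating v3 tag picking up the patch, a step like the following works (the step name, path, and key here are placeholders, not anyone's actual config):

      - name: Precompiled Assets Cache
        uses: actions/cache@v3.0.7   # exact release containing the one-hour download abort
        with:
          path: tmp/**                          # placeholder path
          key: precompiled-assets-example-key   # placeholder key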

kotewar avatar Aug 13 '22 05:08 kotewar

@kotewar can that be configured down to, say, ~3 minutes? An hour is still pretty bad.

asottile-sentry avatar Aug 15 '22 13:08 asottile-sentry

@asottile-sentry, we are working on making it user-configurable so that everyone can set their own timeout... Hope that works?

Also we're tracking this issue here, hence closing this one.

Thanks for your patience everyone! 🙇🏽

kotewar avatar Aug 17 '22 11:08 kotewar