[bug] v4 hangs uploading from mac runners
What happened?
When I upload artifacts at the end of a build on a mac runner (macos-13-xl-arm64), about 1 time in 3, the upload stalls part way through and never completes. The job is eventually cancelled by GHA and my entire workflow is marked cancelled. I cannot restart that job as it is not marked as a failure.
What did you expect to happen?
The artifacts upload to completion and the job finishes without error.
How can we reproduce it?
It is not easily reproducible. The workflows are private and cannot be shared here. I will open a support ticket to share more.
Anything else we need to know?
I have updated multiple workflows from v3 to v4 - all jobs in the workflow are using v4. I have 6 linux build jobs that all upload artifacts at the end. None of them have failed. For the mac builds, two jobs do architecture-specific builds and upload their artifacts. This has never failed. The final mac job takes the architecture-specific binaries and produces universal binaries. It is this job that has failed about 1 time in 3 when uploading the artifacts at the end. This failing job is typically always the last of the jobs to run. It necessarily runs after the two previous mac jobs, and the 6 previous linux jobs are quicker that the mac jobs and complete before.
The size of the failing artifacts is about 1.4G but on one run at least, it hung after logging the first 8MiB chunk.
All uploads for all the above-described jobs have:
compression: 0
retention-days: 1
The path: setting for the linux jobs is a single directory. The path: setting for the working mac jobs is a single directory and a single exclude pattern. The path: setting for the failing jobs is a single directory and three exclude patterns. Prior to uploading the chunks, the count of files to be uploaded is correct.
The overwrite: setting is false for all jobs.
Output from one of the failed/stalled runs:
Run actions/upload-artifact@v4
with:
name: release-mac
compression-level: 0
retention-days: 1
path: build/artifacts
!**/*-unsigned*
!**/*-arm64*
!**/*-amd64*
if-no-files-found: warn
overwrite: false
env:
GOMODCACHE: /tmp/gomodcache
GOCACHE: /tmp/gocache
NODE_VERSION: 18.19.1
pythonLocation: /Users/runner/hostedtoolcache/Python/3.11.7/arm64
PKG_CONFIG_PATH: /Users/runner/hostedtoolcache/Python/3.11.7/arm64/lib/pkgconfig
Python_ROOT_DIR: /Users/runner/hostedtoolcache/Python/3.11.7/arm64
Python2_ROOT_DIR: /Users/runner/hostedtoolcache/Python/3.11.7/arm64
Python3_ROOT_DIR: /Users/runner/hostedtoolcache/Python/3.11.7/arm64
With the provided path, there will be 12 files uploaded
Artifact name is valid!
Root directory input is valid!
Beginning upload of artifact content to blob storage
Uploaded bytes 8388608
That is the end of the output for the step.
What version of the action are you using?
v4.3.1
What are your runner environments?
macos
Are you on GitHub Enterprise Server? If so, what version?
No response
Using the following workflow has reproduced the issue multiple times for me:
name: Reproduce Mac upload-artifact failure
on:
push:
branches:
- camh/repro-mac-upload-artifact-issue
jobs:
test-upload-artifact:
runs-on: macos-13-xl-arm64
steps:
- name: Create files (500MiB)
run: |
dd if=/dev/urandom of=artifact bs=1M count=500
dd if=/dev/zero of=artifact-unsigned bs=1M count=1
- name: upload artifact
uses: actions/upload-artifact@v4
with:
name: urandom
compression-level: 0
retention-days: 1
path: |
artifact
!*-unsigned*
Please note that I was testing this on the branch camh/repro-mac-upload-artifact-issue, hence the workflow push trigger. You will obviously need to change this to whatever branch name you are using, or perhaps merge a workflow_dispatch triggered version and manually trigger it each time.
The purpose of the artifact-unsigned file in this was to reproduce something close to my actual workflow that is failing. I do not know if the excluded path is significant or not, but that is the pattern I am using so kept this reproduction scenario close.
I have run this workflow with a 500MiB test file twice, and twice it has failed (currently it is stalled on the second run and I am waiting for it to time out). I did a number of runs with a 100MiB file and saw one or two failures with about 10 successes. It seems a larger file is easier to trigger this failure.
have the same issue here.
@camscale Does this happen if you use v3 of the action? Have you tried using a different image (macos-14) or an alternative runner like FlyCI?
I encountered this issue twice.
One is on macos-14
one is on FlyCI
FYI
@ligaz I never encountered this problem with v3 of the action. Only after upgrading to v4 did I start to see this issue. I have not tried the macos-14 runners.
I'm experiencing the same issue on the ubuntu runner. I only encounter this problem on v4.
I have just tried the macos-14 arm64 runners (macos-14-xlarge) and the issue is present there too.
Interestingly, I just noticed one of my runs that stalled for 13 minutes, but started again and ran to completion. I don't think I've seen that before - I've had a hang sit there for two hours before I cancelled it.
Wed, 13 Mar 2024 19:44:53 GMT Uploaded bytes 301989888
Wed, 13 Mar 2024 19:58:06 GMT Uploaded bytes 310378496
Wed, 13 Mar 2024 19:58:06 GMT Uploaded bytes 311799972
Wed, 13 Mar 2024 19:58:06 GMT Finished uploading artifact content to blob storage!
I'm seeing sporadic upload failures lately too, on macOS runners only so far, and only for jobs uploading larger artifacts (around 400 MB). The weird thing is that the job shows up as failed, but the upload step is still running and doesn't show any error at all:
E.g., here's a currently running workflow, with 2 upload failures for both macOS arm64 jobs (running on macos-14 runners, but IIRC, macos-12 failed too in the past): https://github.com/ldc-developers/llvm-project/actions/runs/8492339227/job/23265222880
Got the same error with a self-hosted mac runner, consistently exactly 5 minutes after the step started. Reverting to v3 resolved the issue.
Same issue here with v4 version. I don't want to get back to v3, because that works 5 times slower. Any fix expected anytime soon?
I think I can repro this 100% on both v3 and v4. I am using CMake and CPack to build the package. If the DragNDrop generator is used, then the subsequent upload-artifact step fails. I suspect this is because the DS_Store.scpt performs some UI work to set up the window and icon positions.
My project is not yet open source, but it based heavily on cleanCppProject by kracejic, so you may be able to repro using that.
I hope this helps - I have burned up a lot of my credit debugging this issue!
EDIT: This is on a macos-13 runner
I think I can repro this 100% on both v3 and v4.
Then that's a different issue. This issue only started happening with v4. But thanks for the input anyway - it will be useful to others if they hit the same problem as you.
On this issue, for the github-hosted mac runners, I believe this is now fixed. I've been working with GitHub to get this resolved and the last round of changes to the runners has seen the issues resolved in my case - I have not had this hang for a week or two now, that I think I can say it is resolved for me.
Then that's a different issue. This issue only started happening with v4. But thanks for the input anyway - it will be useful to others if they hit the same problem as you.
Thanks for getting back to me. I have been able to create a minimal repo to reproduce this issue and reported it at https://github.com/actions/upload-artifact/issues/573
On this issue, for the github-hosted mac runners, I believe this is now fixed.
I’m afraid I’ve been hit by this again. We have self-hosted Mac runner. We’ve switched to v4 last week and yesterday I’ve hit it again. Reverting back to v3 made the artifact upload successfully. our artifact is a 300MB zip file
Edit/Add exact build versions:
Current runner version: '2.317.0'
Download action repository 'actions/upload-artifact@v4' (SHA:65462800fd760344b1a7b4382951275a0abb4808)