
EOF when many steps using same big input artifact

Open tooptoop4 opened this issue 2 years ago • 8 comments

I have a 1.2 GB artifact on S3.

Intermittently, some of the tasks that use the same artifact as input get one of the errors below:


Error (exit code 1): tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

or

Error (exit code 1): gzip: write: Out of memory
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

The logs also mention the max duration limit being exceeded:

time="2022-09-06T02:08:48.942Z" level=info msg="Processing workflow" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.945Z" level=info msg="Task-result reconciliation" namespace=auth numObjs=3 workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.945Z" level=info msg="node changed" new.message="Error (exit code 1): tar: Unexpected EOF in archive\ntar: Unexpected EOF in archive\ntar: Error is not recoverable: exiting now" new.phase=Error new.progress=0/1 nodeID=bitbucket-workflow-20220906t12032810007frlh-3794695805 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-09-06T02:08:48.945Z" level=info msg="node unchanged" nodeID=bitbucket-workflow-20220906t12032810007frlh-2620825050
time="2022-09-06T02:08:48.945Z" level=info msg="node changed" new.message="Error (exit code 1): tar: Unexpected EOF in archive\ntar: Unexpected EOF in archive\ntar: Error is not recoverable: exiting now" new.phase=Error new.progress=0/1 nodeID=bitbucket-workflow-20220906t12032810007frlh-217183810 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-09-06T02:08:48.945Z" level=info msg="node changed" new.message= new.phase=Running new.progress=0/1 nodeID=bitbucket-workflow-20220906t12032810007frlh-1085538644 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-09-06T02:08:48.945Z" level=info msg="node changed" new.message="Error (exit code 1): gzip: write: Out of memory\ntar: Unexpected EOF in archive\ntar: Unexpected EOF in archive\ntar: Error is not recoverable: exiting now" new.phase=Error new.progress=0/1 nodeID=bitbucket-workflow-20220906t12032810007frlh-3703802753 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-09-06T02:08:48.945Z" level=info msg="node unchanged" nodeID=bitbucket-workflow-20220906t12032810007frlh-2444850475
time="2022-09-06T02:08:48.945Z" level=info msg="node unchanged" nodeID=bitbucket-workflow-20220906t12032810007frlh-2697090281
time="2022-09-06T02:08:48.945Z" level=info msg="node changed" new.message= new.phase=Running new.progress=0/1 nodeID=bitbucket-workflow-20220906t12032810007frlh-1264146474 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-09-06T02:08:48.946Z" level=info msg="SG Outbound nodes of bitbucket-workflow-20220906t12032810007frlh-3571636647 are [bitbucket-workflow-20220906t12032810007frlh-2620825050]" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.946Z" level=info msg="SG Outbound nodes of bitbucket-workflow-20220906t12032810007frlh-3332293106 are [bitbucket-workflow-20220906t12032810007frlh-2697090281]" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.946Z" level=info msg="SG Outbound nodes of bitbucket-workflow-20220906t12032810007frlh-213337688 are [bitbucket-workflow-20220906t12032810007frlh-2444850475]" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.947Z" level=info msg="Max duration limit exceeded. Failing..." namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.947Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-610152142 phase Running -> Error" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.947Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-610152142 message: Max duration limit exceeded" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.947Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-610152142 finished: 2022-09-06 02:08:48.947745613 +0000 UTC" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.948Z" level=info msg="Max duration limit exceeded. Failing..." namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.948Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-2763505215 phase Running -> Error" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.948Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-2763505215 message: Max duration limit exceeded" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.948Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-2763505215 finished: 2022-09-06 02:08:48.948248846 +0000 UTC" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.949Z" level=info msg="Max duration limit exceeded. Failing..." namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.949Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-3681608794 phase Running -> Error" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.949Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-3681608794 message: Max duration limit exceeded" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh
time="2022-09-06T02:08:48.949Z" level=info msg="node bitbucket-workflow-20220906t12032810007frlh-3681608794 finished: 2022-09-06 02:08:48.949245597 +0000 UTC" namespace=auth workflow=bitbucket-workflow-20220906t12032810007frlh

The 'init' container logs from another run:

time="2022-09-06T02:52:23.726Z" level=info msg="Start loading input artifacts..."
time="2022-09-06T02:52:23.726Z" level=info msg="Downloading artifact: repo"
time="2022-09-06T02:52:23.726Z" level=info msg="S3 Load path: /argo/inputs/artifacts/repo.tmp, key: argo_wf_logs/2022/09/06/02/49/bitbucket-workflow-20220906t12490510007g2tv/bitbucket-workflow-20220906t12490510007g2tv-3635102771/repo.tgz"
time="2022-09-06T02:52:23.726Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2022-09-06T02:52:25.220Z" level=info msg="Getting file from s3" bucket=myredactbucket endpoint=s3.amazonaws.com key=argo_wf_logs/2022/09/06/02/49/bitbucket-workflow-20220906t12490510007g2tv/bitbucket-workflow-20220906t12490510007g2tv-3635102771/repo.tgz path=/argo/inputs/artifacts/repo.tmp
time="2022-09-06T02:52:38.714Z" level=info msg="Detecting if /argo/inputs/artifacts/repo.tmp is a tarball"
time="2022-09-06T02:52:38.714Z" level=info msg="tar -xf /argo/inputs/artifacts/repo.tmp -C /argo/inputs/artifacts/repo.tmpdir"
time="2022-09-06T02:52:49.455Z" level=error msg="`tar -xf /argo/inputs/artifacts/repo.tmp -C /argo/inputs/artifacts/repo.tmpdir` failed: gzip: write: Out of memory\ntar: Unexpected EOF in archive\ntar: Unexpected EOF in archive\ntar: Error is not recoverable: exiting now\n"
time="2022-09-06T02:52:49.601Z" level=error msg="executor error: gzip: write: Out of memory\ntar: Unexpected EOF in archive\ntar: Unexpected EOF in archive\ntar: Error is not recoverable: exiting now"
time="2022-09-06T02:52:49.601Z" level=info msg="Alloc=9938 TotalAlloc=18507 Sys=29138 NumGC=5 Goroutines=4"
time="2022-09-06T02:52:49.601Z" level=fatal msg="gzip: write: Out of memory\ntar: Unexpected EOF in archive\ntar: Unexpected EOF in archive\ntar: Error is not recoverable: exiting now"
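
One way to tell a corrupted artifact apart from an init container that ran out of memory is to download the artifact and read it end to end on a machine with ample memory. The following is a minimal sketch under that assumption and is not part of Argo: `tarball_is_intact` is a hypothetical helper name, and it relies only on Python's standard `tarfile` module. If the archive reads cleanly, the `Unexpected EOF` in the pod more likely comes from extraction being cut short (for example by the `gzip: write: Out of memory` in the log above) than from the object stored in S3.

```python
import tarfile
import zlib


def tarball_is_intact(path: str) -> bool:
    """Return True if the gzip-compressed tar at `path` can be read to the end.

    Hypothetical helper for local diagnosis; it does not extract anything to disk.
    """
    try:
        # "r:gz" reads through gzip.GzipFile, which raises EOFError when the
        # compressed stream ends before its end-of-stream marker.
        with tarfile.open(path, mode="r:gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                extracted = archive.extractfile(member)
                # Drain each regular file in 1 MiB chunks so memory use stays small.
                while extracted.read(1024 * 1024):
                    pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        # Truncated, corrupt, or not a gzip/tar file at all.
        return False
```

Calling `tarball_is_intact("repo.tgz")` on a local copy of the artifact returns `False` for a truncated download and `True` when every member can be read in full.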

It's a fan-out workflow: [image]

Version: 3.3.9

tooptoop4 avatar Sep 06 '22 02:09 tooptoop4

Have you used PodSpecPatch to increase the resource request?

hbrewster-splunk avatar Sep 06 '22 03:09 hbrewster-splunk

@hbrewster-splunk That worked, but I think the docs should be updated to mention changing the resources of the init container on steps that use a big input artifact:

                  - name: mystep
                    podSpecPatch: '{"initContainers":[{"name":"init", "resources":{"requests":{"memory": "2Gi", "cpu": "300m" },"limits":{"memory": "3Gi", "cpu": "900m" }}}]}'
                    inputs:
                      artifacts:
                        - name: repo
                          path: /repo
                    container:
                      image: blabla
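
Because `podSpecPatch` is a JSON string embedded in the YAML manifest, quoting mistakes are easy to make when writing it by hand. As a hedged sketch (the function name `init_container_patch` is made up for illustration and is not an Argo API), the string can be generated with `json.dumps` and pasted into the template. The container name `init` and the resource values are taken from the snippet above.

```python
import json


def init_container_patch(mem_request: str, mem_limit: str,
                         cpu_request: str, cpu_limit: str) -> str:
    """Build a podSpecPatch string that overrides resources on the `init` container.

    Hypothetical helper: only the JSON layout mirrors the snippet in this thread.
    """
    patch = {
        "initContainers": [
            {
                "name": "init",  # the container that downloads and untars input artifacts
                "resources": {
                    "requests": {"memory": mem_request, "cpu": cpu_request},
                    "limits": {"memory": mem_limit, "cpu": cpu_limit},
                },
            }
        ]
    }
    return json.dumps(patch)


if __name__ == "__main__":
    # Same values as in the comment above; wrap the output in single quotes in the YAML.
    print(init_container_patch("2Gi", "3Gi", "300m", "900m"))
```

The sizes that are actually needed depend on the artifact; the values here are the ones reported to work for the 1.2 GB input in this issue.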

tooptoop4 avatar Sep 06 '22 06:09 tooptoop4

Feel free to submit a PR to improve the docs.

terrytangyuan avatar Sep 07 '22 19:09 terrytangyuan

Hi @terrytangyuan, I am new to open source and want to work on this issue. Can you assign it to me?

gkum99 avatar Sep 09 '22 14:09 gkum99

@2022H1030014G You can raise a PR to fix this even without being assigned.

tooptoop4 avatar Sep 10 '22 16:09 tooptoop4

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Oct 01 '22 06:10 stale[bot]

> Hi @terrytangyuan, I am new to open source and want to work on this issue. Can you assign it to me?

@2022H1030014G Are you still working on it?

juliusvonkohout avatar Oct 06 '22 17:10 juliusvonkohout

> > Hi @terrytangyuan, I am new to open source and want to work on this issue. Can you assign it to me?
>
> @2022H1030014G Are you still working on it?

If there's no PR, you can just start working on it.

terrytangyuan avatar Oct 06 '22 17:10 terrytangyuan