Support retry when putting a file to OSS encounters a 'dial tcp: i/o timeout' error
Summary
Sometimes a pod fails because uploading main.log to OSS fails. The errors shown in the pod are:
2022-07-18T04:09:55.916Z OSS Save path: /tmp/argo/outputs/logs/main.log, key: 2022/07/18/testhhb9k/testhhb9k-4053391036/main.log
2022-07-18T04:10:25.923Z failed to put file: Put "https://xxx-artifact.oss-us-east-1-internal.aliyuncs.com/2022%2F07%2F18%2Ftesthhb9k%2Ftesthhb9k-4053391036%2Fmain.log": dial tcp: i/o timeout
2022-07-18T04:10:25.923Z executor error: Put \"https://xxx-artifact.oss-us-east-1-internal.aliyuncs.com/2022%2F07%2F18%2Fctesthhb9k%2Ftesthhb9k-4053391036%2Fmain.log\": dial tcp: i/o timeout"
The failure occurs 30s after the upload starts, which matches the 30s timeout in the default config. I'd like to request adding errors like dial tcp: i/o timeout to this whitelist, so that the request can be retried:
https://github.com/argoproj/argo-workflows/blob/2d1758fe90fd60b37d0dfccb55c3f79d8a897289/workflow/artifacts/oss/oss.go#L37
Use Cases
Requests are retried automatically when they encounter a dial tcp: i/o timeout error.
Message from the maintainers:
Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.
Would you like to send a PR and test it out? I'd also make sure your cluster has a stable connection to that OSS address.
Hi @terrytangyuan, thanks for your reply. I believe the connection between our cluster and the OSS address is stable, since the error rarely happens. It is still inconvenient for us when it does, though.
Our colleague @jingkkkkai is interested in working on this issue. He'll follow up and send a PR once he's done.
Sounds great.