Support retry when putting a file to OSS fails with a 'dial tcp: i/o timeout' error

Open carolkao opened this issue 3 years ago • 3 comments

Summary

Sometimes the pod fails because putting main.log to OSS fails. The errors shown in the pod are:

2022-07-18T04:09:55.916Z OSS Save path: /tmp/argo/outputs/logs/main.log, key: 2022/07/18/testhhb9k/testhhb9k-4053391036/main.log
2022-07-18T04:10:25.923Z failed to put file: Put "https://xxx-artifact.oss-us-east-1-internal.aliyuncs.com/2022%2F07%2F18%2Ftesthhb9k%2Ftesthhb9k-4053391036%2Fmain.log": dial tcp: i/o timeout
2022-07-18T04:10:25.923Z executor error: Put \"https://xxx-artifact.oss-us-east-1-internal.aliyuncs.com/2022%2F07%2F18%2Fctesthhb9k%2Ftesthhb9k-4053391036%2Fmain.log\": dial tcp: i/o timeout"

The request hits the timeout setting, which is 30s in the default config. I'd like to request adding errors such as dial tcp: i/o timeout to this whitelist, so that the request can be retried. https://github.com/argoproj/argo-workflows/blob/2d1758fe90fd60b37d0dfccb55c3f79d8a897289/workflow/artifacts/oss/oss.go#L37
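To make the request concrete, here is a minimal sketch of how a transient-error check could also treat dial timeouts as retryable. It is not the code in oss.go; the names `transientMessages`, `matchesTransientMessage`, and `isTransient` are hypothetical, and matching on the message text is only one possible approach.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// transientMessages is a hypothetical list of error-message fragments
// that should be treated as retryable. It is not the list used in oss.go.
var transientMessages = []string{"dial tcp: i/o timeout"}

// matchesTransientMessage reports whether msg contains a retryable fragment.
func matchesTransientMessage(msg string) bool {
	for _, frag := range transientMessages {
		if strings.Contains(msg, frag) {
			return true
		}
	}
	return false
}

// isTransient reports whether err looks like a retryable network timeout.
// It first checks for a net.Error that reports a timeout, then falls back
// to matching the message text.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return matchesTransientMessage(err.Error())
}

func main() {
	// Simulated error; the URL is a placeholder, not a real endpoint.
	timeout := errors.New(`Put "https://example.invalid/main.log": dial tcp: i/o timeout`)
	fmt.Println(isTransient(fmt.Errorf("failed to put file: %w", timeout))) // true
	fmt.Println(isTransient(errors.New("access denied")))                    // false
}
```

Checking `net.Error` first avoids relying on message text where the error type is preserved; the string match is a fallback for errors that have been wrapped into plain messages.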

Use Cases

Retry the request automatically when it encounters a dial tcp: i/o timeout error.
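For illustration, the desired behavior might look like the sketch below: retry the upload with exponential backoff while the error is a dial timeout, and give up immediately on anything else. The helper `retryTransient`, the simulated `errDialTimeout`, and the attempt count and delay are all assumptions for this example, not values taken from Argo Workflows.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errDialTimeout simulates the error from the logs; it is not a real OSS SDK value.
var errDialTimeout = errors.New("dial tcp: i/o timeout")

// retryTransient calls op up to attempts times, doubling delay between tries.
// It stops early on success or on an error that is not a dial timeout.
func retryTransient(attempts int, delay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "dial tcp: i/o timeout") {
			return err // not retryable
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	// Hypothetical upload that times out twice and then succeeds.
	err := retryTransient(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errDialTimeout
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```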

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

Aug 02 '22 07:08 carolkao

Would you like to send a PR and test it out? I'd also make sure your cluster has a stable connection to that OSS address.

Aug 02 '22 15:08 terrytangyuan

Hi @terrytangyuan, thanks for your reply. I believe the connection between our cluster and the OSS address is stable, since the error rarely happens (though it is still inconvenient for us when it does).

Our colleague @jingkkkkai is interested in working on this issue. He'll follow up and send the PR once he's done.

Aug 03 '22 01:08 carolkao

Sounds great.

Aug 03 '22 12:08 terrytangyuan