worker
enhance(http): add retry client
This aims to help work around intermittent errors returned by the Vela server by adding a retryable HTTP client created with go-retryablehttp. Do folks think we should override the default retry check to allow retrying on context.DeadlineExceeded errors? I ask because prior connectivity failures like https://github.com/go-vela/community/issues/434 specifically reported deadline exceeded errors.
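For context, here's a minimal sketch of what overriding the default retry check could look like with go-retryablehttp; the package name and the newRetryClient helper are assumptions for illustration, not code from this PR:

```go
package client

import (
	"context"
	"errors"
	"net/http"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// newRetryClient builds a retryable HTTP client that also treats
// context.DeadlineExceeded as retryable, falling back to the
// library's default policy for everything else.
func newRetryClient() *http.Client {
	c := retryablehttp.NewClient()
	c.RetryMax = 3

	c.CheckRetry = func(ctx context.Context, resp *http.Response, err error) (bool, error) {
		// Only retry a deadline error while the parent context is still
		// alive; if ctx itself expired, further attempts cannot succeed.
		if ctx.Err() == nil && err != nil && errors.Is(err, context.DeadlineExceeded) {
			return true, nil
		}
		return retryablehttp.DefaultRetryPolicy(ctx, resp, err)
	}

	// StandardClient wraps the retry logic in a plain *http.Client so
	// callers keep using the standard library API.
	return c.StandardClient()
}
```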
Codecov Report
Merging #280 (15555db) into master (8037fb0) will not change coverage. The diff coverage is n/a.
```
@@           Coverage Diff           @@
##           master     #280   +/-   ##
=======================================
  Coverage   77.73%   77.73%
=======================================
  Files          67       67
  Lines        4868     4868
=======================================
  Hits         3784     3784
  Misses        944      944
  Partials      140      140
```
I'm a bit concerned by this change, especially with regard to log pushes. Since we do those unordered and async, it will become more likely that logs will not be complete, correct?
I think you bring up a really valid argument against the retry logic at the global level, since specific endpoints like log pushes could have unexpected results. Wondering if a decent middle ground would be utilizing the standard HTTP library for log-specific endpoints and the retryable HTTP library for everything else.
I know we do a final sweep to push logs (in the defer), but there again, it would be possible for the last or second-to-last log update to go through some retries while the final sweep has already completed. I have no solid proof, but I think we're (rarely) seeing some of that today, and we might make it worse with this.
I'm thinking the current behavior is more often caused by processing times when a large number of log lines are written near the end of the build. For example, suppose the following happens in a build:
- Build starts
- 100 lines are written in 100 seconds
- 10,000 lines are written in 1 second
- 20 lines are written in 1 second
- Build ends
FWIW, the only builds where we have seen the truncated log lines follow the behavior I described.
> I think you bring up a really valid argument against the retry logic at the global level, since specific endpoints like log pushes could have unexpected results. Wondering if a decent middle ground would be utilizing the standard HTTP library for log-specific endpoints and the retryable HTTP library for everything else.
I can get with that. I know the change won't be as clean as it is currently, but I would feel better about avoiding those potential adverse side effects.
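To make the split concrete, here's a rough sketch under the assumption that the worker could hold two clients (the splitClient type and its field names are hypothetical): a plain client for log endpoints, where retried or duplicated requests could interleave badly with the final sweep, and a go-retryablehttp-backed client for everything else:

```go
package client

import (
	"net/http"
	"time"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// splitClient keeps log pushes on a non-retrying client while all
// other Vela API calls go through the retryable client.
type splitClient struct {
	logs *http.Client // no retries: avoids duplicate or out-of-order log updates
	api  *http.Client // retries transient failures on other endpoints
}

func newSplitClient() *splitClient {
	r := retryablehttp.NewClient()
	r.RetryMax = 3
	r.RetryWaitMin = 500 * time.Millisecond
	r.RetryWaitMax = 5 * time.Second

	return &splitClient{
		logs: &http.Client{Timeout: 30 * time.Second},
		api:  r.StandardClient(),
	}
}
```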
> FWIW, the only builds where we have seen the truncated log lines follow the behavior I described.
Yeah, that's likely true. Notwithstanding network blips and such, that is probably the most common scenario.
Getting back to this PR: what are the thoughts now that byte-chunks are out of the picture? I think the issue should be less pronounced but likely not eliminated, so a separation as @JordanSussman pointed out probably still makes sense?
@wass3r I'm fine with punting this change to specific endpoints in the future. FWIW, we have not seen any major issues without this change since the byte-chunks behavior was changed.
@JordanSussman hey, we're going to close this PR as stale, but we definitely want better timeout resiliency, so I'm going to create a separate issue for it and link to this PR as a great starting place: https://github.com/go-vela/community/issues/863