worker
enhance(http): add retry client
This aims to help work around intermittent errors returned by the Vela server by adding a retryable HTTP client created with go-retryablehttp. Do folks think we should override the default retry check to allow retrying on context.DeadlineExceeded errors? I ask because prior connectivity failures like https://github.com/go-vela/community/issues/434 specifically reported deadline exceeded errors.
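For context, here's a minimal sketch of what overriding the default retry check could look like with go-retryablehttp; the package name and the newRetryClient helper are assumptions for illustration, not code from this PR:

```go
package client

import (
	"context"
	"errors"
	"net/http"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// newRetryClient builds a retryable HTTP client that also treats
// context.DeadlineExceeded as retryable, falling back to the
// library's default policy for everything else.
func newRetryClient() *http.Client {
	c := retryablehttp.NewClient()
	c.RetryMax = 3

	c.CheckRetry = func(ctx context.Context, resp *http.Response, err error) (bool, error) {
		// Only retry a deadline error while the parent context is still
		// alive; if ctx itself expired, further attempts cannot succeed.
		if ctx.Err() == nil && err != nil && errors.Is(err, context.DeadlineExceeded) {
			return true, nil
		}
		return retryablehttp.DefaultRetryPolicy(ctx, resp, err)
	}

	// StandardClient wraps the retry logic in a plain *http.Client so
	// callers keep using the standard library API.
	return c.StandardClient()
}
```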
Codecov Report
Merging #280 (15555db) into master (8037fb0) will not change coverage. The diff coverage is n/a.
```
@@           Coverage Diff           @@
##           master     #280   +/-   ##
=======================================
  Coverage   77.73%   77.73%
=======================================
  Files          67       67
  Lines        4868     4868
=======================================
  Hits         3784     3784
  Misses        944      944
  Partials      140      140
```
I'm a bit concerned by this change, especially with regard to log pushes. Since we do those unordered and async, it will become more likely that logs will not be complete, correct?
I think you bring up a really valid argument against the retry logic at the global level, since specific endpoints like log pushes could have unexpected results. Wondering if a decent middle ground would be utilizing the standard HTTP library for log-specific endpoints and the retryable HTTP library for everything else.
I know we do a final sweep to push logs (in the defer), but there again, it would be possible for the last or second-to-last log update to go through some retries while the final sweep has already completed. I have no solid proof, but I think we're (rarely) seeing some of that today, and we might make it worse with this.
I'm thinking the current behavior is more often caused by processing times when a large number of log lines are written near the end of the build. For example, suppose the following happens in a build:
- Build starts
- 100 lines are written in 100 seconds
- 10,000 lines are written in 1 second
- 20 lines are written in 1 second
- Build ends
FWIW, the only builds where we have seen the truncated log lines follow the behavior I described.
> I think you bring up a really valid argument against the retry logic at the global level, since specific endpoints like log pushes could have unexpected results. Wondering if a decent middle ground would be utilizing the standard HTTP library for log-specific endpoints and the retryable HTTP library for everything else.
I can get with that. I know the change won't be as clean as it is currently, but I would feel better about avoiding those potential adverse side effects.
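To make the split concrete, here's a rough sketch under the assumption that the worker could hold two clients (the splitClient type and its field names are hypothetical): a plain client for log endpoints, where retried or duplicated requests could interleave badly with the final sweep, and a go-retryablehttp-backed client for everything else:

```go
package client

import (
	"net/http"
	"time"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// splitClient keeps log pushes on a non-retrying client while all
// other Vela API calls go through the retryable client.
type splitClient struct {
	logs *http.Client // no retries: avoids duplicate or out-of-order log updates
	api  *http.Client // retries transient failures on other endpoints
}

func newSplitClient() *splitClient {
	r := retryablehttp.NewClient()
	r.RetryMax = 3
	r.RetryWaitMin = 500 * time.Millisecond
	r.RetryWaitMax = 5 * time.Second

	return &splitClient{
		logs: &http.Client{Timeout: 30 * time.Second},
		api:  r.StandardClient(),
	}
}
```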
> FWIW, the only builds where we have seen the truncated log lines follow the behavior I described.
Yeah, that's likely true. Notwithstanding network blips and such, that is probably the most common scenario.
Getting back to this PR: what are the thoughts now that byte-chunks are out of the picture? I think the issue should be less pronounced but likely not eliminated, so a separation as @JordanSussman pointed out probably still makes sense?
@wass3r I'm fine with punting this change to specific endpoints in the future. FWIW, we have not seen any major issues without this change since the byte-chunks behavior was changed.
@JordanSussman hey, we're going to close this PR as stale, but we definitely want better timeout resiliency, so I'm going to create a separate issue for it and link to this PR as a great starting place: https://github.com/go-vela/community/issues/863