Retry API request when DNS resolution of sentry.io fails
Problem Statement
This is a follow-up to #2177: API request failed caused by: [6] Couldn't resolve host name (Could not resolve host: sentry.io)
We are seeing this when uploading symbol files from Linux hosts, with a failure rate of <10% (4 of the previous 50 invocations, at time of writing).
I suspect it is caused by unreliable .io TLD DNS servers, somehow amplified by some part of the DNS stack on Linux.
If I'm grokking the retry logic correctly, it only retries on certain HTTP status codes and does not cover failures from other parts of the stack:
const RETRY_STATUS_CODES: &[u32] = &[
    http::HTTP_STATUS_502_BAD_GATEWAY,
    http::HTTP_STATUS_503_SERVICE_UNAVAILABLE,
    http::HTTP_STATUS_504_GATEWAY_TIMEOUT,
    http::HTTP_STATUS_507_INSUFFICIENT_STORAGE,
    http::HTTP_STATUS_524_CLOUDFLARE_TIMEOUT,
];

// ...

pub fn send(mut self) -> ApiResult<ApiResponse> {
    // -- snip --
    loop {
        let mut out = vec![];
        debug!("retry number {retry_number}, max retries: {max_retries}",);
        let mut rv = self.send_into(&mut out)?;
        if retry_number >= max_retries || !RETRY_STATUS_CODES.contains(&rv.status) {
            rv.body = Some(out);
            return Ok(rv);
        }
        // -- snip --
    }
}
Implementing retries for DNS resolution failures should alleviate this issue.
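To make the proposal concrete, here is a rough sketch (not the actual sentry-cli code) of how the body of the loop quoted above could inspect transport errors instead of propagating them straight away with `?`. The `is_dns_resolution_error` helper is hypothetical; everything else reuses names from the snippet:

match self.send_into(&mut out) {
    Ok(mut rv) => {
        if retry_number >= max_retries || !RETRY_STATUS_CODES.contains(&rv.status) {
            rv.body = Some(out);
            return Ok(rv);
        }
        // Retryable status code: fall through and try again.
    }
    // Hypothetical check: retry when the transport error is a DNS failure.
    Err(err) if retry_number < max_retries && is_dns_resolution_error(&err) => {}
    // Any other error is still propagated to the caller immediately.
    Err(err) => return Err(err),
}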
Solution Brainstorm
No response
@yangskyboxlabs, this sounds like a good idea; I will place the issue on our backlog.
Implementation note
Seems like we can use this function to check whether the error is a DNS resolution error. We would need to downcast the APIError (via the source field) to the curl Error type.
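A minimal sketch of that check, assuming the accessor in question is the curl crate's Error::is_couldnt_resolve_host() (curl error code 6, matching the "[6] Couldn't resolve host name" message above) and that the curl error is reachable through the standard Error::source() chain:

use std::error::Error;

// Walk the error's source chain and report whether the root cause is
// curl's "couldn't resolve host" failure (CURLE_COULDNT_RESOLVE_HOST, code 6).
fn is_dns_resolution_error(err: &(dyn Error + 'static)) -> bool {
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        if let Some(curl_err) = e.downcast_ref::<curl::Error>() {
            return curl_err.is_couldnt_resolve_host();
        }
        current = e.source();
    }
    false
}

The send loop could then consult this check alongside RETRY_STATUS_CODES when deciding whether another attempt is worthwhile.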
Just thought I'd add that we're also seeing this for around 5-10% of builds when artifacts are uploaded using @sentry/vite-plugin. The servers that experience this otherwise have no issues with DNS resolution, and they use Cloudflare's 1.1.1.1 DNS servers, so they're probably pretty reliable.
We have the same issue; our builds sometimes fail at random due to DNS resolution errors...
Thanks for letting us know; I am increasing this issue's priority on our internal backlog.