Handle GitHub API timeouts and retries
Currently, we send requests to the GitHub API without dealing with timeouts and retries in any way. We should make this more robust by adding some automatic retry/timeout mechanism:
- Implement timeouts for the requests. Say, 10s at most, before the request is marked as timeouted.
- Implement retry mechanism for GH API requests. If the request fails due to API rate limiting or a timeout, retry it again, ideally with some exponential backoff. See https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#checking-the-status-of-your-rate-limit and https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#exceeding-the-rate-limit.
- I don't think that this will be needed at first, but in the future we might consider using some queue for requests (https://docs.github.com/en/rest/using-the-rest-api/best-practices-for-using-the-rest-api?apiVersion=2022-11-28) to avoid too many concurrent requests. But I would only implement this if we find some issues in production.
Hi, I would like to work on this.
Hi, for the timeout part, I just need to check if the api call takes more than 10 sec then it needs to cancelled which can be done using tokio::time::timeout.
tokio::time::timeout is a bit low-level. We could use it if needed, but I suppose that reqwest or octocrab should have more fine-grained options for handling timeouts, e.g. read timeouts, connection timeouts, request timeouts, etc.
octocrabbuilder::{set_read_timeout, set_write_timeout} would be fine for this.
Yeah, something like that. Ideally, we should be able to find out from the error if it was actually a timeout though, so we could retry. I remember that octocrab had a bit opaque errors, so if it's not possible to figure it out from the error easily, then we might want to converge to the tokio timeout anyway.
Hi I went through the octocrab's error enum and also hyper-timeout crate which octocrab uses for timeout. So I think we can figure out that the request timed out by checking service variant of the enum. Also check a test of hyper-timeout here.
Uhh, downcast_ref is not exactly what I wanted to see in the codebase 😕 But I guess it's better than adding an additional timeout on top of an already existing timeout. So let's try it.
I think that for now this is good enough with https://github.com/rust-lang/bors/pull/267.