Add timeouts to http.Server APIs

While fixing the linting for #2973 I noticed there were no timeouts, so I tried adding them. I'm actually unsure if they are really needed since none of these endpoints are exposed directly, right?
I'm also seeing some odd log entries. It looks like it is constantly trying to make requests for the local worker. I see these:
Sep 13 10:44:00 fedora osbuild-worker[666]: time="2022-09-13T10:44:00-07:00" level=error msg="Requesting job failed: error requesting job: 503 — Timeout exceeded (IMAGE-BUILDER-LOCAL-WORKER-1008)"
Sep 13 10:44:00 fedora osbuild-worker[666]: time="2022-09-13T10:44:00-07:00" level=warning msg="Received error from RequestAndRunJob, backing off"
Sep 13 10:44:41 fedora osbuild-worker[666]: time="2022-09-13T10:44:41-07:00" level=error msg="Requesting job failed: error requesting job: 503 — Timeout exceeded (IMAGE-BUILDER-LOCAL-WORKER-1008)"
Sep 13 10:44:41 fedora osbuild-worker[666]: time="2022-09-13T10:44:41-07:00" level=warning msg="Received error from RequestAndRunJob, backing off"
every minute (the timeout is 60 seconds for the workers).
So basically, this PR is for discussion only right now :)
P.S. This is based on #2973, which is why there are so many commits.
Thanks for looking into this. I'm not sure if we need this on this layer though. On-premise - I don't care that much tbh. The service is hosted in OpenShift and we have some global timeouts there on the proxy level - 30 seconds if I'm not mistaken. @Gundersanne knows more.
The errors that you are seeing are caused by workers polling for new jobs indefinitely before this change. That means that composer receives a job request from a worker, but sends a response only once there's a new job in the queue. Since there were no timeouts before, this request would have just hung "forever", which is actually expected and very resource-efficient.
With timeouts implemented, the worker needs to restart these requests regularly, and that is why you see these messages in the log.
> The errors that you are seeing are caused by workers polling for new jobs indefinitely before this change. That means that composer receives a job request from a worker, but sends a response only once there's a new job in the queue.
Ah, ok. I suspected that might be the case.
So I wonder, do we want to add timeouts to everything except the Worker API? I think it still might be valuable.
> So I wonder, do we want to add timeouts to everything except the Worker API? I think it still might be valuable.
I'm undecided :) For the weldr API's local socket, anything misbehaving (on purpose or by accident) is already on the system. And you need a longer timeout for that initial depsolve/metadata download if you keep socket activation.
And I don't know enough about the cloud API to say whether or not they would be useful or what a safe timeout would be -- if it is used behind a proxy then I guess it doesn't matter.
Adding timeouts to all handlers is great. While the service is hosted in a specific way for us, that might not always be the case, or the configuration of proxies and infrastructure might change, with unintended effects. I even think it makes sense to have these timeouts on the worker API: HTTP connections are really not meant for lingering, and retrying a request every minute (or with a longer timeout) is not a big resource hog.
Could we use HTTP 408? It doesn't fit directly but it's a timeout error from the application.
Aside from that I'd like a followup which would make this timeout configurable per handler (is that something one can define in swagger specs?).
> Adding timeouts to all handlers is great. While the service is hosted in a specific way for us, that might not always be the case, or the configuration of proxies and infrastructure might change, with unintended effects. I even think it makes sense to have these timeouts on the worker API: HTTP connections are really not meant for lingering, and retrying a request every minute (or with a longer timeout) is not a big resource hog.
I'm sorry, but I disagree with the lingering, XMPP uses exactly this: https://en.wikipedia.org/wiki/BOSH_(protocol)
(I had to implement this lovely thing in my previous job.)
> Could we use HTTP 408? It doesn't fit directly but it's a timeout error from the application.
If we decide to add retries, I agree with 408.
I'm still torn tbh, I would argue that the APIs can be used in the following scenarios:
- locally - I'm not convinced that timeouts are very useful in these cases - at least for the use case of preventing certain attacks.
- remotely - I think that https is the only way nowadays.
osbuild-composer is not meant to be a TLS termination point. We currently have partial support for this, but I would love to rip it out as soon as possible (it's no longer used). I just don't want us to be responsible for implementing a TLS server; we are not experts in this field, which evolves continuously. There are better projects that do this. This means that osbuild-composer is always meant to be used behind a reverse proxy if exposed on a network. TBH, it now feels like a documentation issue: if you want to run a network instance of osbuild-composer, put it behind a reverse proxy with proper timeouts and TLS settings. I'm happy to merge such a PR.
Hehe, let's not detract from the timeouts with my ideas about lingering HTTP (I was never a fan of BOSH, Comet, or any of the other things we hacked on top of HTTP to get streaming).
Concrete question: if the load balancer terminates the request to the client, or closes the connection to the application, will the goroutine that's handling the request be cancelled immediately, or will it fail later when it tries to write to the (now closed) socket?
This PR is stale because it has been open 30 days with no activity. Remove "Stale" label or comment or this will be closed in 7 days.
This PR was closed because it has been stalled for 30+7 days with no activity.