river icon indicating copy to clipboard operation
river copied to clipboard

Job timeout applied after middleware

Open bgentry opened this issue 10 months ago • 3 comments

While reading through the executor for another change, I realized that as of #632 / #584 we're applying the job timeout after stepping through all the middleware:

https://github.com/riverqueue/river/blob/fc667eb33b2427f28ca335f5287610edaa2b7e93/job_executor.go#L209-L227

I'm not sure if this was an intentional decision and I couldn't find any conversations where it was mentioned, so I thought we should at least talk about it. Do you think this is the right place for it, or should the timeout happen before the middleware stack gets called? I was thinking of a worst case scenario where a middleware got stuck or took awhile on a blocking operation and the job ended up running far longer than either the job's timeout or the client's default timeout.

bgentry avatar Feb 10 '25 22:02 bgentry

Hmm, so I didn't consider it one way or the other super deeply admittedly, but including the middleware in the timeout doesn't seem super obviously more or less correct to me.

An argument could be made that that when setting work timeouts, users are trying to set a timeout specifically on their code as opposed to River's internal code, which to their eye would be an implementation detail. So similarly how the timeout doesn't include the time it takes to complete a job, it doesn't include middleware, which is largely made up of other internal stuff.

brandur avatar Feb 12 '25 02:02 brandur

The risk I see is that as it currently is, the middleware have no timeout whatsoever applied to them. Beyond that I don't think it matters a lot either way, but I do worry about the risk of a middleware getting stuck and the job just running forever despite whatever timeouts are configured.

bgentry avatar Feb 12 '25 03:02 bgentry

Yeah I suppose that could be a danger depending on what you're trying to do in the middleware. Probably a separate timeout like you'd find in http.Transport (each phase of a request can be configured separately) would be as defensible as mixing in the same timeout being used for the job's work phase.

brandur avatar Feb 14 '25 21:02 brandur