
Add job queue system to generate books

Open samwilson opened this issue 1 year ago • 6 comments

Use the Symfony Messenger component to pass ebook generation requests from the web front-end to a server-processed queue.

Bug: T345406
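
Very roughly, the wiring could look something like this (the message and handler names below are only illustrative, not final):

```php
// src/Message/BookExportRequest.php (illustrative name)
namespace App\Message;

class BookExportRequest
{
    public function __construct(
        public readonly string $title,
        public readonly string $lang,
        public readonly string $format
    ) {
    }
}
```

```php
// src/MessageHandler/BookExportRequestHandler.php (illustrative name)
namespace App\MessageHandler;

use App\Message\BookExportRequest;
use Symfony\Component\Messenger\Attribute\AsMessageHandler;

#[AsMessageHandler]
class BookExportRequestHandler
{
    public function __invoke(BookExportRequest $message): void
    {
        // Generate the requested ebook here; a worker started with
        // `messenger:consume` picks these messages off the queue.
    }
}
```

The web front-end would then dispatch a `BookExportRequest` via the `MessageBusInterface` instead of generating the file inside the web request.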

samwilson avatar Dec 18 '23 06:12 samwilson

Thank you!

I am afraid this patch will encounter a problem: if the messenger system works as a queue (first in, first out) and the submission rate is faster than the processing rate, the queue will grow without bound and the workers will end up processing only old requests for which a 500 error has already been returned.

Some ideas to mitigate that:

  • drop jobs if older than x seconds
  • move to a stack (last in, first out). However, this favours a user who spams the server with requests for the same book...
  • prioritise cheap jobs: we might choose to process PDFs only once all EPUB requests have been handled, so that the cheap operations always get done.

We also have a load-balancing problem: EPUB jobs are likely network-bound and PDF jobs CPU-bound, so sharing one worker pool may waste CPU when we process mostly EPUBs, or overload the server when we process mostly PDFs.

Tpt avatar Dec 18 '23 20:12 Tpt

Good points, thanks!

We could have two queues, one for Epub and one for PDF (or all derived formats).

But that doesn't directly help with the problem of not processing jobs fast enough. Dropping jobs that were requested more than 30 seconds ago sounds like it might be the best option. I guess this is ultimately the crux of the issue: we don't have the resources to attend to every request, but at the moment we don't have a way of failing before running out of resources. If we drop old requests, many users would get generation-failed errors, but the service would stay up and serve as many as it can. (And if we want to serve more, we'd add separate servers.)

Also, it seems like ideally we'd have a unique-job structure, where it wouldn't matter if someone spams requests: if they're for the same book+options, it'd just reuse the existing job. That can't work unless we store the generated book, so perhaps that should be a subsequent change? Or am I thinking about this wrongly?

Anyway, I'll start by a) using two queues and b) dropping jobs older than 30 seconds.
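
For (a), the EPUB and PDF messages could be routed to two Messenger transports in `messenger.yaml`. For (b), one simple approach is to have the handler check each message's age and skip anything stale. Just a sketch: the `requestedAt` property and the 30-second cutoff below are assumptions, not existing code.

```php
namespace App\MessageHandler;

use App\Message\BookExportRequest;
use Symfony\Component\Messenger\Attribute\AsMessageHandler;

#[AsMessageHandler]
class BookExportRequestHandler
{
    // Jobs older than this are dropped; 30s matches the front-end timeout.
    private const MAX_AGE_SECONDS = 30;

    public function __invoke(BookExportRequest $message): void
    {
        // Sketch: assumes the message carries the Unix timestamp of the
        // original web request in a requestedAt property.
        if (time() - $message->requestedAt > self::MAX_AGE_SECONDS) {
            // The web request has almost certainly returned a 500 by now,
            // so don't spend a worker on it.
            return;
        }

        // ...generate the book...
    }
}
```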

samwilson avatar Dec 19 '23 04:12 samwilson

Dropping jobs that were requested more than 30 seconds ago sounds like it might be the best option

What about the case where we have a lot of jobs in the queue, so we only ever process jobs that are just short of the 30-second TTL? The time before a 500 error is returned is also ~30s, so we would only be processing jobs for which a 500 has very likely already been returned before the processing finishes.

Instead of a complex queue, what about:

  • a rate limiter that rejects requests if more than X requests are currently being processed. This would let us either fail quickly or have a very high chance of the conversion succeeding. For that we would need a global atomic counter; I believe Memcache::increment and Memcache::decrement might help with that. In a follow-up, we can refine the counters by allowing a smaller number of concurrent PDFs than EPUBs.
  • a cache system. When converting a file we save it to disk; when another request comes in for the same file we return the on-disk copy if it exists and "touch" it. A cron job removes files that are too old and limits the total number of files. We also provide a way to purge the cache for editors who want to test their changes (sketched below).
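
For the cache part, something along these lines; the directory, key scheme, and parameter names below are just placeholders:

```php
/**
 * Return the path of a cached export, or null if it has to be generated.
 * Sketch only: the cache directory and key scheme are assumptions.
 */
function findCachedExport(string $title, string $lang, string $format, string $fontFamily): ?string
{
    $cacheDir = '/var/cache/ws-export';
    $cachePath = $cacheDir . '/' . sha1("$title|$lang|$format|$fontFamily") . '.' . $format;

    if (is_file($cachePath)) {
        // Refresh the mtime so the cleanup cron treats this file as recently used.
        touch($cachePath);
        return $cachePath;
    }

    return null;
}
```

The cleanup cron would then simply delete files whose mtime is older than some threshold, and the editors' purge action would delete the single cached path for their book.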

Sorry for not coming up with this suggestion sooner. What do you think about it?

Tpt avatar Dec 19 '23 09:12 Tpt

That does sound simpler. :-)

The counter could use the existing cache service, couldn't it? Am I understanding it correctly that it'll have to a) increment when we start a conversion; b) throw an error when the count is over, say, 4; and c) decrement after creating the file?

samwilson avatar Dec 20 '23 08:12 samwilson

The counter could use the existing cache service, couldn't it? Am I understanding it correctly that it'll have to a) increment when we start a conversion; b) throw an error when the count is over, say, 4; and c) decrement after creating the file?

Yes! A tricky thing is making sure the counter is properly decremented if the file-creation process fails (even with things like out-of-memory errors), to avoid leaving "zombies" in the counter.
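
Something like this (the key name, the limit of 4, and the Memcached client below are placeholders) covers normal exceptions, though a hard crash that kills the PHP process would still leak a slot:

```php
$memcached = new \Memcached();
$memcached->addServer('127.0.0.1', 11211);

$key = 'ws-export-active-conversions';
$memcached->add($key, 0);               // no-op if the key already exists
$active = $memcached->increment($key);

if ($active > 4) {
    $memcached->decrement($key);
    throw new \RuntimeException('Too many conversions in progress; please retry shortly.');
}

try {
    // ...generate the book...
} finally {
    // Runs on success and on exceptions; a fatal error that kills the
    // process (e.g. a real out-of-memory) would still leave a zombie,
    // which is where an expiry or timestamp fallback could help.
    $memcached->decrement($key);
}
```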

Tpt avatar Dec 20 '23 08:12 Tpt

Cool. I'm working on a patch. :)

Do you think it'd be good to record a timestamp for each increment of the counter? That way we could tell how long ago each slot was taken, and if it was too long ago it could be disregarded and reused. For example, a cache item called convert_processes or something, holding a limited-length array where each value is a timestamp.
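
Something like this, maybe. Just a sketch against a PSR-6 cache pool: the item name, slot count, and max age are placeholders, and the read-modify-write here isn't atomic, so it would need a lock (or an atomic backend) in practice.

```php
use Psr\Cache\CacheItemPoolInterface;

/**
 * Try to claim a conversion slot; returns false if all slots are busy.
 * Sketch only: not atomic as written, so it would need locking in practice.
 */
function claimConversionSlot(CacheItemPoolInterface $cache, int $maxSlots = 4, int $maxAge = 300): bool
{
    $item = $cache->getItem('convert_processes');
    $slots = $item->get() ?? [];

    // Drop timestamps older than $maxAge seconds: those conversions are
    // presumed dead, so their slots can be reused.
    $slots = array_values(array_filter(
        $slots,
        fn (int $startedAt): bool => time() - $startedAt < $maxAge
    ));

    if (count($slots) >= $maxSlots) {
        return false;
    }

    $slots[] = time();
    $item->set($slots);
    $cache->save($item);

    return true;
}
```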

samwilson avatar Dec 20 '23 10:12 samwilson