[IMP] queue_job: HA job runner using session level advisory lock
Another attempt.
Hi @guewen, some modules you are maintaining are being modified, check this out!
Yep, this should work.
@PCatinean do you know who we should ping in the odoo.sh team to get an opinion on this approach?
Hi @sbidoul, the only two people I know around this topic are @amigrave, who gave the initial feedback on the advisory lock MR here (https://github.com/OCA/queue/pull/256), and @sts-odoo, who also provided some feedback on the pg_application_name approach.
@amigrave @sts-odoo so the TL;DR here is that we have one long-lived connection to the database on which we take a session-level advisory lock and do a LISTEN. There is no long-lived transaction, so this should not impact replication.
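In rough code terms, the pattern is something like this minimal psycopg2 sketch (the lock key and the details of the loop are illustrative, not copied from the module):

```python
# Minimal sketch of the mechanism described above, assuming psycopg2.
# One long-lived autocommit connection takes a session-level advisory
# lock (held until the connection closes) and LISTENs for notifications.
# The lock key is illustrative; the real module derives its own.
import select

import psycopg2

conn = psycopg2.connect("dbname=odoo")
conn.set_session(autocommit=True)  # no long-lived transaction

with conn.cursor() as cr:
    cr.execute("SELECT pg_try_advisory_lock(%s)", (123456789,))
    if not cr.fetchone()[0]:
        # A real runner would keep waiting to take over as leader,
        # not exit; this sketch just illustrates the election.
        raise SystemExit("another job runner holds the lock")
    cr.execute("LISTEN queue_job")

while True:
    # Block until PostgreSQL notifies us (or a 60s keepalive timeout).
    if select.select([conn], [], [], 60.0)[0]:
        conn.poll()
        while conn.notifies:
            notification = conn.notifies.pop()
            print("job notification:", notification.payload)
```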
I plan to deploy this on an odoo.sh dev env soon to see how it goes. I can PM you the details if you wish to monitor something.
There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days. If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.
> I plan to deploy this on an odoo.sh dev env soon to see how it goes. I can PM you the details if you wish to monitor something.
@sbidoul any feedback?
Feedback given in https://github.com/OCA/queue/pull/673#issuecomment-2522520503.
And rebased.
Sorry, why is this not merged yet?
@simahawk @sbidoul,
I'd like to run this on my staging and production GKE cluster. I would especially like to test the scaling capabilities of this in my staging environment. If I deploy this to my staging env, would either of you like the keys to my staging env and the GKE staging cluster to kick the tires and load test this with K6 or similar tools?
I would love to see this merged, and would be happy to run this in production after some load testing in staging and report back on results or allow you to monitor.
I can reach out to you via email to get this going through your company's official channels if this is something you'd like to explore.
Hi everyone. This is not merged precisely because we would like more feedback from actual deployments. Tests are ongoing at Acsone, and I would encourage others to do the same.
> Hi everyone. This is not merged precisely because we would like more feedback from actual deployments. Tests are ongoing at Acsone, and I would encourage others to do the same.
Thanks. I'll get this into staging and then production and report back with findings.
@sbidoul extremely silly question from my end, but is it safe to just pull the changes in this branch and upgrade if we're just running on the vanilla OCA/queue modules in 16.0?
I'm pulling this into our staging environment now, but if all goes well I do plan to run this in production over a few weeks and report back if I encounter issues.
Code LGTM. I'm going to install it on one of my projects and battle test it.
It has been deployed in production for almost 3 weeks and I haven't had any issues to report.
> is it safe to just pull the changes in this branch and upgrade if we're just running on the vanilla OCA/queue modules in 16.0?
@luke-stdev001 yes, it should be safe. I just rebased.
Thank you.
https://github.com/OCA/queue/issues/422#issuecomment-2628351937
Yes, same config on all instances will work.
@sbidoul,
Seems to be working well from initial load testing in staging, thank you.
I'd like to rearchitect our GKE-based HA deployment of Odoo:
- User-facing app instances: server-wide modules without queue_job, with HTTP workers
- A larger, single, vertically scaled instance for cron workers, no HTTP workers (our current architecture runs queue jobs on this same server for the same reasons)
- Queue-job-only instances: server-wide modules with queue_job, no HTTP workers, on a dedicated node pool so queue instances scale out separately from user-facing instances
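Concretely, a rough sketch of the three odoo.conf profiles I have in mind (the file names, module lists, and worker counts below are illustrative guesses on my part, not something prescribed by this PR):

```ini
# user-facing.conf -- HTTP workers only, job runner not loaded
[options]
server_wide_modules = base,web
workers = 8
max_cron_threads = 0

# cron.conf -- cron threads only, no HTTP workers
[options]
server_wide_modules = base,web
workers = 0
max_cron_threads = 2

# queue.conf -- every replica in the autoscaled pool loads queue_job;
# with this PR the session-level advisory lock elects a single job runner
[options]
server_wide_modules = base,web,queue_job
workers = 4
max_cron_threads = 0
```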
My understanding is that with the DB managing leader election it should be perfectly acceptable to have a dedicated auto-scaling pool of queue job only instances for distributing jobs to, that can scale up/down with demand, while leaving user-facing instances unaffected performance-wise.
If you wouldn't mind, could you confirm whether that assumption is correct? My apologies if there are any fundamental misunderstandings on my side as to how this works.
Once I've had a week to toy with the concept in staging, I'll deploy to production and advise on progress.
> My understanding is that with the DB managing leader election it should be perfectly acceptable to have a dedicated auto-scaling pool of queue-job-only instances for distributing jobs to, that can scale up/down with demand, while leaving user-facing instances unaffected performance-wise.
You can do that, yes. I'm curious about the metrics you plan to use for auto-scaling.
Note this was already feasible without this PR, with a single dedicated pod with `--load=queue_job` sending the requests to run jobs to the pool. This PR simplifies configuration, helps avoid configuration mistakes, and also helps in situations where you cannot have instances with different configs, such as odoo.sh.
Thanks, I wasn't aware of that ability previously, I'll take a look.
To be perfectly honest, when it comes to auto-scaling metrics we will be figuring it out as we go, experimenting with what works and learning what doesn't. I'm happy to report back here with our own internal notes, and would love to hear from anyone else who has advice.
At present I am thinking of WSGI request queue length, request rate, request duration (latency), the ratio of busy workers to total workers, and DB connection pool saturation.
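For the busy-worker ratio in particular, here is a hypothetical sketch of the kind of sidecar exporter I'm considering (the metric name and the get_worker_stats() helper are made up for illustration; as far as I know Odoo has no built-in metrics endpoint, so the numbers would have to come from a supervisor or the load balancer):

```python
# Hypothetical sidecar: expose a busy-worker ratio as a Prometheus gauge
# that a Kubernetes HPA (via an adapter) could scale on.
import time

from prometheus_client import Gauge, start_http_server

busy_ratio = Gauge(
    "odoo_busy_worker_ratio",
    "Busy HTTP workers divided by total HTTP workers",
)

def get_worker_stats():
    # Placeholder values; replace with real scraping logic.
    return {"busy": 3, "total": 8}

if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    while True:
        stats = get_worker_stats()
        busy_ratio.set(stats["busy"] / max(stats["total"], 1))
        time.sleep(15)
```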
I'm happy to report back that this PR is working fine in production, and has been for a few days. I will monitor closely for issues, but so far I have not encountered any hiccups.
I think we have tested sufficiently. This PR is ready to merge.
/ocabot merge minor
This PR looks fantastic, let's merge it! Prepared branch 16.0-ocabot-merge-pr-668-by-sbidoul-bump-minor, awaiting test results.
Congratulations, your PR was merged at b2ff50d7bc070fdc337b5839edb2c66eb7ef091f. Thanks a lot for contributing to OCA. ❤️
Hi @sbidoul, since this commit makes queue_job capable of running on an instance serving multiple databases, I see some scenarios where cloning a database that has failed jobs causes the job runner to fail to initialize. I believe it is due to this check, although the db is already taken from the URL since this PR: https://github.com/OCA/queue/pull/504.
https://github.com/OCA/queue/blame/7c20ae04369cc028f76e4111985882cbf7688168/queue_job/jobrunner/channels.py#L1010
Do you have any thoughts on this?
@hoangtrann queue_job has never supported adding databases without restarting. If the HA mechanism is breaking in those circumstances, though, we should investigate. Can you open another issue with reproduction details?