queue [IMP] queue_job: HA job runner using session level advisory lock

Another attempt.

Jul 02 '24 11:07 sbidoul

Hi @guewen, some modules you are maintaining are being modified, check this out!

Jul 02 '24 11:07 OCA-git-bot

Yep, this should work.

Jul 02 '24 12:07 sbidoul

@PCatinean do you know who we should ping in the odoo.sh team to have an opinion on this approach?

Jul 04 '24 13:07 sbidoul

Hi @sbidoul the only two people I know around this topic are @amigrave which gave the initial feedback on the advisory lock MR here https://github.com/OCA/queue/pull/256 and @sts-odoo which also provided some feedback on the pg_application_name approach

Jul 04 '24 13:07 PCatinean

@amigrave @sts-odoo so the TL;DR here is that we have one long lived connection to the database on which we take a session-level advisory lock and do a LISTEN. There is no long-lived transaction, so this should not impact replication.

I plan to deploy this on a odoo.sh dev env soon to see how it goes. I can PM you the details if you wish to monitor something.

Jul 04 '24 16:07 sbidoul

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days. If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

Nov 03 '24 12:11 github-actions[bot]

I plan to deploy this on a odoo.sh dev env soon to see how it goes. I can PM you the details if you wish to monitor something.

@sbidoul any feedback?

Dec 06 '24 07:12 simahawk

Feeback given in https://github.com/OCA/queue/pull/673#issuecomment-2522520503.

And rebased.

Dec 06 '24 08:12 sbidoul

sorry, why is this not merged yet?

Jan 09 '25 13:01 0yik

@simahawk @sbidoul ,

I plan to deploy this on a odoo.sh dev env soon to see how it goes. I can PM you the details if you wish to monitor something.

@sbidoul any feedback?

I'd like to run this on my staging and production GKE cluster. I would especially like to test the scaling capabilities of this in my staging environment. If I deploy this to my staging env, would either of you like the keys to my staging env and the GKE staging cluster to kick the tires and load test this with K6 or similar tools?

I would love to see this merged, and would be happy to run this in production after some load testing in staging and report back on results or allow you to monitor.

I can reach out to you via email to get this going through your company's official channels if this is something you'd like to explore.

Jan 20 '25 13:01 luke-stdev001

Hi everyone. This is not merged precisely because we would like more feedback from actual deployments. Tests are ongoing at Acsone, and I would encourage others to do the same.

Jan 20 '25 13:01 sbidoul

Hi everyone. This is not merged precisely because we would like more feedback from actual deployments. Tests are ongoing at Acsone, and I would encourage others to do the same.

Thanks. I'll get this into staging and then production and report back with findings.

Jan 21 '25 03:01 luke-stdev001

@sbidoul extremely silly question from my end, but is it safe to just pull the changes in this branch and upgrade if we're just running on the vanilla OCA/queue modules in 16.0?

I'm pulling this into our staging environment now, but if all goes well I do plan to run this in production over a few weeks and report back if I encounter issues.

Feb 28 '25 05:02 luke-stdev001

Code LGTM. I'm going to install it on one of my projects and battle test it.

It has been deployed in production for almost 3 weeks and I haven't any issue to report

Feb 28 '25 07:02 AnizR

is it safe to just pull the changes in this branch and upgrade if we're just running on the vanilla OCA/queue modules in 16.0?

@luke-stdev001 yes it should be safe. I just rebased.

Feb 28 '25 07:02 sbidoul

is it safe to just pull the changes in this branch and upgrade if we're just running on the vanilla OCA/queue modules in 16.0?

@luke-stdev001 yes it should be safe. I just rebased.

Thank you.

https://github.com/OCA/queue/issues/422#issuecomment-2628351937

Yes, same config on all instances will work.

@sbidoul ,

Seems to be working well from initial load testing in staging, thank you.

I'd like to rearchitect our GKE based HA deployment of Odoo:

User-facing app instances with server wide modules without queue job and with HTTP workers
Larger single vertically scaled instance for cron workers, no HTTP workers (current architecture we have has queue jobs on this same server for the same reasons)
Queue job only instance with server wide modules with queue job and with no HTTP workers, dedicated node pool to scale out queue instances separately to user-facing instances

My understanding is that with the DB managing leader election it should be perfectly acceptable to have a dedicated auto-scaling pool of queue job only instances for distributing jobs to, that can scale up/down with demand, while leaving user-facing instances unaffected performance-wise.

If you wouldn't mind could you confirm if that assumption is correct? My apologies if there are any fundamental misunderstandings on my side to how this works.

Once i've had a week to toy with the concept in staging i'll deploy to production and advise on progress.

Mar 01 '25 02:03 luke-stdev001

My understanding is that with the DB managing leader election it should be perfectly acceptable to have a dedicated auto-scaling pool of queue job only instances for distributing jobs to, that can scale up/down with demand, while leaving user-facing instances unaffected performance-wise.

You can do that yes. I'm curious about the metrics you plan to use for auto scaling.

Note this was already feasible without this PR, with a single dedicated pod with --load=queue_job sending the requests to run jobs to the pool. This PR simplifies configuration, helps avoiding configuration mistakes, and also helps in situations where you cannot have instances with different configs such as odoo.sh.

Mar 02 '25 10:03 sbidoul

My understanding is that with the DB managing leader election it should be perfectly acceptable to have a dedicated auto-scaling pool of queue job only instances for distributing jobs to, that can scale up/down with demand, while leaving user-facing instances unaffected performance-wise.

You can do that yes. I'm curious about the metrics you plan to use for auto scaling.

Note this was already feasible without this PR, with a single dedicated pod with --load=queue_job sending the requests to run jobs to the pool. This PR simplifies configuration, helps avoiding configuration mistakes, and also helps in situations where you cannot have instances with different configs such as odoo.sh.

Thanks, I wasn't aware of that ability previously, i'll take a look.

To be perfectly honest when it comes to auto-scaling metrics we will be figuring it out as we go along and playing with what works and learning what doesn't. I'm happy to report back here with our own internal notes and would love to hear from anyone else who has any advice.

I am thinking at present perhaps WSGI request queue length, request rate, request duration latency, ratio of busy workers to total number of workers, and then DB connection pool saturation.

I'm happy to report back that this PR is working fine in production, and has been for a few days. I will monitor closely for issues, but so far I have not encountered any hiccups.

Mar 03 '25 12:03 luke-stdev001

I think we have tested sufficiently. This PR is ready to merge.

Mar 03 '25 17:03 sbidoul

/ocabot merge minor

Mar 17 '25 12:03 sbidoul

This PR looks fantastic, let's merge it! Prepared branch 16.0-ocabot-merge-pr-668-by-sbidoul-bump-minor, awaiting test results.

Mar 17 '25 12:03 OCA-git-bot

Congratulations, your PR was merged at b2ff50d7bc070fdc337b5839edb2c66eb7ef091f. Thanks a lot for contributing to OCA. ❤️

Mar 17 '25 12:03 OCA-git-bot

hi @sbidoul , since this commit makes queue job capable of running with an instance having multiple databases, I see some scenarios where cloning a database with failed jobs causing job runner failed to initialized. I believe it was due to this check, although the db is already taken from from url with this pr https://github.com/OCA/queue/pull/504.

https://github.com/OCA/queue/blame/7c20ae04369cc028f76e4111985882cbf7688168/queue_job/jobrunner/channels.py#L1010

Do you have any thoughts on this?

Nov 20 '25 01:11 hoangtrann

@hoangtrann queue_job has never supported adding databases without restarting. If the HA mechanism is breaking in that circumstances, though, we should investigate. Can you open another issue with reproduction details?

Nov 20 '25 07:11 sbidoul