openproject
[#48717] Replace DelayedJob with GoodJob.
https://community.openproject.org/work_packages/48717
TODOS:
- [x] Create PR for SaaS changes. It extends ApplicationJob (delayed_job_extender.rb) with tenant-specific code and needs to be adapted
- [x] adapt https://github.com/opf/openproject/blob/dev/docker/prod/worker and https://github.com/opf/openproject/blob/dev/docker/prod/bimworker to work with good job
- [x] adapt https://github.com/opf/openproject/blob/dev/Procfile and https://github.com/opf/openproject/blob/dev/Procfile.dev
- [x] Adapt documentation that uses rake jobs:work
- [ ] Adapt health_checks
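For the Procfile and docker worker adaptations above, the change presumably boils down to replacing the DelayedJob rake invocation with GoodJob's own CLI. A minimal sketch of the Procfile entry (the actual flags and concurrency settings in the PR may differ):

```
worker: bundle exec good_job start
```

good_job start is GoodJob's documented executable; queue filters and thread counts can be tuned via CLI flags or GOOD_JOB_* environment variables rather than hardcoded here.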
Hi @ba1ash, thanks for your efforts. I've added a number of todos that need to happen both in the core and the saas-openproject repo to be able to get this merged.
- [ ] Trying to instantiate a project copy, I got this error on first try:
E, [2024-02-19T20:37:15.941960 #83213] ERROR -- : [ActiveJob] [CopyProjectJob] [0e0cdbb3-af42-4717-b907-a05249a89ef6] Error performing CopyProjectJob (Job ID: 0e0cdbb3-af42-4717-b907-a05249a89ef6) from GoodJob(default) in 54.13ms: ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_delayed_job_statuses_on_job_id"
DETAIL: Key (job_id)=(0e0cdbb3-af42-4717-b907-a05249a89ef6) already exists.
):
I couldn't reproduce this anymore. It might be an error on my side due to the way I ran jobs manually.
Probably, I tested locally and it works fine.
@ba1ash great work, this is looking good to me in the core. SaaS/Deployment is missing some changes that I need to check with Markus tomorrow, so let's hold off on merging this just yet.
Great! Thanks for reviewing @oliverguenther!
@ba1ash This error happened to me again just now, and to Markus as well. I think there is something wrong here after all.
It looks good overall (also the multitenancy PR), but I'm still looking into why I'm getting duplicate key errors on upserting the job status.
On a positive side note, I've tested the case where the worker is killed while working off a job. In DelayedJob that job will just be lost, because there is only the "locked_by" field, and when the worker is restarted the job won't be retried because it is still marked as locked.
GoodJob has two separate columns, started_at and finished_at, so when the worker is killed the latter will not be set and the job will eventually be retried. That's pretty neat, just like the job dashboard. Really nice!
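The recovery semantics described above can be modeled in a few lines. This is a toy sketch, not GoodJob's actual code: the Job struct stands in for a good_jobs row, and the point is only that a job whose started_at is set but whose finished_at is nil can be detected as interrupted and re-executed.

```ruby
# Toy model of GoodJob-style recovery: two timestamps instead of a single
# "locked_by" marker, so an interrupted job remains identifiable.
Job = Struct.new(:id, :started_at, :finished_at, keyword_init: true)

# A job is interrupted when execution began but never completed.
def interrupted?(job)
  !job.started_at.nil? && job.finished_at.nil?
end

now = Time.now
jobs = [
  Job.new(id: 1, started_at: now - 60, finished_at: now), # completed normally
  Job.new(id: 2, started_at: now - 60, finished_at: nil), # worker killed mid-run
  Job.new(id: 3, started_at: nil, finished_at: nil)       # still queued
]

# Only job 2 is eligible for re-execution on restart.
retryable_ids = jobs.select { |job| interrupted?(job) }.map(&:id)
```

With DelayedJob's single lock field, jobs 2 and 3 would be indistinguishable once the lock is stale, which is exactly the lost-job scenario described above.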
Okay, I will try to walk through the copying process more carefully.
I'm thinking this is a multithreading issue. I can reproduce the problem quite often.
The code fails in ApplicationJobWithStatus#upsert_status where it tries to update the job status to in_process [sic].
There, although a record with the 'in_queue' status already exists, resource.new_record? is actually true, which is not correct.
We may need some sort of double-checked locking here.
We could simply catch ActiveRecord::RecordNotUnique here and retry the whole method. That would probably already solve the issue.
The better solution, IMHO, would be to use plain SQL with what amounts to the following in the update statement:
ON CONFLICT (job_id) DO UPDATE SET serialized_params = ?
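The catch-and-retry variant could look roughly like the following. This is only a sketch of the control flow: RecordNotUnique is a stand-in for ActiveRecord::RecordNotUnique, and a plain hash simulates the delayed_job_statuses table so the race-losing path can be exercised without a database.

```ruby
# Stand-in for ActiveRecord::RecordNotUnique so the sketch runs without Rails.
class RecordNotUnique < StandardError; end

# Simulated status table keyed by job_id; real code would hit
# delayed_job_statuses instead.
STATUSES = {}

# Insert that fails the way a unique index does when the row already exists.
def insert_status(job_id, status)
  raise RecordNotUnique if STATUSES.key?(job_id)
  STATUSES[job_id] = status
end

# Double-checked upsert: attempt the insert first; if a concurrent writer won
# the race and the unique constraint fired, fall back to updating the row
# that is now guaranteed to exist.
def upsert_status(job_id, status)
  insert_status(job_id, status)
rescue RecordNotUnique
  STATUSES[job_id] = status
end

upsert_status("0e0cdbb3", "in_queue")
upsert_status("0e0cdbb3", "in_process") # second call takes the rescue path
```

In Rails itself, ActiveRecord's upsert with unique_by: :job_id would push the same ON CONFLICT logic into a single statement, assuming the status model exposes the job_id unique index.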
Thanks for addressing most of the changes I requested. Could you please have a look at the few new comments I made?
Also one question: What happens to existing delayed jobs in the course of this migration? As far as I can tell they would just be deleted and not translated into new good jobs.
But wouldn't that be necessary? I'm just imagining a scenario where a bunch of Journals::CompletedJobs are scheduled Setting.journal_aggregation_time_minutes later while the update is being performed. I just want to make sure it wouldn't be a problem if these jobs are lost. Would it just be notifications (including webhooks) that are lost? @ulferts what do you say? Is that ok, or shouldn't we migrate existing delayed jobs into the new good_jobs table? One could argue that even lost webhooks aren't optimal. But I'm wondering if I'm missing anything else.
True. They will be deleted. My assumption was that we can afford losing jobs. If we can't, I will prepare a migration.
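If a migration does turn out to be necessary, one hedged sketch of its core step: Delayed::Job's ActiveJob adapter stores a JobWrapper in the handler column as YAML, with the serialized ActiveJob payload under job_data, and extracting that hash is enough to re-enqueue the job through ActiveJob. The simplified untagged YAML below stands in for the real tagged !ruby/object:ActiveJob::QueueAdapters::DelayedJobAdapter::JobWrapper document (which would need permitted classes to parse), so treat this as an illustration of the shape, not working migration code.

```ruby
require "yaml"

# Simplified stand-in for a delayed_jobs.handler value. The real document
# carries a !ruby/object:...JobWrapper tag that YAML.safe_load rejects by
# default; a real migration would allow that class or strip the tag.
handler = <<~YAML
  job_data:
    job_class: Journals::CompletedJob
    job_id: abc-123
    queue_name: default
    arguments:
      - 42
YAML

job_data = YAML.safe_load(handler).fetch("job_data")

# In an actual migration, re-enqueueing would then be roughly:
#   ActiveJob::Base.deserialize(job_data).enqueue
# after which the delayed_jobs row can be deleted.
```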