openproject icon indicating copy to clipboard operation
openproject copied to clipboard

[#48717] Replace DelayedJob with GoodJob.

Open ba1ash opened this issue 2 years ago • 1 comments

https://community.openproject.org/work_packages/48717

TODOS:

  • [x] Create PR for SaaS changes. It extends ApplicationJob (delayed_job_extender.rb) with tenant specific code and needs to be adapted
  • [x] adapt https://github.com/opf/openproject/blob/dev/docker/prod/worker and https://github.com/opf/openproject/blob/dev/docker/prod/bimworker to work with good job
  • [x] adapt https://github.com/opf/openproject/blob/dev/Procfile and https://github.com/opf/openproject/blob/dev/Procfile.dev
  • [x] Adapt documentation that uses rake jobs:work
  • [ ] Adapt health_checks

ba1ash avatar Jun 30 '23 12:06 ba1ash

Hi @ba1ash, thanks for you efforts. I've added a number of todos that need to happen both in the core and the saas-openproject repo to be able to get this merged.

oliverguenther avatar Jan 31 '24 08:01 oliverguenther

  • [ ] Trying to instantiate a project copy, I got this error on first try:
E, [2024-02-19T20:37:15.941960 #83213] ERROR -- : [ActiveJob] [CopyProjectJob] [0e0cdbb3-af42-4717-b907-a05249a89ef6] Error performing CopyProjectJob (Job ID: 0e0cdbb3-af42-4717-b907-a05249a89ef6) from GoodJob(default) in 54.13ms: ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_delayed_job_statuses_on_job_id"
DETAIL:  Key (job_id)=(0e0cdbb3-af42-4717-b907-a05249a89ef6) already exists.
):

oliverguenther avatar Feb 19 '24 19:02 oliverguenther

  • [ ] Trying to instantiate a project copy, I got this error on first try:
E, [2024-02-19T20:37:15.941960 #83213] ERROR -- : [ActiveJob] [CopyProjectJob] [0e0cdbb3-af42-4717-b907-a05249a89ef6] Error performing CopyProjectJob (Job ID: 0e0cdbb3-af42-4717-b907-a05249a89ef6) from GoodJob(default) in 54.13ms: ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_delayed_job_statuses_on_job_id"
DETAIL:  Key (job_id)=(0e0cdbb3-af42-4717-b907-a05249a89ef6) already exists.
):

I couldn't reproduce this anymore. It might be an error on my side due to the way I ran jobs manually

oliverguenther avatar Feb 20 '24 10:02 oliverguenther

  • [ ] Trying to instantiate a project copy, I got this error on first try:
E, [2024-02-19T20:37:15.941960 #83213] ERROR -- : [ActiveJob] [CopyProjectJob] [0e0cdbb3-af42-4717-b907-a05249a89ef6] Error performing CopyProjectJob (Job ID: 0e0cdbb3-af42-4717-b907-a05249a89ef6) from GoodJob(default) in 54.13ms: ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_delayed_job_statuses_on_job_id"
DETAIL:  Key (job_id)=(0e0cdbb3-af42-4717-b907-a05249a89ef6) already exists.
):

I couldn't reproduce this anymore. It might be an error on my side due to the way I ran jobs manually

Probably, I tested locally and it works fine.

ba1ash avatar Feb 20 '24 10:02 ba1ash

@ba1ash great work, this is looking good to me in the core. SaaS/Deployment is missing some changes that I need to check with Markus tomorrow. So let's hold off merging this just yet.

Great! Thanks for reviewing @oliverguenther!

ba1ash avatar Feb 20 '24 12:02 ba1ash

  • [ ] Trying to instantiate a project copy, I got this error on first try:
E, [2024-02-19T20:37:15.941960 #83213] ERROR -- : [ActiveJob] [CopyProjectJob] [0e0cdbb3-af42-4717-b907-a05249a89ef6] Error performing CopyProjectJob (Job ID: 0e0cdbb3-af42-4717-b907-a05249a89ef6) from GoodJob(default) in 54.13ms: ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_delayed_job_statuses_on_job_id"
DETAIL:  Key (job_id)=(0e0cdbb3-af42-4717-b907-a05249a89ef6) already exists.
):

@ba1ash This error happened to me again just now, and to Markus as well. I think there is something wrong here after all.

oliverguenther avatar Feb 21 '24 14:02 oliverguenther

It looks good over all (also the multitenancy PR) but I'm still looking into why I'm getting duplicate key errors on upserting the job status.

On a positive side note, I've tested the case where the worker is killed while working off a job. In delayed job that job will just be lost, because there is only the "locked_by" field. And when the worker is restarted the job won't be re-tried again because the job is already locked.

Good job has two separate columns, started_at and finished_at so it when the worker is killed, the latter one will not be set and the job will be re-retried eventually. That's pretty neat, just like the job dashboard. Really nice!

image

machisuji avatar Feb 21 '24 15:02 machisuji

  • [ ] Trying to instantiate a project copy, I got this error on first try:
E, [2024-02-19T20:37:15.941960 #83213] ERROR -- : [ActiveJob] [CopyProjectJob] [0e0cdbb3-af42-4717-b907-a05249a89ef6] Error performing CopyProjectJob (Job ID: 0e0cdbb3-af42-4717-b907-a05249a89ef6) from GoodJob(default) in 54.13ms: ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_delayed_job_statuses_on_job_id"
DETAIL:  Key (job_id)=(0e0cdbb3-af42-4717-b907-a05249a89ef6) already exists.
):

@ba1ash This error happened to me again just now, and to Markus as well. I think there is something wrong here after all.

Okay, I will try to walk through the copying process more carefully.

ba1ash avatar Feb 21 '24 15:02 ba1ash

  • [ ] Trying to instantiate a project copy, I got this error on first try:
E, [2024-02-19T20:37:15.941960 #83213] ERROR -- : [ActiveJob] [CopyProjectJob] [0e0cdbb3-af42-4717-b907-a05249a89ef6] Error performing CopyProjectJob (Job ID: 0e0cdbb3-af42-4717-b907-a05249a89ef6) from GoodJob(default) in 54.13ms: ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_delayed_job_statuses_on_job_id"
DETAIL:  Key (job_id)=(0e0cdbb3-af42-4717-b907-a05249a89ef6) already exists.
):

@ba1ash This error happened to me again just now, and to Markus as well. I think there is something wrong here after all.

Okay, I will try to walk through the copying process more carefully.

I'm thinking this is a multithreading issue. I can reproduce the problem quite often. The code fails in ApplicationJobWithStatus#upsert_status where it tries to update the job status to in_process [sic]. There while there already is a record with the 'in_queue' status, resource.new_record? is actually true. Which is not correct.

We may need some sort of double-checked locking here. We could simply catch ActiveRecord::RecordNotUnique here and re-try the whole method. That would probably already solve the issue.

Although the better solution, IMHO, would be to use plain SQL and use what amounts to the following during the update statement.

ON CONFLICT (job_id) DO UPDATE SET serialized_params = ?

machisuji avatar Feb 21 '24 15:02 machisuji

Thanks for addressing most of the changes I requested. Could you please have a look at the few new comments I made?

Also one question: What happens to existing delayed jobs in the course of this migration? As far as I can tell they would just be deleted and not translated into new good jobs.

But wouldn't that be necessary? I'm just imagining a scenario where a bunch of Journals::CompletedJobs are scheduled Setting.journal_aggregation_time_minutes later while the update is being performed.

I just want to make sure it wouldn't be a problem if these jobs are lost. Would it it just be notifications (including web hooks) that are lost? @ulferts what do you say? Is that ok or shouldn't we migrate existing delayed jobs into the new good_job table? One could argue that even lost webhooks aren't optimal. But I'm wondering if I'm missing anything else.

True. They will be deleted. My assumption was that we can afford losing jobs. If we can't I will prepare a migration.

ba1ash avatar Feb 29 '24 15:02 ba1ash