sidekiq-unique-jobs icon indicating copy to clipboard operation
sidekiq-unique-jobs copied to clipboard

Problems with reaper and long running jobs

Open RobsonKing opened this issue 1 year ago • 11 comments

Describe the bug We have jobs that occasionally take a long time to execute (~20 minutes). These jobs are using until_and_while_executing. After about a minute a duplicate job start running. The lock appears to be removed by the reaper (our reaper interval is 1 minute).

Expected behavior Uniqueness should be maintained, even if the job runs for a long time.

Current behavior

  • First job starts executing
  • Second job is enqueued, gets blocked and put on the scheduled queue to retry. So far everything is working as expected
TestingUniqueJob jid=e888a4c9373cceef87a7e834 INFO: start
TestingUniqueJob JID-e888a4c9373cceef87a7e834]With args [31] 
TestingUniqueJob JID-e888a4c9373cceef87a7e834 Enqueued TestingUniqueJob with [31] on default (85c4ea4da6a2f6df530a4178)
TestingUniqueJob jid=e888a4c9373cceef87a7e834 elapsed=0.057 INFO: done
  • reaper runs and deletes the lock
uniquejobs=reaper INFO: Nothing to delete; exiting.
uniquejobs=reaper INFO: Deleting batch with 1 digests
INFO: (2023-05-04 21:57:00 +0000) Execution successfully returned 1
  • job that was being blocked and rescheduled now starts executing.

Worker class

class TestingUniqueJob <ApplicationSidekiqWorker
  sidekiq_options queue: 'default',
                  lock: :until_and_while_executing,
                  on_conflict: { client: :log, server: :reschedule }

  def perform(id)
    Rails.logger.info("*** [TestingUniqueJob] processing #{id} START")
    sleep(40 * 60)
    Rails.logger.info("*** [TestingUniqueJob] processing #{id} END")
  end
end

Questions

  • Is this behaviour expected?
  • Is there a way to prevent the reaper from deleting locks when jobs are still executing?
  • How does the reaper determine that a lock can be cleaned up?

RobsonKing avatar May 05 '23 15:05 RobsonKing

First of all, long running jobs are like ddosing yourself.

https://twitter.com/nateberkopec/status/1635280301709746179?s=46

https://www.linkedin.com/posts/nate-berkopec-6832b41a8_long-running-background-jobs-60-seconds-activity-7037074319537016832-fite?utm_source=share&utm_medium=member_ios

That said, I'm sure there is something that could be done to improve.

Unfortunately, I am out of commission for a few weeks so won't be able to look at it until end of May at best.

mhenrixon avatar May 06 '23 08:05 mhenrixon

We are aware that running long jobs is not ideal... In this case we only have one... but when the lock is cleaned up and other jobs start to run, we do DDOS ourselves.

Could you provide any info about how the reaper decides to clean something up? Is it just based on how old the lock is? Or does it try and check that the job is still running?

Looking through the code, it looks like belongs_to_job? tries looks to see if the digest belongs to a job that still exists. It looks like active? tries to see if the job is still active. Is this where my problem is likely to be found?

RobsonKing avatar May 08 '23 15:05 RobsonKing

It does something like this:

  1. Considers the lock active if not enough time has passed
  2. Checks what is enqueued
  3. Checks what is scheduled
  4. Checks what the workers are processing

May I ask what version you are on? I just fixed exactly your problem the other week.

mhenrixon avatar May 08 '23 16:05 mhenrixon

We are on version 7.1.29

Are you referring to https://github.com/mhenrixon/sidekiq-unique-jobs/pull/756 ?

Did this change get ported back into the version 7 branch?

RobsonKing avatar May 08 '23 19:05 RobsonKing

I don't believe I backported it. Sorry, been away clearing my head and meditating for two weeks in silence. I'll see if can get to this sometime this week.

mhenrixon avatar May 23 '23 18:05 mhenrixon

This issue is negating the core function of the gem, because locks are disappearing. Upgrading to 8 isn't really a viable option because other middlewares haven't fully caught up to the latest Sidekiq yet. Looking forward to the backport.

jpriollaud avatar Jun 28 '23 23:06 jpriollaud

This issue is negating the core function of the gem, because locks are disappearing. Upgrading to 8 isn't really a viable option because other middlewares haven't fully caught up to the latest Sidekiq yet. Looking forward to the backport.

I know! I’ll try to sort it out this week. It has been a crazy year, hard to get to the open source work.

Prioritizing this today and tomorrow.

mhenrixon avatar Jun 29 '23 06:06 mhenrixon

@mhenrixon This should be fixed as of #799 and v7.1.30, right?

rafaeelaudibert avatar Jul 19 '23 20:07 rafaeelaudibert

@mhenrixon This should be fixed as of #799 and v7.1.30, right?

Yes sir

mhenrixon avatar Jul 20 '23 05:07 mhenrixon

Versions: sidekiq 7.1.5, sidekiq-cron 1.9.1, sidekiq sidekiq-unique-jobs 8.0.3

I'm experiencing a similar issue. I have a worker FetchAssetValueWorker (lock: :until_executed, on_conflict: :reject) that runs every 10 minutes (scheduled using sidekiq-cron). Under specific circumstances, this worker may take more than 30 minutes to run. Everytime this worker is added to the dead queue for being rejected, for some reason it's also removed from redis (through delete_by_digest command):

2023-10-10T02:00:13.770Z pid=1 tid=hs1 uniquejobs=client until_executed=uniquejobs:bc7b7713595a5fb14e4ed3456bfd9049 INFO: Adding dead FetchAssetValueWorker job 83d9a5a3e96f64bbd7a920db
2023-10-10T02:00:13.775Z pid=1 tid=hs1 uniquejobs=client until_executed=uniquejobs:bc7b7713595a5fb14e4ed3456bfd9049 INFO: delete_by_digest(uniquejobs:bc7b7713595a5fb14e4ed3456bfd9049) completed in 3ms

Because of this, the worker will be duplicated if it takes more than 20 minutes to complete. I didn't see anything about this behavior in the documentation. Is this expected behavior or is it a bug? And how can I circumvent this?

democlitos avatar Oct 10 '23 02:10 democlitos

@democlitos I have a few reaper improvements on the main branch right now. It is only released once it is verified by someone that it is working better.

#830 amongst others

mhenrixon avatar Feb 07 '24 10:02 mhenrixon