good_job
good_job copied to clipboard
Race condition when running multiple processes blocked by `after_commit` hook
We have a job that generates a large file and then uploads it to ActiveStorage
. ActiveStorage
uses an after_commit
hook to process the upload after the transaction was committed. Our code looks something like this:
class Parent < ActiveRecord::Base
end
class Model < ActiveRecord::Base
has_one :parent
has_one_attached :large_file, dependent: :destroy
aasm do
state :new
state :processing
event :start_processing do
transitions from: :new, to: :processing
end
end
end
class GenerateLargeFileJob < ActiveJob::Base
def perform(model)
return if model.processing?
ActiveRecord::Base.transaction do
model.start_processing!
model.parent.update!(some: 'values')
io = generate_large_file_content # takes up to 1,5 hours
model.large_file.attach(io:, filename: 'large-file.zip', content_type: 'application/zip')
end
# upload happens "here" in after_commit hook, can take a few minutes.
end
end
If we run this with multiple processes, we find that apparently with the transaction being committed the advisory lock is already released, but the job might not be marked as finished yet. The upload is happening right after the transaction commit and can take a few minutes to complete. During that time, we saw another good_job process picking up the job again attempting to process it.
In our specific use case this second job will exit immediately with success due to the sanity check for processing?
. This is a bit over-simplified but basically that's what's happening. Of course one could ask whether we could do better error handling here, but that's not the point here. The point is that good_job is trying to work on a job that is still in progress on another processor!
The first job will eventually finish too and then raise an error:
WARNING: you don't own a lock of type ExclusiveLock
Is there any way around this? I think with sidekiq we did not see this problem because the locking was not directly tied to the Postgres Transaction, and so the lock would have only been released after the upload finished.
For now, we will disable our autoscaling rules to avoid such race conditions, but we can't keep it that way as it will severely degrade the performance during peak times, where we usually want to start up to 5 background workers to clear queues quickly.
That's not good!
Everything you've described sounds like the database connection that holds the advisory lock is being disconnected/closed. (nothing to do with transactions or after_commit
)
GoodJob uses session/connection-based Advisory Locks (not transaction-based advisory locks). So I don't think the problem is the transaction, but rather that somewhere the database connection that holds the advisory lock is being disconnected/closed/broken. The signs of that are:
- The job is being executed again by a different thread, which is a sign the connection has specifically been closed (rather than returned to the connection pool).
- The "WARNING: you don't own a lock of type ExclusiveLock" which is what happens when a new/different connection is checked-out and the code tries to release an advisory lock it didn't acquire.
Are you seeing this happen on every one of these long jobs, or just sometimes?
I'm wondering if this is like:
- Occasionally, a sign that there is some specific event, like the database hiccuping or a network applicance restarting, causing it. Or...
- Every job, which is a sign that configuration/something somewhere is intentionally closing long-lived database connections.
Thanks for your detailed answer. We don't know yet whether this happens on a regular basis. The thing is, this process does not run very often, and further testing will use up many resources at a third party provider, so we will need to monitor this closely over a longer period of time. For now we will not get new data as we disabled autoscaling to be safe. In the meanwhile, I will experiment further and try to come up with a minimal reproducible example.