Worker crashes when error.message is too long
I had an eager load that generated a very long SQL statement. It was longer than 16M, the default max_allowed_packet of MySQL, so the server rejected it with an error. delayed_job then tried to save the error message:
job.last_error = "#{error.message}\n#{error.backtrace.join("\n")}"
This error.message also contains the SQL statement, so the server rejects this save as well. The worker then shuts down. And then all other workers on all our servers follow, because delayed_jobs.attempts and run_at are never updated - every worker picks up the same poisoned job.
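(Side note for anyone debugging something similar: the server's packet limit can be read from a Rails console. A minimal sketch, assuming an ActiveRecord connection:)

# Minimal sketch: read the server's packet limit (in bytes) from a Rails console.
ActiveRecord::Base.connection.select_value("SELECT @@max_allowed_packet")
# => 16777216 when the limit is 16M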
As delayed_jobs.last_error - in the MySQL implementation - is a TEXT column and can only hold 64K anyway, I would suggest simply limiting the number of characters that are stored:
job.last_error = "#{error.message[0 ... 50 * 1024]}\n#{error.backtrace.join("\n")}"
With 50K there is still room for the stack trace, and I doubt that anything past the first 50K of an error message would add information that would be missed.
The line above would have to be patched into Delayed::Worker#run and #handle_failed_job.
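A hedged sketch of what that could look like inside handle_failed_job (only the assignment comes from this thread; the method shell and the rest of the body are assumed):

def handle_failed_job(job, error)
  # Cap the message at 50K so the 64K TEXT column keeps room for the backtrace.
  job.last_error = "#{error.message[0 ... 50 * 1024]}\n#{error.backtrace.join("\n")}"
  # ... rest of the existing failure handling ...
end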
We ran into this issue as well and tracked down the cause today.
The job fails, and then DJ tries to save the entire error message, which includes a long SQL query. That save fails with a MySQL error: Mysql2::Error: Data too long for column 'last_error'
The worker then dies. Another worker comes along and picks up this job, and dies in the same way. This happens until each worker is dead. :-(
Would be nice for DJ to truncate the last_error message to keep from failing in this way, as @jan suggested.
Unfortunately there is no reasonable way for us to determine what the value needs to be truncated to. Different servers have different limits, and those limits are also configurable. The limits get even weirder in MySQL if the table is configured as utf8mb4. All that to say: we tried, and those weren't the only problems.
Thanks for the update @albus522. Glad to hear you've considered this already. Would you have a workaround to suggest?
Something like wrapping the code in our job with a rescue block and then re-raising the exception after truncating the error message to a length we know is reasonable?
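For illustration, a minimal sketch of that idea (MyJob, do_the_work, and MAX_ERROR_LENGTH are hypothetical, and it assumes the exception class can be re-raised with just a message):

class MyJob
  MAX_ERROR_LENGTH = 10_000 # assumed to fit the column with room to spare

  def perform
    do_the_work
  rescue StandardError => e
    # Re-raise the same class with a truncated message and the original
    # backtrace, so delayed_job records a failure it can actually persist.
    raise e.class, e.message[0, MAX_ERROR_LENGTH], e.backtrace
  end
end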
This is not elegant, but placing this monkey patch in a Rails initializer sidesteps the limitation of the last_error column:
module Delayed
  module Backend
    module Base
      # Truncate the message to the column's limit (64K for a MySQL TEXT
      # column) before it is written, so the save itself cannot fail.
      def last_error=(err_message)
        col_limit = column_for_attribute(:last_error).limit || 65535
        write_attribute :last_error, err_message&.truncate(col_limit, omission: " ... (error message too long)")
      end
    end
  end
end
Is column_for_attribute(:last_error).limit reliable for different character sets? Wonkiness in this arena must be what @albus522 was alluding to.
The other obvious workaround that comes to mind is to change the type of the last_error column to LONGTEXT or similar to make more room. Admittedly, this potentially just kicks the issue down the road.
Thanks @mccolin for the reply. The approach we took was to change the column type so it can store longer strings. I think we used MEDIUMTEXT, since 16MB is still quite long. It potentially kicks the issue down the road, but it was a relatively easy way to alleviate the nasty kills-all-workers issue we were running into.
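For reference, a minimal migration sketch of that change (assumes the Rails MySQL adapter, where a text column with a limit just under 16MB maps to MEDIUMTEXT; the class name is made up and the migration version tag should match your app):

class WidenDelayedJobsLastError < ActiveRecord::Migration[6.0]
  def up
    # limit: 16.megabytes - 1 produces MEDIUMTEXT on MySQL.
    change_column :delayed_jobs, :last_error, :text, limit: 16.megabytes - 1
  end

  def down
    change_column :delayed_jobs, :last_error, :text
  end
end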
I ran into this issue today on the (somewhat legacy) app I work on. The issue was that the MySQL database wasn't configured for utf8mb4 strings, and the exception that caused the job to fail was an invalid-encoding error: Mysql2::Error: Incorrect string value.... Unfortunately, the message that was then going to be written to the last_error column triggered the same Mysql2::Error encoding error, and the job went into an infinite loop. Fortunately it was caught in testing, but it would have been pretty bad if it had made it into production.
It may be a halting problem, but would it be possible to detect these failures and rescue them somehow (or at minimum escalate them)? The real issue here for me was that the job kept spinning and the exceptions weren't surfaced in a way that could be seen.
I ended up fixing it by stripping out the 4-byte characters in a monkey patch like the one in https://github.com/collectiveidea/delayed_job/issues/550#issuecomment-253703623 , but yes, not the nicest.
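For anyone hitting the same thing, a rough sketch of that kind of filter (the helper name is made up; it assumes the column uses MySQL's 3-byte utf8, so any character that encodes to 4 bytes has to go):

# Drop characters outside the Basic Multilingual Plane, which need 4 bytes
# in UTF-8 and cannot be stored in a 3-byte "utf8" MySQL column.
def strip_four_byte_chars(message)
  message.each_char.reject { |char| char.bytesize > 3 }.join
end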
Ok, this is already captured in https://github.com/collectiveidea/delayed_job/issues/774
Here's my solution, using the activerecord gem and a before_save callback:
- config/initializers/monkey_patches/delayed_job.rb
# This is a monkey patch to limit the length of the last_error column in the delayed_jobs table
# to prevent this error: FATAL Mysql2::Error: Data too long for column 'last_error' at row 1
# https://github.com/collectiveidea/delayed_job/issues/550
module Delayed
  module Backend
    module ActiveRecord
      class Job
        before_save :truncate_last_error, if: :last_error?

        def truncate_last_error
          if last_error.present?
            col_limit = column_for_attribute(:last_error).limit || 65535
            self.last_error = last_error.truncate(col_limit, omission: ' ... (error message too long)')
          end
        end
      end
    end
  end
end
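One caveat that applies to both truncation patches above: String#truncate counts characters, while MySQL column limits are measured in bytes, so with multibyte data a string truncated to exactly the column's limit can still be too long for it. Truncating to a comfortable margin below the limit, as suggested earlier in the thread, sidesteps that.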