php-resque
php-resque copied to clipboard
Loss of data/jobs with no possibility of exceptions or logs
I am working on an application that is processing ~25k jobs per day, and we have started to notice that some jobs are not being processed or queued, more problematic is that these failed (or missing) jobs have no corresponding log entries in Symfony (via https://github.com/michelsalib/BCCResqueBundle), php-fpm, nginx or redis.
After some investigation, I believe this is being caused by the CredisException
suppression in Resque_Redis
.
Resque_Redis
and Credis_Client
Inside Resque_Redis
all methods calls are handled magically by __call
and supported methods are proxied to Credis_Client::__call
.
Credis_Client::_call
has a few failure states represented in part by a CredisException
, this can happen during connection to redis (Credis_Client::connect
) or during the call to the corresponding redis method.
If a CredisException
is raised, Resque_Redis::_call
will return false
, this is less than ideal (as it is harder for users of this library to detect/deal with redis errors). However it is ok as long as all upstream usages of Resque_Redis::_call
check for a false
return value.
Unfortunately this brings me to my specific issue of losing data in Resque::enqueue
. Resque::enqueue
calls Resque_Job::create
which in turn calls Resque::push
.
Resque::push()
public static function push($queue, $item)
{
self::redis()->sadd('queues', $queue);
$length = self::redis()->rpush('queue:' . $queue, json_encode($item));
if ($length < 1) {
return false;
}
return true;
}
The variable $length
is confusing/misleading here, if there is an exception raised by redis/credis during this call $length
is actually false
and is indicative of a failure state due to an exception (rather than an incorrect queue length). Again this is ok if the usage of this function checks for a false
return value which at this point could indicate an exception, or an erroneously empty queue.
Resque_Job::create()
If we remove the code that deals with job status/input validation we have
public static function create($queue, $class, $args = null, $monitor = false)
{
$id = md5(uniqid('', true));
Resque::push($queue, array(
'class' => $class,
'args' => array($args),
'id' => $id,
));
return $id;
}
Clearly we fail to check the return value of Resque::push()
here. Job::create
is the only usage of Resque::push()
. This means if we experience any kind of Redis error, or Credis error during Job creation the end user will be none the wiser.
Solutions?
- Raise an exception in Resque::push if
Resque_Redis::rpush
returnsfalse
- Remove suppression of
CredisException
insideResque_Redis
The only reason I have hesitated to open a pull request, to at the very least remove the exception suppression is that I do not know if it is there for a specific reason/use case for it being there.
Notes.
Any one evaluating the use of this library in their application should be very wary it due to the possibility of silent data loss as of https://github.com/chrisboulton/php-resque/commit/c335bc35558e0683f95c3fd5dd11a5817c0a761f (or v1.2)
There's a patch around someplace to return false instead of the generated job ID if job creation fails, but this project has been stalled for about a year, so it hasn't been pulled in. Would exceptions be better? Perhaps, but they wouldn't be consistent with the majority of how the project interfaces with external code. Either way, at the moment I'd feel better recommending illuminate/queue for queue management (the Redis driver operates very similarly to how PHP-Resque does) if you're looking for something actively maintained.
@danhunsaker thank you for the speedy reply. Do you have contribution rights to this repo or is it only @chrisboulton?
It would be really useful (for future teams/users) to mark this repo as abandoned if possible to avoid others running into this issue with no hope of resolution (save for an existing or new fork)
http://seld.be/notes/composer-1-0-alpha9
It's not abandoned, just yet. Just stalled for a while. I believe Chris was planning to open push access to the repo at some point soon, when he has time to properly review applicants and get the selected individuals up to speed on the direction the project is intended to move forward. At such time, progress will likely leap forward, and PHP-Resque will again become a powerful contender in the queue engine landscape.
Ah ok, thanks for the update. In the meantime is there anything we can do (other than this issue) to make users aware of this issue (and the potential lengthy timeline for a fix).
Users who do not do their due diligence when assessing packages for inclusion in their projects will not likely be aware of the problem here - and when (google) searching for background queuing solutions this lib (and the symfony bundle) are high up if not first in the results.
@peterjmit Nice work! +1! Thanks for sharing, I always wondered where that problem came from... :-)
Thanks for spending the time tracking this down - I'm definitely eager to get something in place for this.
I feel like surfacing an exception is the correct solution here - do we want to refactor everything to surface something like Resque_RedisException
whenever an exception related to the Redis connection appears?
Unless there's a chance we'd have a non-connection Redis exception to raise. If there is, I'd call it Resque_RedisConnectionException
. It might also be good to attempt to reconnect and retry the requested command first, only throwing the exception if that fails.
It would also be good, at that point, to raise exceptions on other errors we currently do not, for consistency's sake. This is a non-BC change, though, so versioning will be affected. Returning false instead of raising an exception would be a decent patch for 1.2.1, while the exception would be better overall, but require tagging under 1.3.
Credis doesn't really seem to differentiate between what is connection related and what's just normal error related - looks like CredisException
is thrown regardless so we probably can't differentiate either (hence the generic name)
Sounds good, then. :-)
@danhunsaker @chrisboulton We deployed a similar fix (which has typically yet to report any errors since we looked into it) https://github.com/peterjmit/php-resque/commit/e11af37c2e98ad64f10a6c62928f6085e5cb971d. Having an exception specific to redis is definitely an improvement
@peterjmit can you pls clarify if your issue was that there were jobs that was not picked by threads and were you able to fix it. I am facing a similar situation where if i q 30 jobs to be processed in default q some where around 20 - 25 gers picked up and processed and remaining just doesnt get picked up at all and the q state probably is cold
@raajeshnew we have not actually tracked down the source of the issue, and we are still facing problems with losing data - your experience is good to know (and will hopefully help us figure things out!)