"Job not found" errors
I intermittently get errors like this that stop the worker from processing other jobs:
16:19:38.904 [warn] [faktory] fail failure: Job not found a82e98fb13746eca74755f5d -- retrying in 32.138s
I suspect it's because I have a large backlog of jobs, and weird things happen when I deploy new versions of the worker app (I mean new versions of my code, not new versions of deps) and it has to reconnect to the manager.
So how can a job not be found? I thought there was only one canonical reference to a job. And why should such an error completely stop the worker from doing anything else even though its concurrency is set to 30 or so?
I'm using the Faktory Docker image, the faktory binary there reports version 1.0.1. I'm using the latest faktory_worker_ex client, v0.7.0.
Job not found is returned if a worker calls FAIL <jid> for a job but Faktory does not have an existing reservation for that JID. That can happen if the job takes longer than the reservation timeout, which is 30 minutes by default, so the reservation expired and was garbage collected.
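For jobs that legitimately run long, Faktory lets you raise the reservation timeout per job via the `reserve_for` field in the job payload. A minimal sketch of what such a PUSH payload looks like (the `jobtype` name is hypothetical; the JID is just the one from the log above):

```python
import json

# Sketch of a Faktory PUSH payload that raises the reservation timeout
# for a long-running job. "reserve_for" is the per-job override, in
# seconds; Faktory's default is 1800 (30 minutes), after which the
# reservation expires and a later FAIL/ACK returns "Job not found".
job = {
    "jid": "a82e98fb13746eca74755f5d",  # JID from the log above
    "jobtype": "SlowHttpJob",           # hypothetical job type
    "args": [],
    "queue": "default",
    "reserve_for": 3600,                # allow up to 60 minutes
}
wire = "PUSH " + json.dumps(job) + "\r\n"
print(wire)
```

If your jobs can genuinely run past 30 minutes, bumping `reserve_for` avoids the expired-reservation FAIL entirely.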
Are there circumstances in which a job can be perceived to have taken longer than 30 minutes? Of course there's a chance that the job itself takes that long, say if it's doing something very hard, but I'm wondering if somehow jobs might be seen to have taken a long time if I stop a queue for DB maintenance or something.
But of course the main point is still that the worker is just choking on this error and not doing anything else.
Hi @tombh. I was not aware that the FAIL command could return "Job not found"; there is not much documentation on the Faktory protocol: https://github.com/contribsys/faktory/wiki/Worker-Lifecycle#report-result . So I thought there were only two outcomes: "OK" or a network error.
It would be super nice to have a "protocol" wiki page for all the commands!
@mperham Thanks for the explanation! I wouldn't mind taking a stab at writing up the protocol docs if you could link me the Go code.
@tombh, as a quick fix, I'll publish 0.7.2 tonight which handles that case (and I guess just logs a warning when it happens). But if you're sure your jobs aren't taking over 30 mins, then it's a bit worrisome.
@cjbottaro Faktory uses the RESP protocol. You should handle any -ERR response as a protocol error.
https://redis.io/topics/protocol#resp-errors
-ERR indicates a generic error; there can also be more specific error codes that indicate conditions the worker might want to respond to. One example is -NOTUNIQUE, returned when pushing a new job that violates Faktory Pro's job uniqueness feature.
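To make the distinction concrete, here's a minimal sketch (in Python, not faktory_worker_ex's actual code) of classifying a single RESP reply line, so a client can treat -ERR as a protocol-level error rather than a network failure, and branch on specific codes like NOTUNIQUE:

```python
def parse_resp_line(line: str):
    """Classify a single RESP simple-string or error reply.

    Returns a (kind, detail) tuple: ("ok", message) for "+..." replies,
    ("error", code_and_message) for "-..." replies. Error replies may
    start with a specific code (e.g. NOTUNIQUE) the client can match on.
    """
    line = line.rstrip("\r\n")
    if line.startswith("+"):
        return ("ok", line[1:])
    if line.startswith("-"):
        return ("error", line[1:])
    raise ValueError("unexpected RESP reply: " + line)

# A FAIL for an expired reservation comes back as an error, not +OK.
# The worker should log it and move on to the next job, rather than
# retrying forever as if the connection were down:
kind, detail = parse_resp_line("-ERR Job not found a82e98fb13746eca74755f5d\r\n")

# Specific error codes can be matched on the first word of the detail:
code = parse_resp_line("-NOTUNIQUE Job not unique\r\n")[1].split()[0]
```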
Awesome, thank you!
@tombh Can you try out the fail-and-ack-errors branch? It's kind of hard to test this since Elixir's mocking and stubbing isn't as loose as in some other languages. The current test suite passes on that branch though.
Sure! I've got a queue of 1.9 million jobs, I'll get your branch deployed and leave it running all day.
Wow, that's a lot of jobs.
And why should such an error completely stop the worker from doing anything else even though its concurrency is set to 30 or so?
For posterity, it's because I was not aware that Faktory's protocol was RESP.
The code assumes any error talking to the Faktory server is due to a networking issue. And since the code uses Connection, it also assumes the connection will self-heal, and thus it retries any failed communication with the Faktory server indefinitely (with capped exponential backoff).
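For reference, a capped-exponential-backoff retry delay looks something like this (a sketch of the policy described above, in Python; the base and cap values are made up, not faktory_worker_ex's actual settings):

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Capped exponential backoff with jitter.

    The delay doubles on each attempt but is capped at `cap` seconds,
    with random jitter so many reconnecting workers don't all hammer
    the server at the same instant.
    """
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)

# Attempts 0..6 grow toward the cap: 0.5, 1, 2, 4, 8, 16, 30 (before jitter)
```

The failure mode in this issue is that a protocol error ("Job not found") was fed into this loop, so the worker retried the same FAIL forever instead of treating it as a final answer and moving on.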
Ah I see, that makes sense. Thanks so much for the quick fix.
I've got it deployed now and it's handled a few thousand jobs already without a problem. I'd say there were about 4 of these "Job not found" errors in the last 24 hours, so by this time tomorrow we should know if your branch is a good fix.
I forgot to answer this:
But if you're sure your jobs aren't taking over 30 mins, then it's a bit worrisome
I haven't looked into this closely, but with all these jobs running, all involving HTTP requests, it's all but certain that the 30-minute limit is being hit at some point.
Ok, so after 24 hours and a few hundred thousand jobs I haven't seen any problems :)