
Compaction task fails entirely when an exception is thrown during a job

Open · patchwork01 opened this issue 7 months ago · 1 comment

Description

When an exception is thrown during a compaction job, the task fails completely and terminates. The job stays on the queue and is retried once the message visibility timeout (15 minutes by default) expires.

If a job fails completely and is not returned to the queue (e.g. because it has been sent to the dead letter queue), its files will never be compacted, since they remain assigned to that job.

Expected behaviour

A compaction job failing should not prevent a compaction task from continuing to process jobs.

A job which fails should be released back to the compaction job queue to be retried.

It will then automatically be moved to the dead letter queue if it has been retried too many times (this is built-in behaviour in SQS given that we've configured a dead letter queue). If a job ends up on the dead letter queue, the files can be left assigned to that job until a human deals with it.
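The expected behaviour above can be sketched roughly as follows. This is only an illustration in Python, not Sleeper's code: `InMemoryQueue`, `run_task`, `compact` and `max_jobs` are all hypothetical names, and the in-memory queue merely stands in for the real compaction job queue. In SQS, `release` would correspond to changing the message's visibility timeout to 0 so that it becomes visible again immediately.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for the compaction job queue (not a real SQS client)."""

    def __init__(self, jobs):
        self.visible = deque(jobs)  # jobs that can be received
        self.in_flight = set()      # jobs received but not yet deleted or released

    def receive(self):
        """Return the next visible job, or None if the queue is empty."""
        if not self.visible:
            return None
        job = self.visible.popleft()
        self.in_flight.add(job)
        return job

    def delete(self, job):
        """Remove a successfully processed job from the queue."""
        self.in_flight.discard(job)

    def release(self, job):
        """Make a failed job visible again straight away, so it can be retried."""
        self.in_flight.discard(job)
        self.visible.append(job)


def run_task(queue, compact, max_jobs):
    """Process up to max_jobs jobs; a failing job does not stop the task."""
    completed, failed = [], []
    for _ in range(max_jobs):
        job = queue.receive()
        if job is None:
            break
        try:
            compact(job)
        except Exception:
            # Release the job for a retry and carry on with the next one,
            # rather than letting the exception terminate the whole task.
            queue.release(job)
            failed.append(job)
            continue
        queue.delete(job)
        completed.append(job)
    return completed, failed
```

With jobs `a`, `bad` and `c`, where `compact` raises on `bad`, a task limited to three jobs completes `a` and `c` and leaves `bad` on the queue for a later retry. The `max_jobs` bound is there only so the sketch terminates; how a real task decides when to stop is outside what this issue describes.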

Background

This is also related to a separate issue where if a compaction job fails its state store update, the file will never be compacted:

  • https://github.com/gchq/sleeper/issues/1412

patchwork01, Nov 14 '23 15:11

A job which fails should either be retried, or should result in the files being freed up to be assigned to another compaction job.

I'm not sure that, if a job fails a few times, we should free the files up to be assigned to another compaction job. If, for example, the compaction job is failing because one of the files is malformed, then this will just propagate the problem indefinitely. Or imagine that an iterator fails if a field takes a certain value: a file containing that value can never be compacted, so we don't want it to end up in another job, as it still won't work. I'd suggest we try a job N times and then, if it still fails, raise that as an issue for a human to investigate. We already try a job multiple times if it fails, before it eventually ends up on the dead-letter queue. The intention is that messages on the dead-letter queue should be investigated manually. We could surface dead letters to an SNS topic to make it more obvious that something needs looking at.
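As a rough illustration of the "try N times, then hand over to a human" idea, here is a minimal Python sketch. The names `process_with_retries`, `compact`, `max_attempts` and `notify` are hypothetical. In a real deployment the retry count would be enforced by the queue's dead-letter configuration rather than by a loop like this, and `notify` would be whatever publishes the alert.

```python
def process_with_retries(job, compact, max_attempts, notify):
    """Try a job up to max_attempts times, then dead-letter it and raise an alert.

    The files are deliberately NOT freed for another job on failure, since a
    malformed file would only make the next job fail in the same way.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            compact(job)
            return "completed"
        except Exception as error:
            last_error = error  # remember the most recent failure for the alert
    notify(
        f"Compaction job {job} dead-lettered after "
        f"{max_attempts} attempts: {last_error}"
    )
    return "dead-lettered"
```

A job that fails once and then succeeds is reported as completed with no alert. A job that fails every time produces exactly one notification and keeps its files assigned until someone investigates.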

gaffer01, Nov 14 '23 16:11