Failed cleanups aren't actionable

Open adejanovski opened this issue 2 months ago • 1 comment

What is missing?

After a scale-out operation, cleanup is triggered using a CassandraTask. If cleanup fails on one of the database pods, it won't be retried and the failure is only reported in the CassandraTask status. This doesn't allow proper alerting, as there's no way to turn it into a metric. CassandraTasks also get deleted fairly quickly after completion, so after a certain point we can no longer tell whether the last cleanup succeeded. It also doesn't seem possible to know which pod failed the task; we only have counters.

The first thing we need is a metric exposing the number of failed pods for a CassandraTask, with the task type as a label.
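
A minimal sketch of what such a metric could look like, assuming the Prometheus Go client; the metric name and labels are placeholders for illustration, not an existing series:

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric shape for the request above; name and labels are
// illustrative, not something cass-operator exposes today.
var taskFailedPods = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "cass_operator_cassandratask_failed_pods",
		Help: "Number of pods that failed a CassandraTask.",
	},
	// task_type distinguishes cleanup from other task kinds.
	[]string{"task_type", "datacenter"},
)

func init() {
	// Register with the default registry; a controller-runtime based operator
	// would more likely register against controller-runtime's metrics.Registry.
	prometheus.MustRegister(taskFailedPods)
}
```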

We should also consider variable TTLs depending on the task type, to allow a longer inspection window, and report the list of pod names that failed the task. A failed cleanup can have serious consequences, so we should also consider a retry policy: SSTables that have already been cleaned up are skipped, so a retry doesn't create much overhead.
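
A rough sketch of the kind of status reporting this would imply; the field names are hypothetical and not part of the current CassandraTask API:

```go
package sketch

// Hypothetical additions to the task status, purely to illustrate the ask:
// the existing counters can't tell us which pod failed, so a list of pod
// names is proposed alongside them. Field names are made up for this sketch.
type TaskStatusSketch struct {
	Succeeded  int      `json:"succeeded,omitempty"`  // counter, as today
	Failed     int      `json:"failed,omitempty"`     // counter, as today
	FailedPods []string `json:"failedPods,omitempty"` // proposed: pods whose cleanup job failed
}
```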

On large clusters, running cleanup sequentially is too slow and delays the next scaling operation (depending on the chosen strategy, but still). We need options to specify the number of compactors used by the cleanup task, and the ability to run cleanup on all nodes of a rack at once.
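
A rough sketch of the tuning knobs this would need, with hypothetical field names; the only grounded piece is that `nodetool cleanup` already accepts a `-j/--jobs` flag controlling how many SSTables are cleaned up concurrently:

```go
package sketch

// Hypothetical options for the cleanup task; these fields do not exist today.
// Jobs would map to `nodetool cleanup -j <n>`, and Parallelism would let the
// operator run every node of a rack at the same time instead of strictly one
// pod after another.
type CleanupOptionsSketch struct {
	Jobs        int    `json:"jobs,omitempty"`        // concurrent cleanup jobs per node
	Parallelism string `json:"parallelism,omitempty"` // e.g. "rack" to process a whole rack at once
}
```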

Why is this needed?

The way CassandraTasks work, and the cleanup task in particular, creates operational challenges. We need to improve the production experience by providing proper monitoring for these features, along with more options to tune the pressure distributed tasks put on the cluster.

adejanovski avatar Oct 09 '25 08:10 adejanovski

CassandraTasks get deleted by default after 24 hours, so it's not that short (and one can modify the time obviously).

burmanm avatar Oct 09 '25 08:10 burmanm