cassandra-data-migrator

Spark jobs success/fail status in the UI/metrics

Open Skunnyk opened this issue 8 months ago • 3 comments

Hi :-)

I know nothing about Spark jobs themselves, but I run CDM in standalone mode, and the Spark UI is available on port 4040.

I can see the job (numParts) running, and the web UI summary shows all jobs as "SUCCESS", even when they are "FAIL" for CDM. Is this expected?

I enabled the Spark Prometheus metrics (in spark-3.5.5-bin-hadoop3-scala2.13/conf/metrics.properties) to follow the successful/failed jobs, but as they are based on the same information, everything is "success" :)

I can still follow the trackRun table for FAIL, but I wonder if there is another way.

Thank you,

Skunnyk avatar Apr 24 '25 10:04 Skunnyk

Thanks for your question @Skunnyk! What were you looking to find out using the Spark UI here during the migration?

msmygit avatar Apr 29 '25 16:04 msmygit

Hi @Skunnyk,

CDM tracks & handles failures internally within each parallelized Spark task. Hence the Spark UI will report everything as SUCCESS because from Spark's perspective, the tasks complete without errors.

Spark only tracks the overall status of tasks (e.g., successful, failed, or running), whereas CDM tasks have their own detailed life-cycle (NOT_STARTED, STARTED, PASS, FAIL, DIFF, DIFF_CORRECTED, ENDED). We do not allow failures to quit the tasks abruptly, as there are other reporting/cleanup actions that happen even after a failure.
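
As a rough illustration of the behaviour described above, here is a toy Python sketch. It does not use Spark and is not CDM's actual code: `migrate_partition`, `run_task` and `run_job` are hypothetical names, and a thread pool stands in for the scheduler. Because each task catches its own exception and only records the status, the scheduler sees every task as successful.

```python
from concurrent.futures import ThreadPoolExecutor


def migrate_partition(part_id):
    """Hypothetical stand-in for the per-partition work (not CDM's real code)."""
    if part_id % 3 == 0:
        raise RuntimeError(f"write timeout on partition {part_id}")


def run_task(part_id, track):
    """Run one partition, record its status, and always return normally."""
    track[part_id] = "STARTED"
    try:
        migrate_partition(part_id)
        track[part_id] = "PASS"
    except Exception:
        # The failure is recorded instead of re-raised, so reporting and
        # cleanup can still run and the scheduler never sees the error.
        track[part_id] = "FAIL"
    return part_id


def run_job(num_parts):
    """Return (what the scheduler observes, what the tracking dict recorded)."""
    track = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_task, p, track) for p in range(num_parts)]
        # A scheduler can only observe whether the task itself raised.
        scheduler_view = ["FAILED" if f.exception() else "SUCCESS" for f in futures]
    return scheduler_view, track
```

With `run_job(6)`, the scheduler view contains only "SUCCESS" entries, while the tracking dict marks partitions 0 and 3 as "FAIL".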

We may be able to tweak the app to report failures to the Spark UI (although we would prefer not to), but using Spark UI for reporting was never the plan. Our recommended way to track/monitor the jobs is via the trackrun feature.

pravinbhat avatar Apr 29 '25 17:04 pravinbhat

Hi, thank you for your answers.

@msmygit: I wanted to use the Prometheus metrics generated by the Spark process to follow/graph the successful/failed tasks, because the migration processes will run for a couple of days :-)
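
One possible workaround, sketched here under assumptions: read the per-partition statuses from the tracking table yourself and render them in the Prometheus text exposition format. The function below only does the rendering. The metric name `cdm_partitions`, the label names and the idea of feeding it a plain list of status strings are all hypothetical, and how the statuses are fetched from Cassandra is left out.

```python
from collections import Counter


def to_prometheus_text(statuses, table, metric="cdm_partitions"):
    """Render status counts as Prometheus text exposition format.

    `statuses` is a list of status strings such as "PASS" or "FAIL".
    Label values are assumed to be simple (no quotes or backslashes),
    so no escaping is done here.
    """
    counts = Counter(statuses)
    lines = [
        f"# HELP {metric} Number of partitions per tracking status.",
        f"# TYPE {metric} gauge",
    ]
    for status in sorted(counts):
        lines.append(
            f'{metric}{{table="{table}",status="{status}"}} {counts[status]}'
        )
    return "\n".join(lines) + "\n"
```

The resulting text could, for example, be written to a file picked up by node_exporter's textfile collector, or served by a small HTTP endpoint that Prometheus scrapes.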

@pravinbhat Ok that was what I thought, thank you.

CDM logs/output are a bit hard to follow (this can be improved with some log4j configuration, I guess), and with the trackRun feature it can be hard to see where we are when multiple runs (previousId) are done for a big table with failed tasks.
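
To make the "where are we" question concrete, here is a minimal sketch of how the tracking rows from several runs could be collapsed into one view. It assumes each row can be reduced to a `(run_id, token_min, token_max, status)` tuple, that a rerun reuses the same token ranges, and that a larger `run_id` means a later run. These are assumptions for illustration, not a description of the actual trackRun schema.

```python
from collections import Counter


def latest_status_per_range(rows):
    """Keep only the most recent status for each (token_min, token_max) range.

    `rows` is an iterable of (run_id, token_min, token_max, status) tuples.
    """
    latest = {}
    # Process runs oldest first so later runs overwrite earlier statuses.
    for run_id, token_min, token_max, status in sorted(rows, key=lambda r: r[0]):
        latest[(token_min, token_max)] = status
    return latest


def progress(rows):
    """Count token ranges by their latest status across all runs."""
    return Counter(latest_status_per_range(rows).values())
```

For example, if run 1 leaves two ranges in FAIL and run 2 fixes one of them, `progress` reports the fixed range as PASS and only the remaining one as FAIL.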

Skunnyk avatar Apr 30 '25 09:04 Skunnyk