
Spark Metrics show failed status when the actual final state of the spark application is completed

Open · amit-mazor opened this issue 3 years ago • 1 comment

Separate the spark_app_failure_count metric into "failed sparkapp" and "retries of sparkapp"

As I'm monitoring my scheduled Spark applications, I've noticed that there is no separation between:

  • A Spark application that failed completely, meaning its final status is FAILED
  • A failed submission that, after a retry, completed successfully.

Currently, the metric spark_app_failure_count increments on every failure event, including retries, which makes it impossible to monitor retries and failed jobs separately.

I think retries are really important to monitor, since they point to something wrong in my job, but they shouldn't be counted together with failed jobs; they should have a separate metric.

By having two separate metrics, we will be able to monitor failed jobs whose final status is FAILED, and to monitor retries, which can tell us that something is wrong even though the job still completes successfully.
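For illustration, here is a minimal sketch of how the two counters could be kept apart, assuming the operator's metrics go through the Prometheus Go client; the spark_app_retry_count name and the variable names are assumptions for this sketch, not the operator's actual code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Incremented only when a SparkApplication reaches the terminal FAILED state.
	sparkAppFailureCount = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "spark_app_failure_count",
		Help: "Number of SparkApplications whose final state is FAILED.",
	})

	// Incremented on every failed submission attempt that is retried,
	// regardless of whether the application eventually succeeds.
	// Hypothetical metric name, used here only to illustrate the split.
	sparkAppRetryCount = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "spark_app_retry_count",
		Help: "Number of submission retries across all SparkApplications.",
	})
)

func init() {
	prometheus.MustRegister(sparkAppFailureCount, sparkAppRetryCount)
}
```

With this split, an alert on spark_app_failure_count would only fire for applications that actually ended in FAILED, while spark_app_retry_count could drive a separate, lower-severity alert.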

The spark_app_executor_failure_count metric also increments when the sparkapp completes successfully:

Since I'm using spot instances and my nodes run in auto scaling groups, it can take some time until all of the executor pods get their machines. Because of that, some pods take longer than others to come up, and in the meantime the pods that already started are executing the tasks. Sometimes the Spark application completes before all of the pods come up, and the driver's temporary svc is deleted. The executors that took longer to start then can't find the driver's svc, and the 'executors failed' metric is incremented.

I think the 'executors failed' metric shouldn't be incremented if the Spark application completed successfully; it should be incremented only when the application fails and cannot complete because of those executor failures. That would tell us there is an actual error in the executors that prevents the Spark job from completing.
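One possible way to get that behavior, sketched here under the assumption that the operator can observe both executor failures and the application's terminal state; all function names, the map-based buffering, and the hooks are hypothetical, not existing operator code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var sparkAppExecutorFailureCount = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "spark_app_executor_failure_count",
	Help: "Executor failures, counted only for applications that end in FAILED.",
})

// pendingExecutorFailures buffers executor failures per application until the
// application reaches a terminal state (not goroutine-safe; a real controller
// would need locking).
var pendingExecutorFailures = map[string]float64{}

func init() {
	prometheus.MustRegister(sparkAppExecutorFailureCount)
}

// recordExecutorFailure is a hypothetical hook called when an executor pod fails.
func recordExecutorFailure(appKey string) {
	pendingExecutorFailures[appKey]++
}

// onAppTermination is a hypothetical hook called once the application reaches
// its final state; only failed applications contribute their executor failures.
func onAppTermination(appKey string, finalStateFailed bool) {
	if finalStateFailed {
		sparkAppExecutorFailureCount.Add(pendingExecutorFailures[appKey])
	}
	delete(pendingExecutorFailures, appKey)
}
```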

Add a reason label to "spark_app_failure_count" metric

Currently, I'm seeing the spark_app_failure_count metric increment when the driver fails, when the executors fail to complete their tasks, and on retries. So if I want an alert on a failed SparkApplication, I need to dive in and look for the reason behind the alert by checking the driver pod and the executor pods. I think it would be very helpful to see the reason as part of the metric labels.
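A minimal sketch of what that could look like with the Prometheus Go client; the label values ("driver_failed", "executor_failed", "submission_retry") are only examples of possible reasons, not values the operator currently emits:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// sparkAppFailureCount partitions failures by a "reason" label so alerts can
// target a specific cause without inspecting pods.
var sparkAppFailureCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "spark_app_failure_count",
		Help: "Number of SparkApplication failures, partitioned by reason.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(sparkAppFailureCount)
}

// Example usage inside the controller's state-update handler (hypothetical):
//   sparkAppFailureCount.WithLabelValues("driver_failed").Inc()
//   sparkAppFailureCount.WithLabelValues("executor_failed").Inc()
//   sparkAppFailureCount.WithLabelValues("submission_retry").Inc()
```

An alerting rule could then filter on the reason label, for example alerting only on driver failures while treating retries as a warning.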

amit-mazor · Jun 27 '22 07:06

Updated the issue and added two additional metrics issues.

amit-mazor · Jun 27 '22 11:06