
Missing metrics after upgrading to Spark 3.0

Open teddyhartanto opened this issue 5 years ago • 6 comments

Hi,

We upgraded our Spark cluster to 3.0 a while ago and realised that a few of the metrics we were tracking are no longer being exported (from both the Spark Operator and the Spark executors). In particular, we're looking at these metrics:

spark_app_executor_failure_count
spark_executor_threadpool_completetasks
spark_executor_shufflebyteswritten_count
spark_executor_shuffleremotebytesread_count
spark_executor_shuffleremotebytesreadtodisk_count
spark_executor_diskbytesspilled_count
spark_executor_threadpool_activetasks

In our SparkApplication spec we have this:

  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: "/opt/prometheus/jmx_prometheus_javaagent-0.12.0.jar"
      port: 8090

We're using gcr.io/spark-operator/spark-operator:v1beta2-1.1.2-3.0.0 for our spark operator image.

Has anybody faced the same issue?

teddyhartanto avatar Dec 15 '20 09:12 teddyhartanto

Did you find an answer to your question? If so, do you mind sharing your insights? Thanks!

mcd01 avatar Mar 23 '21 07:03 mcd01

Hi @mcd01, no answer yet unfortunately

teddyhartanto avatar Mar 29 '21 10:03 teddyhartanto

@TeddyHartanto I ran into this as well. I found an interesting PR for the Prometheus JMX exporter which adds an example config for Spark 3. What stuck out to me was that this was needed in the first place 😛 . A simple diff between spark.yaml and spark-3-0.yaml shows that the pattern matchers have to add a `, type=gauges` or `, type=counters` clause. After applying similar changes to the Prometheus config that spark-operator provides by default, I am finally able to get all my metrics! It's a bit unfortunate to have to configure this externally, but it is an option: you can either embed the Prometheus YAML directly as `.spec.monitoring.prometheus.configuration`, or make the config available in your container and point to it with `.spec.monitoring.prometheus.configFile`.

The prometheus config that has been modified to provide all the same metrics under spark3 is available in this gist. Hope it helps!

@liyinan926 do you think we could get this incorporated into the project, since many metrics are missing with the current default configs under spark3?

Disclaimer: I haven't done all due diligence here to provide the optimal config. Spark 3 may have some additional patterns that could be leveraged. Furthermore, there could be some things pruned or tweaked, maybe even to the extent that the same config could be used for both spark 2 & 3. I haven't done all the work here. But the simplest solution got me a long way with my missing metrics problem, so that's what's here.
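To illustrate the shape of the change, here is a rough sketch of one rule before and after (the concrete patterns, metric names, and labels in the linked gist may differ; treat this as an approximation of the diff, not the actual default config):

```yaml
rules:
  # Spark 2.x: executor threadpool beans were matched without a type clause
  - pattern: "metrics<name=(\\S+)\\.(\\S+)\\.executor\\.threadpool\\.(\\S+)><>Value"
    name: spark_executor_threadpool_$3
    labels:
      app_id: "$1"
      executor_id: "$2"

  # Spark 3.x: the same beans now carry a type attribute, so the pattern
  # needs an explicit ", type=gauges" (or ", type=counters" for counter
  # metrics) appended inside the bean name, or it silently matches nothing
  - pattern: "metrics<name=(\\S+)\\.(\\S+)\\.executor\\.threadpool\\.(\\S+), type=gauges><>Value"
    name: spark_executor_threadpool_$3
    labels:
      app_id: "$1"
      executor_id: "$2"
```

A metric that stops matching doesn't produce an error anywhere, which is why the metrics just quietly disappear after the upgrade.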

srstrickland avatar Aug 24 '21 19:08 srstrickland

We also came across this problem when using Spark 3.2.0 to test Structured Streaming: the inputRate-related metrics are entirely missing.

Myasuka avatar Jan 26 '22 06:01 Myasuka

  "expr": "spark_executor_threadpool_completetasks{pod=~\".*$exec.*\"}",

@srstrickland can you please help me figure out what I did wrong in getting executor completed tasks?

sai261308 avatar Nov 09 '22 08:11 sai261308

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 14 '24 04:10 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Nov 03 '24 06:11 github-actions[bot]