Missing metrics after upgrading to Spark 3.0
Hi,
We upgraded our Spark cluster to 3.0 a while ago and realised that a few of the metrics we were tracking are no longer being exported (from both the Spark Operator and the Spark executors). In particular, we're looking at these metrics:
```
spark_app_executor_failure_count
spark_executor_threadpool_completetasks
spark_executor_threadpool_activetasks
spark_executor_shufflebyteswritten_count
spark_executor_shuffleremotebytesread_count
spark_executor_shuffleremotebytesreadtodisk_count
spark_executor_diskbytesspilled_count
```
In our SparkApplication spec we have this:
```yaml
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/opt/prometheus/jmx_prometheus_javaagent-0.12.0.jar"
    port: 8090
```
We're using `gcr.io/spark-operator/spark-operator:v1beta2-1.1.2-3.0.0` for our Spark Operator image.
Has anybody faced the same issue?
Did you find an answer to your question? If so, do you mind sharing your insights? Thanks!
Hi @mcd01, no answer yet unfortunately
@TeddyHartanto I ran into this as well. I found an interesting PR for the Prometheus JMX exporter that adds an example config for Spark 3. What stuck out to me was that this was needed in the first place 😛. A simple diff between the `spark.yaml` and `spark-3-0.yaml` examples shows that the pattern matchers need an additional `, type=gauges` or `, type=counters` clause. After applying similar changes to the Prometheus config that spark-operator provides by default, I am finally able to get all my metrics! It's a bit unfortunate to have to configure this externally, but it is an option: you can embed the Prometheus YAML directly as `.spec.monitoring.prometheus.configuration`, or make the config available in your container and point to it with `.spec.monitoring.prometheus.configFile`.
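To illustrate the change (this is only a sketch of the `, type=...` addition, not the full config; the complete set of rules is in the gist linked below, and the exact patterns in the default spark-operator config differ):

```yaml
rules:
  # Spark 2.x-style rule: executor metric MBeans matched without a type qualifier
  # - pattern: "metrics<name=(\\S+)\\.(\\S+)\\.executor\\.(\\S+)><>Value"
  #   name: spark_executor_$3

  # Spark 3.x: the MBean names now carry a ", type=gauges" / ", type=counters"
  # qualifier, so the patterns must include it or nothing matches.
  - pattern: "metrics<name=(\\S+)\\.(\\S+)\\.executor\\.(\\S+), type=gauges><>Value"
    name: spark_executor_$3
  - pattern: "metrics<name=(\\S+)\\.(\\S+)\\.executor\\.(\\S+), type=counters><>Count"
    name: spark_executor_$3_count
```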
The Prometheus config, modified to provide all the same metrics under Spark 3, is available in this gist. Hope it helps!
@liyinan926 do you think we could get this incorporated into the project, since many metrics are missing with the current default configs under spark3?
Disclaimer: I haven't done all due diligence here to provide the optimal config. Spark 3 may have some additional patterns that could be leveraged. Furthermore, there could be some things pruned or tweaked, maybe even to the extent that the same config could be used for both spark 2 & 3. I haven't done all the work here. But the simplest solution got me a long way with my missing metrics problem, so that's what's here.
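For anyone wiring this up, here is a sketch of where the custom config plugs into the `SparkApplication` spec. Field names are from the v1beta2 API mentioned above; the inline rule and the `configFile` path are illustrative placeholders, and the real config body would come from the gist:

```yaml
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/opt/prometheus/jmx_prometheus_javaagent-0.12.0.jar"
    port: 8090
    # Option 1: embed the JMX exporter config inline
    configuration: |
      lowercaseOutputName: true
      rules:
        - pattern: "metrics<name=(\\S+)\\.(\\S+)\\.executor\\.(\\S+), type=gauges><>Value"
          name: spark_executor_$3
    # Option 2: reference a config file already present in the container image
    # (path is hypothetical)
    # configFile: "/etc/metrics/conf/prometheus.yaml"
```

Note that `configuration` and `configFile` are alternatives; set only one of them.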
We also came across this problem when using Spark 3.2.0 to test Structured Streaming; the inputRate-related metrics are missing entirely.
"expr": "spark_executor_threadpool_completetasks{pod=~\".*$exec.*\"}",
@srstrickland can you please help me figure out what I did wrong in getting the executor completed-tasks metric?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.