Intermittent `FileNotFoundException` for Prometheus Configuration in Spark Driver Pods
I am experiencing an intermittent issue with the Spark Operator's monitoring feature, specifically when it is configured to expose metrics to Prometheus. Occasionally, the Spark driver fails to start due to a `FileNotFoundException` related to the Prometheus configuration file. The error indicates that the `/etc/metrics/conf/prometheus.yaml` file is not found, even though the relevant ConfigMap (`spark-pi-test-prom-conf`) exists within the cluster. This issue does not occur on every application deployment but happens occasionally.
Environment
- Spark Operator Version: latest version
- Kubernetes Version: 1.25
- Cloud Provider: AWS - EKS
- Installation Method: Helm chart
Configuration
Monitoring is enabled with the following configuration in the Spark Operator, aiming to expose both driver and executor metrics to Prometheus:
monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/prometheus/jmx_prometheus_javaagent.jar"
    port: 8090
Error Observed
Caused by: java.io.FileNotFoundException: /etc/metrics/conf/prometheus.yaml (No such file or directory)
Steps to Reproduce
- Enable monitoring in the Spark Operator with Prometheus metrics exposure as described above.
- Deploy a Spark application managed by the Spark Operator.
- Observe the startup logs of the Spark driver pod for the intermittent `FileNotFoundException` related to the Prometheus configuration.
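Because the failure is intermittent, it may help to quantify how often it occurs across repeated deployments. The following is a minimal sketch, assuming the driver logs of several runs have already been collected as plain text. The helper names `has_prom_conf_error` and `failure_rate` are hypothetical and not part of the Spark Operator.

```python
import re

# Pattern for the error line quoted in this report.
PROM_CONF_ERROR = re.compile(
    r"java\.io\.FileNotFoundException: /etc/metrics/conf/prometheus\.yaml"
)


def has_prom_conf_error(log_text: str) -> bool:
    """Return True if the driver log contains the missing-config error."""
    return PROM_CONF_ERROR.search(log_text) is not None


def failure_rate(logs_by_app: dict[str, str]) -> float:
    """Fraction of collected driver logs that show the error.

    `logs_by_app` maps an application (or run) name to its driver log text.
    Returns 0.0 when no logs were supplied.
    """
    if not logs_by_app:
        return 0.0
    failed = sum(has_prom_conf_error(text) for text in logs_by_app.values())
    return failed / len(logs_by_app)
```

Running the same application many times and feeding each driver log into `failure_rate` gives a rough reproduction rate, which could make it easier to tell whether a configuration change actually reduces the failures.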
Expected Behavior
The Spark driver and executor pods should successfully mount the Prometheus configuration from the `spark-pi-test-prom-conf` ConfigMap, start without errors, and expose metrics to Prometheus on port 8090.
Actual Behavior
Intermittently, the Spark driver pod fails to start due to a `FileNotFoundException` for `/etc/metrics/conf/prometheus.yaml`. This suggests that the `spark-pi-test-prom-conf` ConfigMap is not being consistently mounted to the `/etc/metrics/conf` directory in the driver pod.
Troubleshooting Steps Undertaken
- Verified the existence and correctness of the `spark-pi-test-prom-conf` ConfigMap in the Kubernetes cluster.
- Checked the Spark Operator and driver pod logs for any errors or warnings related to ConfigMap mounting or Prometheus configuration.
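One further check that could narrow this down is whether a failing driver pod's spec declares the ConfigMap volume and mount at all. The sketch below assumes the pod manifest has been exported as JSON (for example with `kubectl get pod <driver-pod-name> -o json`) and loaded into a dict. The function name `find_prom_conf_mount` is hypothetical. The ConfigMap name and mount path are the ones from this report.

```python
CONFIGMAP_NAME = "spark-pi-test-prom-conf"  # ConfigMap named in this report
MOUNT_PATH = "/etc/metrics/conf"            # mount directory named in this report


def find_prom_conf_mount(pod: dict,
                         configmap_name: str = CONFIGMAP_NAME,
                         mount_path: str = MOUNT_PATH) -> dict:
    """Report whether a pod spec declares the ConfigMap volume and mounts it.

    `pod` is a pod manifest parsed from JSON. Only the declared spec is
    inspected; this does not look inside the running container.
    """
    spec = pod.get("spec") or {}

    # Names of volumes backed by the expected ConfigMap.
    volume_names = {
        v["name"]
        for v in spec.get("volumes") or []
        if (v.get("configMap") or {}).get("name") == configmap_name
    }

    # Containers that mount one of those volumes at the expected path.
    mounted_in = [
        c["name"]
        for c in spec.get("containers") or []
        for m in c.get("volumeMounts") or []
        if m.get("name") in volume_names and m.get("mountPath") == mount_path
    ]

    return {"volume_declared": bool(volume_names), "mounted_in": mounted_in}
```

If a failing driver pod reports `volume_declared` as false while a healthy one reports it as true, the volume was never added to that pod's spec, which would point at whatever component is expected to add it rather than at the ConfigMap itself.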
I am seeking insights into resolving this intermittent issue with the Prometheus configuration file not being found in the Spark driver pods. Any guidance on further troubleshooting steps, potential causes, or known solutions to ensure consistent mounting of the Prometheus configuration would be greatly appreciated.
Additional Information
- I have opened another issue titled "Intermittent Sidecar Injection Failure for SparkApplication Resources," which shares similarities with this Prometheus configuration problem. However, I am unclear if both issues are due to the same underlying cause.
Issue link - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1920