Intermittent `FileNotFoundException` for Prometheus Configuration in Spark Driver Pods

Open potlurip opened this issue 1 year ago • 0 comments

I am experiencing an intermittent issue with the Spark Operator's monitoring feature when it is configured to expose metrics to Prometheus. Occasionally, the Spark driver fails to start with a FileNotFoundException for the Prometheus configuration file: /etc/metrics/conf/prometheus.yaml is not found, even though the relevant ConfigMap (spark-pi-test-prom-conf) exists in the cluster. The failure affects only some application deployments, not all of them.

Environment

  • Spark Operator Version: latest version
  • Kubernetes Version: 1.25
  • Cloud Provider: AWS - EKS
  • Installation Method: Helm chart

Configuration

Monitoring is enabled with the following configuration in the Spark Operator, aiming to expose both driver and executor metrics to Prometheus:

monitoring:
  exposeDriverMetrics: true
  exposeExecutorMetrics: true
  prometheus:
    jmxExporterJar: "/prometheus/jmx_prometheus_javaagent.jar"
    port: 8090

Error Observed

Caused by: java.io.FileNotFoundException: /etc/metrics/conf/prometheus.yaml (No such file or directory)

Steps to Reproduce

  1. Enable monitoring in the Spark Operator with Prometheus metrics exposure as described above.
  2. Deploy a Spark application managed by the Spark Operator.
  3. Observe the startup logs of the Spark driver pod for the intermittent FileNotFoundException related to the Prometheus configuration.
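Because the failure is intermittent, it can help to quantify how often it occurs across several runs. The following is a minimal sketch, assuming the driver logs of each run have already been saved as text (for example with `kubectl logs`); the function names and the sample run names are hypothetical and only serve the illustration.

```python
import re

# Matches the exception line quoted in "Error Observed".
ERROR_PATTERN = re.compile(
    r"java\.io\.FileNotFoundException: /etc/metrics/conf/prometheus\.yaml"
)


def hit_missing_config(log_text: str) -> bool:
    """Return True if the driver log contains the missing-config exception."""
    return ERROR_PATTERN.search(log_text) is not None


def failure_count(logs: dict[str, str]) -> tuple[int, int]:
    """Return (failed_runs, total_runs) for a mapping of run name -> log text."""
    failed = sum(hit_missing_config(text) for text in logs.values())
    return failed, len(logs)


if __name__ == "__main__":
    # Hypothetical sample logs: one failing run, one healthy run.
    sample_logs = {
        "run-1": "Caused by: java.io.FileNotFoundException: "
                 "/etc/metrics/conf/prometheus.yaml (No such file or directory)",
        "run-2": "driver started normally",
    }
    failed, total = failure_count(sample_logs)
    print(f"{failed} of {total} runs hit the missing-config error")
```

A failure ratio collected this way makes it easier to tell whether a configuration change actually reduces the problem or the next few runs simply happened to succeed.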

Expected Behavior

The Spark driver and executor pods should successfully mount the Prometheus configuration from the spark-pi-test-prom-conf ConfigMap, start without errors, and expose metrics to Prometheus on port 8090.

Actual Behavior

Intermittently, the Spark driver pod fails to start due to a FileNotFoundException for /etc/metrics/conf/prometheus.yaml. This suggests that the spark-pi-test-prom-conf ConfigMap is not being consistently mounted to the /etc/metrics/conf directory in the driver pod.

Troubleshooting Steps Undertaken

  • Verified the existence and correctness of the spark-pi-test-prom-conf ConfigMap in the Kubernetes cluster.
  • Checked the Spark Operator and driver pod logs for any errors or warnings related to ConfigMap mounting or Prometheus configuration.
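One further check that may narrow this down is to compare the pod specs of a failing and a healthy driver: if the ConfigMap volume is missing from the failing pod's spec, the volume was never added to the pod; if it is present, the problem lies elsewhere. The sketch below assumes the pod manifest has been exported as JSON (for example with `kubectl get pod <driver-pod> -o json`) and loaded into a dict. It relies only on the standard Pod fields `spec.volumes[].configMap.name` and `spec.containers[].volumeMounts[]`; the function name and the sample pods are hypothetical.

```python
from typing import Any


def configmap_mount_status(
    pod: dict[str, Any], configmap_name: str, mount_path: str
) -> dict[str, bool]:
    """Report whether a pod manifest declares a ConfigMap volume and mounts it."""
    spec = pod.get("spec", {})

    # Names of pod-level volumes backed by the given ConfigMap.
    volume_names = {
        volume["name"]
        for volume in spec.get("volumes", [])
        if (volume.get("configMap") or {}).get("name") == configmap_name
    }

    # True if any container mounts one of those volumes at the expected path.
    mounted = any(
        mount.get("name") in volume_names and mount.get("mountPath") == mount_path
        for container in spec.get("containers", [])
        for mount in container.get("volumeMounts", [])
    )
    return {"volume_declared": bool(volume_names), "mounted_at_path": mounted}


if __name__ == "__main__":
    # Hypothetical, heavily trimmed driver pod manifest for illustration.
    # The volume name "prom-conf-volume" is made up for this example.
    healthy_pod = {
        "spec": {
            "volumes": [
                {"name": "prom-conf-volume",
                 "configMap": {"name": "spark-pi-test-prom-conf"}}
            ],
            "containers": [
                {"name": "spark-kubernetes-driver",
                 "volumeMounts": [
                     {"name": "prom-conf-volume",
                      "mountPath": "/etc/metrics/conf"}
                 ]}
            ],
        }
    }
    print(configmap_mount_status(
        healthy_pod, "spark-pi-test-prom-conf", "/etc/metrics/conf"))
```

Running this against the JSON of a failing driver pod and a healthy one shows whether the two differ in their declared volumes or only in runtime behavior.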

I am seeking insight into resolving this intermittent issue with the Prometheus configuration file not being found in the Spark driver pods. Any guidance on further troubleshooting steps, potential causes, or known solutions to ensure consistent mounting of the Prometheus configuration would be greatly appreciated.

Additional Information

  • I have opened another issue titled "Intermittent Sidecar Injection Failure for SparkApplication Resources," which resembles this Prometheus configuration problem. However, I am not sure whether both issues share the same underlying cause.

Issue link - https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1920

potlurip Feb 15 '24 03:02