Unable to Use Environment Variable and Custom Configuration in Spark Operator for Spark Application
Issue Description:
I'm encountering an issue where I'm unable to utilize an environment variable that contains a password within my Spark application when deploying with the Spark Operator.
Approaches Taken:
Environment Variable Approach:
- Using an environment variable in the Spark configuration:
  - I tried to pass the password as an environment variable and then attempted to access this variable within the Spark configuration.
  - Example:

    ```yaml
    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: example-spark-app
      namespace: default
    spec:
      ...
      driver:
        env:
          - name: PASSWORD
            valueFrom:
              secretKeyRef:
                name: my-secret
                key: password
      executor:
        env:
          - name: PASSWORD
            valueFrom:
              secretKeyRef:
                name: my-secret
                key: password
      sparkConf:
        "spark.myapp.password": "$(PASSWORD)"
      ...
    ```

  - The Spark application was unable to resolve and use the password in the configuration.
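For what it's worth, a commonly suggested alternative is to mount the Secret as files rather than interpolating it into `sparkConf`. A minimal sketch, assuming the v1beta2 `secrets` field (which relies on the mutating webhook) and a hypothetical mount path `/mnt/secrets`:

```yaml
spec:
  driver:
    secrets:
      - name: my-secret      # same Secret as above
        path: /mnt/secrets   # hypothetical mount path
        secretType: Generic
  executor:
    secrets:
      - name: my-secret
        path: /mnt/secrets
        secretType: Generic
```

The application would then read the password from `/mnt/secrets/password` at runtime instead of from a Spark conf key.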
Alternative Approaches:
- Adding `spark-defaults.conf` in `/opt/spark/conf`:
  - I added a `spark-defaults.conf` file in the `/opt/spark/conf` directory, with a single property for the password, using an init container.
  - Example content:

    ```
    spark.myapp.password=my_password_value
    ```

  - The Spark Operator is overwriting the conf dir.
- Using a Spark ConfigMap:
  - I created a ConfigMap containing the Spark configuration and specified it using `spec.sparkConfigMap`.
  - Example ConfigMap:

    ```yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: my-spark-config
      namespace: default
    data:
      spark-defaults.conf: |
        spark.myapp.password=my_password_value
    ```

  - The ConfigMap successfully created a configuration file in the `/etc/spark/conf` directory, but the application continued to read configurations from `/opt/spark/conf/spark.properties`, ignoring the file generated by the ConfigMap.
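For completeness, the spec side of this approach looked roughly like this (a sketch; `spec.sparkConfigMap` takes the name of the ConfigMap):

```yaml
spec:
  sparkConfigMap: my-spark-config
```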
Expected Behavior:
- The Spark application should be able to read and use the environment variable in the Spark configuration.
- Alternatively, it should be able to read the configurations from the ConfigMap file in the `/etc/spark/conf` directory.
Actual Behavior:
- The Spark application ignores the environment variable set in the configuration.
- The application ignores the configuration file created by the ConfigMap in `/etc/spark/conf`.
Potential Solution:
I considered the option of creating an additional Spark configuration file and configuring the Spark application to use both configuration files. However, I'm unable to find a way to achieve this within the Spark Operator setup.
@potlurip Could you provide more detailed information, e.g. the Helm chart version and how you installed the chart? Did you enable the webhook server?
Helm Chart Version:
- Version: v1beta2-1.3.8-3.1.1
- The Spark Operator was installed using Helm with the following command:
`helm install`
Webhook Server:
- Yes, the webhook server was enabled.
Kubernetes Version:
- Version: 1.26
The Spark Operator internally calls spark-submit, and it should be noted that environment-variable substitution in Spark conf values is not supported by spark-submit. One possible approach is to hard-code the application password into `spec.sparkConf`, although this method is not secure. Alternatively, you could modify your application to fetch the password from an environment variable at runtime, which is a more secure practice.
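To illustrate the second suggestion, here is a minimal sketch of the application-side approach: read the credential from the pod environment at runtime instead of interpolating it into `spec.sparkConf`. The class and helper names are illustrative, not from the thread, and the Spark-specific part is shown only as a comment so the snippet stays self-contained:

```java
// Sketch: resolve a credential from the pod environment at runtime.
public class EnvConfig {

    // Return the value of an environment variable, or a fallback if unset/empty.
    static String fromEnv(String name, String fallback) {
        String value = System.getenv(name);
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    public static void main(String[] args) {
        // In the driver pod, PASSWORD would come from the secretKeyRef env entry.
        String password = fromEnv("PASSWORD", "<unset>");
        System.out.println("PASSWORD resolved from environment: " + !password.equals("<unset>"));
        // In the actual Spark application, pass it to the session at build time, e.g.:
        // SparkSession.builder().config("spark.myapp.password", password).getOrCreate();
    }
}
```

This sidesteps the spark-submit limitation entirely, because the env var is read inside the JVM after the pod has started, not substituted into the submission arguments.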
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I'm facing the exact same problem. I even tried a much simpler approach:
```yaml
driver:
  cores: 1
  memory: 512m
  labels:
    version: 3.5.3
  serviceAccount: spark-operator-spark
  env:
    - name: TEST
      value: "hello world"
```
And I cannot access this environment variable from within my Java Spark Application. Any ideas?
I'm also facing this problem 👀
This issue should receive more attention: the lack of env var handling for credentials is critical for production Kubernetes applications.
I've spent some time debugging the code, and it seems that `pod.Namespace` is empty when the env variables are being added. Because of that, the following check in https://github.com/kubeflow/spark-operator/blob/v2.1.0/internal/webhook/sparkpod_defaulter.go#L76-L79 is true:

```go
namespace := pod.Namespace
if !d.isSparkJobNamespace(namespace) {
	return nil
}
```

and the webhook does not mutate the env variables within the pod. I managed to fix the issue for the default namespace by hard-coding the value whenever the namespace is empty; however, this needs a closer look to understand why the pod does not have its namespace available at the point in time when the webhook is triggered.
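Relatedly, since the check above skips pods whose namespace is not in the operator's job-namespace list, it may be worth verifying that the application's namespace is actually listed there. A sketch of a Helm values excerpt, assuming the v2.x chart layout (the exact key is an assumption and differs between chart versions; older charts used `sparkJobNamespace`):

```yaml
# Hypothetical values.yaml excerpt for the kubeflow/spark-operator Helm chart
spark:
  jobNamespaces:
    - default
```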
I've deployed my spark-application into another k8s cluster (not local) and env variables seem to work ok there. I was using Rancher Desktop in my Macbook M1. I wonder if my local setup influenced the behaviour of the spark-operator.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.