spark-operator
SparkConf/HadoopConf from secret
As far as I can see, there is currently no way to define a Spark or Hadoop config property from a secret.
This becomes critical when the property relates to spark-submit itself, for example fetching JARs from cloud storage, where one needs to set credentials via `spark.hadoop.fs.[...]`.
Maybe this project is the wrong place to implement that; I haven't dug into the source of Spark's Kubernetes resource manager.
Can you elaborate more on this?
Accessing external systems from within Spark requires credentials most of the time, for example JDBC, AWS S3, etc. These credentials usually reside in Kubernetes secrets. To pass those properties today, we need to inject the secrets via environment variables and explicitly fetch them at the application level to pass them to the Spark configuration. Ideally they would be passed to spark-submit via `--conf` CLI options:
```shell
spark-submit --conf spark.hadoop.fs.s3a.access.key=XXXXX --conf spark.hadoop.fs.s3a.secret.key=XXXXX
```
Azure access keys are a special case because they use account names within the property key, like `spark.hadoop.fs.azure.account.key.${STORAGE_ACCOUNT_NAME}.dfs.core.windows.net`.
One solution would be to enable environment variable substitution in the code that builds the submit arguments from `sparkConf` and `hadoopConf`, and to allow defining those env vars in the SparkApplication manifest; the operator would then interpret the manifest and substitute the env vars.
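The proposed substitution could be sketched roughly as below. This is a minimal illustration in Python, not actual operator code (the operator is written in Go); the `$(VAR)` placeholder syntax mirrors the Kubernetes env var reference convention and is an assumption here.

```python
import os
import re

# $(VAR) placeholders, following the Kubernetes env var reference syntax.
_PLACEHOLDER = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")

def substitute_env_vars(conf):
    """Expand $(VAR) references in config values from the environment.

    Unresolved placeholders are left as-is, so missing variables stay visible.
    """
    def expand(value):
        return _PLACEHOLDER.sub(
            lambda m: os.environ.get(m.group(1), m.group(0)), value
        )
    return {key: expand(value) for key, value in conf.items()}

# Example: the env var would be populated from a secret via envSecretKeyRefs.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAEXAMPLE"
hadoop_conf = {"fs.s3a.access.key": "$(AWS_ACCESS_KEY_ID)"}
print(substitute_env_vars(hadoop_conf))  # {'fs.s3a.access.key': 'AKIAEXAMPLE'}
```

The operator would run this expansion just before assembling the `--conf` arguments, so the secret values never need to appear in the manifest itself.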
We support defining environment variables from secret data (https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/#define-container-environment-variables-using-secret-data). See https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-secrets-as-environment-variables.
Sure, but how can we pass the values of env vars to spark-submit via `--conf` or `spark.properties`?
We can also pass config properties to spark-submit so that the Kubernetes resource manager mounts secrets as files into the container (https://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management), but the resource manager prevents the user from setting `$SPARK_CONF_DIR` or passing `--properties-file`. As said initially, I'm not 100% sure whether this functionality should be implemented in the Spark operator or in Spark's own Kubernetes resource manager.
@rolandjohann you can use a custom Spark image that picks up the env var and sets its value in `HADOOP_CONF_DIR`, or just modify the entrypoint script...
@rolandjohann Same problem here, with the same application: attempting to inject cloud secrets into environment variables so that we can pull files (from AWS S3, in this case).
TL;DR: I need to access a Kubernetes secret within the submission runner.
You can set environment variables from secrets in the driver and executor, but you can't set them in the submission runner (as it is called here) of the operator. I'm not 100% sure on the design, but my understanding is that there is a single `SparkApplication` controller that contains the submission runner, controllers, pod monitor, etc. It is this component that needs access to the secrets, because it is this component that calls spark-submit.
Were I to run spark-submit in cluster mode from a local VM, with my main application file at an s3:// address, the local VM would need access to the secrets in order to pass them along as part of the Hadoop config.
For example, if I want to do this on a local VM:
```shell
spark-submit \
  --py-files s3://my_python_deps.zip \
  --conf spark.hadoop.fs.s3.awsAccessKeyId=access_key \
  --conf spark.hadoop.fs.s3.awsSecretAccessKey=secret_key \
  s3://my_application.py
```
then the equivalent SparkApplication spec would be something like:
```yaml
type: Python
mode: cluster
pythonVersion: "3"
image: MY_CUSTOM_IMAGE
mainApplicationFile: "s3://my_application.py"
deps:
  pyFiles:
    - "s3://my_python_deps.zip"
envSecretKeyRefs:
  AWS_ACCESS_KEY_ID:
    name: aws-group
    key: AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY:
    name: aws-group
    key: AWS_SECRET_ACCESS_KEY
hadoopConf:
  "fs.s3.awsAccessKeyId": "$(AWS_ACCESS_KEY_ID)"
  "fs.s3.awsSecretAccessKey": "$(AWS_SECRET_ACCESS_KEY)"
```
However, `envSecretKeyRefs` is only available within the driver or executor pods. I don't see any way to set env variables within the submission runner. As a result, when I execute, I get an error, and `kubectl describe` shows me:
```
Hadoop Conf:
  fs.s3.awsAccessKeyId: $(AWS_ACCESS_KEY_ID)
  fs.s3.awsSecretAccessKey: $(AWS_SECRET_ACCESS_KEY)
```
The environment variable references I am trying to set are just read as literal strings from the configuration YAML. I need to access the Kubernetes secret within the submission runner.
Hi @kingledion, did you resolve this issue? I am facing the same problem.
@liyinan926 do we have any good way to handle this?
We dropped use of this package because we never got a viable solution for about a month and moved on to other things.
Any update on this? Probably something similar to https://github.com/helm/charts/tree/master/stable/spark-history-server
I have the following configuration:

```yaml
hadoopConf:
  fs.s3a.access.key: "abc"
  fs.s3a.secret.key: "abc/def"
```
I am trying to avoid giving values as plain text.
What would be the best way forward for me? @kingledion @liyinan926
Any news?
What about this?
Define arbitrary params, pulling values from a location of your choice (here, Airflow Variables):
```python
t1 = SparkKubernetesOperator(
    ...
    params={
        "foo": "bar",
        "username": Variable.get("username"),
        "password": Variable.get("password"),
    }
)
```
Template them out in the YAML:
```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.4"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar"
  sparkVersion: "2.4.4"
  arguments:
    - "{{ params.foo }}"
    - "{{ params.username }}"
    - "{{ params.password }}"
  ...
```
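The `{{ params.* }}` rendering above is done by Airflow's Jinja templating before the manifest is submitted. As a rough illustration of that step (Airflow actually uses Jinja; `render_params` below is a simplified stand-in using a regex):

```python
import re

def render_params(template, params):
    """Replace {{ params.NAME }} placeholders with values from params."""
    pattern = re.compile(r"\{\{\s*params\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")
    return pattern.sub(lambda m: str(params[m.group(1)]), template)

manifest = 'arguments:\n  - "{{ params.username }}"\n  - "{{ params.password }}"'
print(render_params(manifest, {"username": "alice", "password": "s3cret"}))
```

Note that with this approach the rendered values end up in the submitted manifest (and thus in `kubectl describe` output), so it trades the submission-runner secret problem for secrets visible in the API object.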
Did someone find a solution for this? I am facing the same issue.
@gregorygtseng were you able to figure out the solution mentioned above?
@td-shell @ognjen-it @mixMugz Following this article, I seem to have been able to get this to work. There are two key steps:
- Enable mutating webhooks in the Helm installation (`--set webhook.enable=true`)
- Set the `hadoopConf` flag `"fs.s3a.aws.credentials.provider"` to `com.amazonaws.auth.EnvironmentVariableCredentialsProvider`
Step 1 enables environment variables to be set in the driver and executor pods, as discussed in the user guide. Without enabling the webhooks, environment variables are not passed to the pods.
Step 2 tells the AWS SDK to look for environment variables for authentication. In Step 2, you can also substitute `sparkConf` `"spark.hadoop.fs.s3a.aws.credentials.provider"` in place of the `hadoopConf` entry. The credentials provider will look for the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables in the pods, rather than in the submission runner, as @kingledion described.
I got this running on an EKS cluster, using MinIO in place of S3. Here is a gist of the SparkApplication manifest.
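For intuition, the lookup that `EnvironmentVariableCredentialsProvider` performs inside the driver and executor pods amounts to the following (the real provider is Java code in the AWS SDK; this Python sketch only mirrors its behavior):

```python
import os

def resolve_env_credentials():
    """Mimic the env var lookup of EnvironmentVariableCredentialsProvider."""
    try:
        return os.environ["AWS_ACCESS_KEY_ID"], os.environ["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"credential env var not set: {missing}") from None

# These would be injected into the pod via envSecretKeyRefs / envFrom,
# which is why the mutating webhook must be enabled.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAEXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "s3cret"
print(resolve_env_credentials())
```

This is why the approach sidesteps the submission runner entirely: the credentials are resolved at runtime inside the pods, not at spark-submit time.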
Hello @erikperkins, I had a few questions:
- Did you also have to patch the AWS JARs inside the Spark Operator itself?
- Did you have to mount the secret in the Spark Operator as well before triggering your YAML?
- How exactly are you providing the secrets inside the spark-submit YAML? I've tried giving `spec.driver.envFrom` and `spec.driver.env`, but neither seems to work for me.