SparkConf/HadoopConf from secret

Open rolandjohann opened this issue 5 years ago • 16 comments

As far as I can see, there is currently no way to define Spark or Hadoop config properties from secrets.

This becomes critical when the property is needed by spark-submit itself, for example fetching JARs from cloud storage, where one needs to set credentials via spark.hadoop.fs.[...].

Maybe this project is the wrong place to implement that; I haven't dug into the source of the Spark Kubernetes resource manager.

rolandjohann avatar Aug 05 '19 14:08 rolandjohann

Can you elaborate more on this?

liyinan926 avatar Aug 07 '19 21:08 liyinan926

Accessing external systems from within Spark requires credentials most of the time - for example JDBC, AWS S3, etc. These credentials usually reside in Kubernetes secrets. To pass those properties today, we need to inject the secrets as environment variables and explicitly fetch them at the application level to put them into the Spark configuration. Ideally they would be passed to spark-submit via --conf CLI options:

spark-submit --conf spark.hadoop.fs.s3a.access.key=XXXXX --conf spark.hadoop.fs.s3a.secret.key=XXXXX

Azure access keys are a special case because they use account names within the property key like spark.hadoop.fs.azure.account.key.${STORAGE_ACCOUNT_NAME}.dfs.core.windows.net.
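
For illustration, such an entry in hadoopConf would look roughly like this (mystorageaccount is a placeholder account name):

hadoopConf:
  # the storage account name is part of the property key itself; "mystorageaccount" is a placeholder
  "fs.azure.account.key.mystorageaccount.dfs.core.windows.net": "<account key, ideally taken from a secret>"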

One solution would be to enable environment variable substitution in the code that builds the submit arguments from sparkConf and hadoopConf, and to allow defining those env vars in the SparkApplication manifest so that the operator substitutes them.

rolandjohann avatar Aug 13 '19 11:08 rolandjohann

We support defining environment variables from secret data (https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/#define-container-environment-variables-using-secret-data). See https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-secrets-as-environment-variables.
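
Per that user guide section, a minimal sketch looks roughly like this (mysecret, username and password are placeholder names):

spec:
  driver:
    envSecretKeyRefs:
      # expose the secret's "username" key as SECRET_USERNAME in the driver pod
      SECRET_USERNAME:
        name: mysecret
        key: username
      SECRET_PASSWORD:
        name: mysecret
        key: password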

liyinan926 avatar Aug 13 '19 18:08 liyinan926

Sure, but how can we pass the values of env vars to spark-submit via --conf or spark.properties?

We can also pass config properties to spark-submit telling the Kubernetes resource manager to mount secrets as files into the container (https://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management), but the resource manager prevents the user from setting $SPARK_CONF_DIR or passing --properties-file. As said initially, I'm not 100% sure whether this functionality should be implemented within the Spark operator or within the Kubernetes resource manager of the Spark project itself.
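
For reference, the secret-mounting properties from the linked Spark docs would look roughly like this when set through a SparkApplication's sparkConf (my-spark-secret and /etc/secrets are placeholders):

sparkConf:
  # mount the Kubernetes secret "my-spark-secret" at /etc/secrets in the driver and executor pods
  "spark.kubernetes.driver.secrets.my-spark-secret": "/etc/secrets"
  "spark.kubernetes.executor.secrets.my-spark-secret": "/etc/secrets"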

rolandjohann avatar Aug 14 '19 16:08 rolandjohann

@rolandjohann you can use a custom Spark image that picks up the env var and sets HADOOP_CONF_DIR to that value, or just modify the entrypoint script...

skonto avatar Sep 04 '19 08:09 skonto

@rolandjohann Same problem here, with the same application: attempting to inject cloud secrets into environment variables so that we can pull files (from AWS, using S3, in this case).

TLDR: I need to access a kubernetes secret within the submission runner.

You can set environment variables from secrets in the driver and executor, but you can't set them in the submission runner (as it is called here) of the operator. I'm not 100% sure about the design, but my understanding is that there is a single 'sparkapplication'-type resource that contains the submission runner, controllers, pod monitor, etc. It is this resource that needs access to the secrets, because it is this resource that calls spark-submit.

Were I to run spark-submit in cluster mode from a local VM, with my main application file at an s3:// address, then the local VM would need access to the secrets in order to pass them along as part of the Hadoop config.

For example, if I wanted to do this on a local VM:

spark-submit \
    --py-files s3://my_python_deps.zip \
    --conf spark.hadoop.fs.s3.awsAccessKeyId=access_key \
    --conf spark.hadoop.fs.s3.awsSecretAccessKey=secret_key \
    s3://my_application.py

then the equivalent would be something like

  type: Python
  mode: cluster
  pythonVersion: "3"
  image: MY_CUSTOM_IMAGE
  mainApplicationFile: "s3://my_application.py"
  deps:
    pyFiles:
      - "s3://my_python_deps.zip"
  envSecretKeyRefs:
    AWS_ACCESS_KEY_ID:
      name: aws-group
      key: AWS_ACCESS_KEY_ID
    AWS_SECRET_ACCESS_KEY:
      name: aws-group
      key: AWS_SECRET_ACCESS_KEY
  hadoopConf:
    "fs.s3.awsAccessKeyId": "$(AWS_ACCESS_KEY_ID)"
    "fs.s3.awsSecretAccessKey": "$(AWS_SECRET_ACCESS_KEY)"

However, `envSecretKeyRefs` is only available within the driver or executor pods. I don't see any way to set environment variables within the submission runner. As a result, when I execute, I get an error, and kubectl describe shows me:

  Hadoop Conf:
    fs.s3.awsAccessKeyId:      $(AWS_ACCESS_KEY_ID)
    fs.s3.awsSecretAccessKey:  $(AWS_SECRET_ACCESS_KEY)

The environment variables that I am trying to set are just read as strings within the configuration .yaml. I need to access the kubernetes secret within the submission runner.

kingledion avatar Oct 21 '19 20:10 kingledion

Hi @kingledion, did you resolve this issue? I am facing the same problem.

@liyinan926 do we have any good way to handle this?

sshah90 avatar Jun 16 '20 14:06 sshah90

We dropped this package after going about a month without a viable solution and moved on to other things.

kingledion avatar Jun 16 '20 19:06 kingledion

Any update on this? Probably something similar to https://github.com/helm/charts/tree/master/stable/spark-history-server

I have the following configuration:

hadoopConf:
  fs.s3a.access.key: "abc"
  fs.s3a.secret.key: "abc/def"

I am trying to avoid giving values as plain text.

What would be the best way forward for me? @kingledion @liyinan926

suchitgupta01 avatar Oct 21 '20 15:10 suchitgupta01

Any news?

mixMugz avatar Feb 17 '22 16:02 mixMugz

What about this?

Define arbitrary params, pulling the values from the variable store of your choice:

t1 = SparkKubernetesOperator(
    ...
    params={
        "foo": "bar",
        "username": Variable.get("username"),
        "password": Variable.get("password"),
    }
)

Template them out in .yaml

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.4"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar"
  sparkVersion: "2.4.4"
  arguments:
  - {{ params.foo }}
  - {{ params.username }}
  - {{ params.password }}
...

gregorygtseng avatar May 24 '22 22:05 gregorygtseng

Did anyone find a solution for this? I am facing the same issue.

ognjen-it avatar Jul 27 '22 13:07 ognjen-it

@gregorygtseng were you able to figure out the solution mentioned above?

td-shell avatar Oct 03 '22 18:10 td-shell

@td-shell @ognjen-it @mixMugz Following this article, I seem to have been able to get this to work. There are two key steps:

  1. Enable mutating webhooks in the Helm installation (--set webhook.enable=true)
  2. Set the hadoopConf flag "fs.s3a.aws.credentials.provider" to com.amazonaws.auth.EnvironmentVariableCredentialsProvider

Step 1 enables environment variables to be set in the driver and executor pods, as discussed in the user guide. Without enabling the webhooks, environment variables are not passed to the pods.
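
In values-file form, assuming the chart key mirrors the --set flag above, step 1 amounts to:

webhook:
  # required so the operator's mutating webhook can inject env vars into driver/executor pods
  enable: true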

Step 2 tells the AWS SDK to look for environment variables for authentication. In Step 2, you can also substitute sparkConf "spark.hadoop.fs.s3a.aws.credentials.provider" in place of the hadoopConf. The credentials provider will look for the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables in the pods, rather than in the submission runner, as @kingledion described.
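
A minimal sketch of how the two steps might fit together in a SparkApplication spec (aws-secret and its key names are placeholders; the executor block would mirror the driver):

spec:
  hadoopConf:
    # tell the S3A connector to read credentials from environment variables
    "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
  driver:
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-secret
        key: AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY:
        name: aws-secret
        key: AWS_SECRET_ACCESS_KEY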

I got this running on an EKS cluster, using MinIO in place of S3. Here is a gist of the SparkApplication manifest.

erikperkins avatar Mar 16 '23 00:03 erikperkins

Hello @erikperkins, I have a few questions:

  1. Did you also have to patch the AWS JARs inside the Spark Operator itself?
  2. Did you have to mount the secret in the Spark Operator as well before triggering your YAML?
  3. How exactly are you providing the secrets inside the Spark submit YAML? I've tried giving spec.driver.envFrom and spec.driver.env, but neither seems to work for me.

Sparker0i avatar Mar 26 '23 10:03 Sparker0i