spark-on-eks
Examples and custom spark images for working with the spark-on-k8s operator on AWS.
Allows using Spark 2 with IRSA and Spark 3 with IRSA and AWS Glue as a metastore.
Note: Spark 3 images also include the relevant jars for working with the S3A committers.
If you're looking for the Spark 3 custom distributions, you can find them here.
Note: Spark 2 images will not be updated; please see the FAQ.
Prerequisites
- Deploy spark-on-k8s operator using the helm chart and the patched operator image
bbenzikry/spark-eks-operator:latest
Suggested values for the helm chart can be found in the flux example.
Note: Do not create the spark service account automatically as part of the chart installation; it must be created manually and annotated for using IAM roles for service accounts on EKS.
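The suggested chart configuration can be sketched as a values fragment; the key names below are assumptions about the spark-on-k8s operator chart's layout, and the flux example in this repo remains the authoritative reference:

```yaml
# Hypothetical values.yaml fragment for the spark-on-k8s operator helm chart.
# Key names are assumptions; check them against the flux example.
operatorImageName: bbenzikry/spark-eks-operator
operatorVersion: latest
sparkJobNamespace: SPARK_JOB_NAMESPACE
serviceAccounts:
  spark:
    # create the spark service account manually (with the IRSA annotation)
    # rather than letting the chart create it
    create: false
    name: spark
```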
Creating roles and service account
- Create an AWS role for driver
- Create an AWS role for executors
- Add default service account EKS role for executors in your spark job namespace ( optional )
# NOTE: Only required when not building spark from source or using a version of spark < 3.1. In 3.1, executor roles will rely on the driver definition. At the moment they execute with the default service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: SPARK_JOB_NAMESPACE
  annotations:
    # can also be the driver role
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/executor-role"
- Make sure the spark service account (used by driver pods) is annotated with an EKS role as well
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: SPARK_JOB_NAMESPACE
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::ACCOUNT_ID:role/driver-role"
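For the annotations above to work, each AWS role also needs a trust policy that lets the cluster's OIDC provider assume it. A minimal sketch, where ACCOUNT_ID, OIDC_PROVIDER (e.g. oidc.eks.REGION.amazonaws.com/id/EXAMPLE), and the service account name are placeholders you must replace:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/OIDC_PROVIDER"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "OIDC_PROVIDER:sub": "system:serviceaccount:SPARK_JOB_NAMESPACE:spark"
        }
      }
    }
  ]
}
```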
Building a compatible image
- For spark < 3.0.0, see spark2.Dockerfile
- For spark 3.0.0+, see spark3.Dockerfile
- For pyspark, see pyspark.Dockerfile
Submit your spark application with IRSA support
Select the implementation that fits your use case.
Below are examples for the latest versions.
If you want to use pinned versions, all images are tagged by commit SHA.
You can find a full list of tags here
# spark2
FROM bbenzikry/spark-eks:spark2-latest
# spark3
FROM bbenzikry/spark-eks:spark3-latest
# pyspark2
FROM bbenzikry/spark-eks:pyspark2-latest
# pyspark3
FROM bbenzikry/spark-eks:pyspark3-latest
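A derived image usually only needs to add your application artifacts on top of one of the published images; a minimal sketch, where the jar name and destination path are hypothetical placeholders:

```dockerfile
# Hypothetical Dockerfile extending a published image with application code
FROM bbenzikry/spark-eks:spark3-latest
# Copy your application jar (or .py files for pyspark) to a path
# you later reference from your SparkApplication spec
COPY my-spark-app.jar /opt/spark/work-dir/my-spark-app.jar
```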
Submit your SparkApplication spec
hadoopConf:
  # IRSA configuration
  "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
driver:
  .....
  labels:
    .....
  serviceAccount: SERVICE_ACCOUNT_NAME
  # See: https://github.com/kubernetes/kubernetes/issues/82573
  # Note: securityContext has changed to podSecurityContext in recent versions of the operator
  podSecurityContext:
    fsGroup: 65534
Working with AWS Glue as metastore
Glue Prerequisites
- Make sure your driver and executor roles have the relevant Glue permissions
The example below depicts an IAM policy statement for accessing db1/table1; modify it to match the access your spark application needs.
{
  "Effect": "Allow",
  "Action": ["glue:*Database*", "glue:*Table*", "glue:*Partition*"],
  "Resource": [
    "arn:aws:glue:us-west-2:123456789012:catalog",
    "arn:aws:glue:us-west-2:123456789012:database/db1",
    "arn:aws:glue:us-west-2:123456789012:table/db1/table1",
    "arn:aws:glue:eu-west-1:123456789012:database/default",
    "arn:aws:glue:eu-west-1:123456789012:database/global_temp",
    "arn:aws:glue:eu-west-1:123456789012:database/parquet"
  ]
}
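When many databases and tables are involved, writing such statements by hand gets tedious; a minimal Python sketch that generates an equivalent statement (the account id, region, and table names are hypothetical placeholders):

```python
import json


def glue_policy_statement(account_id, region, database, table):
    """Build one IAM policy statement granting Spark access to a Glue table."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Effect": "Allow",
        "Action": ["glue:*Database*", "glue:*Table*", "glue:*Partition*"],
        "Resource": [
            f"{prefix}:catalog",
            f"{prefix}:database/{database}",
            f"{prefix}:table/{database}/{table}",
            # the example policy above also grants access to these databases
            f"{prefix}:database/default",
            f"{prefix}:database/global_temp",
        ],
    }


# Placeholder account/region/table values for illustration
policy = {
    "Version": "2012-10-17",
    "Statement": [glue_policy_statement("123456789012", "us-west-2", "db1", "table1")],
}
print(json.dumps(policy, indent=2))
```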
- Make sure you are using the patched operator image
- Add a config map to your spark job namespace as defined here
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: SPARK_JOB_NAMESPACE
  name: spark-custom-config-map
data:
  hive-site.xml: |-
    <configuration>
      <property>
        <name>hive.imetastoreclient.factory.class</name>
        <value>com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory</value>
      </property>
    </configuration>
Submitting your application
In order to submit an application with glue support, you need to add a reference to the configmap in your SparkApplication spec.
kind: SparkApplication
metadata:
  name: "my-spark-app"
  namespace: SPARK_JOB_NAMESPACE
spec:
  sparkConfigMap: spark-custom-config-map
Working with the spark history server on S3
- Use the appropriate spark version and deploy the helm chart
- Flux / Helm values reference here
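The history server only sees applications that write event logs to the location it reads from, so each application needs event logging enabled. A hedged SparkApplication fragment, where the bucket and prefix are placeholders that must match the history server's configured log directory:

```yaml
spec:
  sparkConf:
    "spark.eventLog.enabled": "true"
    # placeholder bucket/prefix; must match the history server's log directory
    "spark.eventLog.dir": "s3a://MY_LOG_BUCKET/spark-events"
```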
FAQ
- Where can I find a Spark 2 build with Glue support?
  As Spark 2 becomes less and less relevant, I opted not to add Glue support for it. You can take a look here for a reference build script which you can use to build a Spark 2 distribution for use with the Spark 2 dockerfile.
- Why a patched operator image?
  The patched image is a simple implementation for properly working with custom configuration files with the spark operator. It may be submitted as a PR in the future, or another implementation may take its place. For more information, see the related issue: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/216