
[QUESTION] Error related to webhook

Open alstjs37 opened this issue 1 year ago • 3 comments

Hello,

I've just encountered an error like this,

When I first installed the spark-operator without the --set webhook.enable=true option and ran pyspark-pi, I confirmed that it worked well.

After that, in order to mount a volume, I removed the spark-operator with helm uninstall and reinstalled it with the --set webhook.enable=true option, but now pyspark-pi no longer works.
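
For reference, such a reinstall would look roughly like the following; the release name sparkoperator and the spark-operator/spark-operator chart reference are assumptions based on the resource names below:

$ helm uninstall sparkoperator -n spark-operator
$ helm install sparkoperator spark-operator/spark-operator \
    --namespace spark-operator \
    --set webhook.enable=true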

My resources in the spark-operator namespace:

$ sudo kubectl get all -n spark-operator

NAME                                                  READY   STATUS      RESTARTS   AGE
pod/sparkoperator-spark-operator-6994c8bcfd-vns8k     1/1     Running     0          137m
pod/sparkoperator-spark-operator-webhook-init-ww2lw   0/1     Completed   0          137m

NAME                                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/sparkoperator-spark-operator-webhook   ClusterIP   10.107.69.123   <none>        443/TCP   137m

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/sparkoperator-spark-operator   1/1     1            1           137m

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/sparkoperator-spark-operator-6994c8bcfd   1         1         1       137m

NAME                                                  STATUS     COMPLETIONS   DURATION   AGE
job.batch/sparkoperator-spark-operator-webhook-init   Complete   1/1           3s         137m

When I apply the YAML file below, the SparkApplication ends up in SUBMISSION_FAILED.

Here is my pyspark-pi.yaml:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: spark-operator
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "msleedockerhub/spark-py:py3.0"
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "3.5.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.5.1
    serviceAccount: sparkoperator-spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.5.1

I'm sure there's no problem with the image I created. If the webhook is enabled, what else do I need to set up in the YAML file? A sketch of what I'm trying to add is shown below.
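
What I want to end up with is roughly the following; the volume name, hostPath, and mount paths are only placeholders for illustration:

spec:
  volumes:
    - name: test-volume            # placeholder volume definition
      hostPath:
        path: /tmp
        type: Directory
  driver:
    volumeMounts:
      - name: test-volume
        mountPath: /tmp
  executor:
    volumeMounts:
      - name: test-volume
        mountPath: /tmp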

How can I solve this problem? Please help.

alstjs37 avatar May 20 '24 15:05 alstjs37

@alstjs37 Can you run kubectl describe on your SparkApplication and provide the output? And if a driver pod was created, check the pod logs too.
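
For example, something like this, assuming the pyspark-pi app above and the default <app-name>-driver pod naming:

$ kubectl describe sparkapplication pyspark-pi -n spark-operator
$ kubectl logs pyspark-pi-driver -n spark-operator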

imtzer avatar May 21 '24 04:05 imtzer

@imtzer Thanks for your answer.

I've already checked; this is the part of the status that contains the start of the error:

Status:
  Application State:
    Error Message:  failed to run spark-submit for SparkApplication spark-operator/pyspark-pi: 24/05/21 04:51:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/05/21 04:51:06 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
24/05/21 04:51:07 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
24/05/21 04:51:07 WARN DriverCommandFeatureStep: spark.kubernetes.pyspark.pythonVersion was deprecated in Spark 3.1. Please set 'spark.pyspark.python' and 'spark.pyspark.driver.python' configurations or PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables instead.
24/05/21 04:51:48 ERROR Client: Please check "kubectl auth can-i create pod" first. It should be yes.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
  at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
  at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
  at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:44)

I ran the suggested kubectl auth can-i create pod check, and it returned yes.

The application then goes to SUBMISSION_FAILED, so no pod is created 😭

Is there anything else I should check?

alstjs37 avatar May 21 '24 05:05 alstjs37

kubectl auth can-i create pod

The kubectl auth can-i create pod hint is part of the error thrown in the Spark repo's KubernetesClientApplication.scala file when the KubernetesClient API fails to create the driver pod.
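
Also, since spark-submit runs inside the operator pod, it may be worth checking the permission as the operator's service account rather than your own user, for example (the service account name here is only a guess based on the release name):

$ kubectl auth can-i create pods -n spark-operator \
    --as=system:serviceaccount:spark-operator:sparkoperator-spark-operator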

imtzer avatar May 30 '24 09:05 imtzer

I'm facing the same issue with the version v1beta2-1.4.3-3.5.0. @alstjs37 Were you able to fix this issue?

jcunhafonte avatar Jul 01 '24 16:07 jcunhafonte

I think it is more likely that you did not create a service account with the proper role, or that no IRSA is bound to your service account.
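
For reference, the RBAC that the Spark driver service account typically needs looks roughly like this; the names and namespace follow the example above and are assumptions, and the chart normally creates equivalent objects for you:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: spark-operator
subjects:
  - kind: ServiceAccount
    name: sparkoperator-spark
    namespace: spark-operator
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io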

youngsol avatar Jul 10 '24 16:07 youngsol

I am facing the same issue with the Helm chart installed and the webhook enabled. It worked with the webhook disabled, but I need the webhook to add tolerations and volumes/volumeMounts to the driver and executor. Has anyone figured out the issue yet? When I start a bash shell in the spark-operator container, kubectl auth can-i create pods does answer yes.

dada-engineer avatar Sep 06 '24 08:09 dada-engineer

A tiny bit of context that probably helps a lot with debugging: we are running on AWS EKS, so this part of the documentation is relevant: https://www.kubeflow.org/docs/components/spark-operator/getting-started/#mutating-admission-webhooks-on-a-private-gke-or-eks-cluster

Setting the webhook.port value of the Helm chart to 443 solved the issue (this might also be the case on GCP).
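
For example, roughly (release and chart names assumed to match the ones earlier in the thread):

$ helm upgrade sparkoperator spark-operator/spark-operator \
    --namespace spark-operator \
    --set webhook.enable=true \
    --set webhook.port=443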

dada-engineer avatar Sep 06 '24 08:09 dada-engineer

This seems to have changed / been fixed in v2.0.2. We installed the chart and got a webhook error: listening on port 443 now fails with permission denied. The standard webhook port is 9443, which works well 👍🏻
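
On a 2.x chart that effectively means dropping the 443 override (or setting the port back explicitly), roughly like this, assuming the chart still exposes webhook.port:

$ helm upgrade sparkoperator spark-operator/spark-operator \
    --namespace spark-operator \
    --set webhook.port=9443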

dada-engineer avatar Oct 25 '24 13:10 dada-engineer

Starting from version 2.0.0, the webhook is enabled by default, and the webhook port defaults to 9443, which is a non-privileged port.

I will close this issue as it should be fixed now. Feel free to reopen it if you still have the problem.

ChenYi015 avatar Oct 25 '24 16:10 ChenYi015