
Volume name too long (must be no more than 63 characters)

Open · robertogyn19 opened this issue 2 years ago · 4 comments

Hi all,

I'm using the spark-operator Helm chart (chart version spark-operator-1.1.11, app version v1beta2-1.2.3-3.1.1), and I'm getting this error with a ScheduledSparkApplication.

failed to run spark-submit for SparkApplication bigdata-spark/storage-academy-argos-new-schema-1655202518288923417: WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/06/14 10:33:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/06/14 10:33:39 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/06/14 10:33:40 WARN DriverServiceFeatureStep: Driver's hostname would preferably be storage-academy-argos-new-schema-1655202518288923417-c4b2938161c6af28-driver-svc, but this is too long (must be <= 63 characters). Falling back to use spark-31e8518161c6b089-driver-svc as the driver service's name.
22/06/14 10:33:40 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
Exception in thread \"main\" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.92.0.1/api/v1/namespaces/bigdata-spark/pods. Message: Pod \"storage-academy-argos-new-schema-1655202518288923417-driver\" is invalid: [spec.volumes[5].name: Invalid value: \"storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol\": must be no more than 63 characters, spec.containers[0].volumeMounts[5].name: Not found: \"storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol\"]. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.volumes[5].name, message=Invalid value: \"storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol\": must be no more than 63 characters, reason=FieldValueInvalid, additionalProperties={}), StatusCause(field=spec.containers[0].volumeMounts[5].name, message=Not found: \"storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol\", reason=FieldValueNotFound, additionalProperties={})], group=null, kind=Pod, name=storage-academy-argos-new-schema-1655202518288923417-driver, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Pod \"storage-academy-argos-new-schema-1655202518288923417-driver\" is invalid: [spec.volumes[5].name: Invalid value: \"storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol\": must be no more than 63 characters, spec.containers[0].volumeMounts[5].name: Not found: \"storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol\"], metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:589)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:528)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:492)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:451)
  at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:252)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:879)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:341)
  at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:84)
  at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:139)
  at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
  at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2611)
  at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
  at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
  at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
  at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/06/14 10:33:41 INFO ShutdownHookManager: Shutdown hook called
22/06/14 10:33:41 INFO ShutdownHookManager: Deleting directory /tmp/spark-5dedfe5f-637a-47a5-9050-eed3d4f1c96e

I know the name storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol is longer than 63 characters, but is there any way to override this long suffix (UnixNano-prom-conf-vol)?
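
For concreteness, here is the arithmetic behind the rejection (a minimal sketch, assuming only the 63-character DNS-1123 label limit that Kubernetes enforces on volume names):

package main

import "fmt"

func main() {
    // The rejected volume name from the error above.
    name := "storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol"

    // Kubernetes validates volume names as DNS-1123 labels: 63 chars max.
    const maxLabelLen = 63

    // Prints: len=66 (3 over the limit)
    fmt.Printf("len=%d (%d over the limit)\n", len(name), len(name)-maxLabelLen)
}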

robertogyn19 · Jun 14 '22 14:06

@robertogyn19 Why and where was the storage created? Assuming that you are running Spark 3.1.1, the issue stems from apache/spark here: https://github.com/apache/spark/blob/v3.1.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L139

I think this is not a spark-operator issue but an apache/spark issue.

But overall I think the issue stems from your declaration file. Can you please share it?

tafaust · Jul 05 '22 09:07

Hi @tahesse, I created a declaration to reproduce this error. I think this only applies to scheduled applications.

It seems the controller builds the application name from the schedule name plus a time.UnixNano suffix here. Later on, the suffix "-prom-conf" is appended to the application name, and finally "-vol" is appended to form the Prometheus volume name.

Maybe the same logic applied here could be used for the Prometheus configuration?
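
To make the naming concrete, here is a rough sketch of those steps as I understand them (the identifiers are illustrative, not the operator's actual ones):

package main

import (
    "fmt"
    "time"
)

func main() {
    scheduleName := "storage-academy-argos-new-schema"

    // Step 1: the ScheduledSparkApplication controller derives each run's
    // application name from the schedule name plus a UnixNano timestamp.
    appName := fmt.Sprintf("%s-%d", scheduleName, time.Now().UnixNano())

    // Step 2: the Prometheus ConfigMap name appends "-prom-conf".
    configMapName := appName + "-prom-conf"

    // Step 3: the volume name appends "-vol" on top of that.
    volumeName := configMapName + "-vol"

    // A 19-digit UnixNano plus both suffixes adds 34 characters, so any
    // schedule name longer than 29 characters overflows the 63-char limit.
    fmt.Println(volumeName, len(volumeName))
}

Here is the declaration that reproduces the error: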

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
  name: spark-pi-prometheus-with-a-long-name
  namespace: spark
spec:
  schedule: "@every 1m"
  concurrencyPolicy: Forbid
  template:
    type: Scala
    mode: cluster
    image: "gcr.io/spark-operator/spark:v3.0.0-gcs-prometheus"
    imagePullPolicy: IfNotPresent
    mainClass: org.apache.spark.examples.SparkPi
    mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
    arguments:
      - "1000"
    sparkVersion: "3.1.1"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      coreLimit: "1200m"
      memory: "512m"
      labels:
        version: 3.1.1
      serviceAccount: spark
    executor:
      cores: 1
      instances: 1
      memory: "512m"
      labels:
        version: 3.1.1
    monitoring:
      exposeDriverMetrics: true
      exposeExecutorMetrics: true
      prometheus:
        jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.11.0.jar"
        port: 8090

I'm testing with the v3.0.0-gcs-prometheus image because v3.1.1-gcs-prometheus doesn't exist in gcr.io/spark-operator/spark.

robertogyn19 · Jul 09 '22 11:07

@robertogyn19 I haven't had time to test yet, sorry. So why don't you just replace the UnixNano with a short unique SHA? Alternatively, you could truncate the string parts to satisfy the 63-character constraint while retaining uniqueness; see https://dev.to/derfenix/comment/1hp14 for string truncation in Go. The maintainers are slow to respond to PRs, so I'd suggest opening a PR and building the operator on your side in the meantime.
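
A minimal sketch of that truncate-while-staying-unique idea (my own illustration, not spark-operator code): cap the name at 63 characters and replace the tail with a short hash of the full name, so two distinct long names can never collide after truncation:

package main

import (
    "crypto/sha256"
    "fmt"
)

const maxLabelLen = 63 // Kubernetes DNS-1123 label limit

// truncateUnique caps name at maxLabelLen, swapping the tail for a short
// hash of the full string so different long names stay distinct.
func truncateUnique(name string) string {
    if len(name) <= maxLabelLen {
        return name
    }
    sum := sha256.Sum256([]byte(name))
    suffix := fmt.Sprintf("-%x", sum[:4]) // "-" plus 8 hex chars
    return name[:maxLabelLen-len(suffix)] + suffix
}

func main() {
    long := "storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol"
    fmt.Println(truncateUnique(long)) // exactly 63 chars, unique per input
}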

AFAIK patch.go works completely differently, as it only does JSON patch ops (e.g. to inject owner refs). I did some work on VolumeMounts (PR still dangling), and the patching for e.g. SPARK_CONF doesn't work at the moment.

tafaust · Jul 11 '22 18:07

This happened with a normal (non-scheduled) SparkApplication for me. Considering that the operator already automatically handles overly long pod names, I think it's reasonable to ask that it handle volume names too.
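
For anyone who wants to catch this before submitting, a small sketch using the validator from k8s.io/apimachinery, which (to my understanding) applies the same DNS-1123 label rule the API server enforced in the error above:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/util/validation"
)

func main() {
    name := "storage-academy-argos-new-schema-1655202518288923417-prom-conf-vol"
    // IsDNS1123Label returns a list of violations; empty means valid.
    for _, msg := range validation.IsDNS1123Label(name) {
        fmt.Println(msg) // e.g. "must be no more than 63 characters"
    }
}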

Dlougach · Oct 26 '23 12:10