spark-operator
Spark operator using local storage for shuffle test
We want to use a PersistentVolumeClaim (PVC) as local storage for shuffling data in the executors.
We have tried the options given in the link below, but the executors are not writing any shuffle data to the local storage (mounted as a PVC): https://stackoverflow.com/questions/73567658/using-kubernetes-volumes-as-local-spark-directory-for-executors-to-spill-on
We have the following variables and configuration set up.
spark-env.sh
```
#export SPARK_LOCAL_DIRS=/opt/shuffletest   # <- tried adding the PVC path here
```

The directory exists and is world-writable inside the image:

```
scpbtcorp@55c615b37dcb:/opt$ ls -ld shuffletest
drwxrwxrwx 1 scpbtcorp scpbcorp 4096 Feb 2 06:39 shuffletest
scpbtcorp@55c615b37dcb:/opt$
```
spark-defaults.conf
```
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=spark-sc
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=25Gi
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/opt/shuffletest
spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=spark-sc
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=25Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/opt/shuffletest
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
spark.local.dir /opt/shuffletest
```
We also tried the option below in the Spark YAML file. We are using the Spark operator to launch the driver and executor pods.

```
--conf spark.executorEnv.SPARK_LOCAL_DIRS=/opt/shuffletest
```
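For reference, the relevant part of our SparkApplication manifest looks roughly like the sketch below. The structure follows the v1beta2 CRD, but the name, image, and main-class fields are placeholders rather than our real values. As we understand it, a PVC mounted under a volume name starting with `spark-local-dir-` should be picked up by Spark as a local directory automatically.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: shuffle-test                 # placeholder name
spec:
  type: Scala
  mode: cluster
  image: our-spark-image:latest      # placeholder image
  mainClass: our.Main                # placeholder
  mainApplicationFile: local:///opt/app/our-app.jar  # placeholder
  sparkConf:
    # Volumes whose name starts with "spark-local-dir-" are treated as
    # Spark local storage; claimName=OnDemand provisions a fresh PVC
    # per executor pod.
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "spark-sc"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "25Gi"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/opt/shuffletest"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
  executor:
    instances: 2
```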
We are getting the following error:
```
24/02/02 06:49:20 ERROR Utils: Failed to create directory /opt/shuffletest/blockmgr-0b9c62e7-342e-4415-a5b4-febb64924294
java.nio.file.AccessDeniedException: /opt/shuffletest/blockmgr-0b9c62e7-342e-4415-a5b4-febb64924294
	at java.base/sun.nio.fs.UnixException.translateToIOException(Unknown Source)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
	at java.base/sun.nio.fs.UnixFileSystemProvider.createDirectory(Unknown Source)
	at java.base/java.nio.file.Files.createDirectory(Unknown Source)
	at java.base/java.nio.file.Files.createAndCheckIsDirectory(Unknown Source)
	at java.base/java.nio.file.Files.createDirectories(Unknown Source)
	at org.apache.spark.util.Utils$.createDirectory(Utils.scala:324)
	at org.apache.spark.storage.DiskBlockManager.$anonfun$createLocalDirs$1(DiskBlockManager.scala:252)
```
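Our suspicion is that this AccessDeniedException is a permissions mismatch between the user our image runs as (scpbtcorp) and the owner of the freshly provisioned PVC, in which case a pod-level fsGroup might be needed. A sketch of what we believe that would look like is below; the podSecurityContext field name may differ across operator versions, and the group id 185 is an assumption taken from the stock Apache Spark images.

```yaml
spec:
  driver:
    podSecurityContext:
      fsGroup: 185   # assumed gid; stock apache/spark images run as uid/gid 185
  executor:
    podSecurityContext:
      fsGroup: 185   # fsGroup makes Kubernetes set group ownership on the mounted PVC
```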
We also get the following errors while running the shuffle job:

```
TaskSetManager: Finished task 206.10 in stage 6.0 (TID 5183) in 4853 ms on 192.168.22.19 (executor 90) (246/1167)
TaskSetManager: Starting task 111.3 in stage 6.0 (TID 5200) (192.168.11.43, executor 87, partition 111, ANY, 5002 bytes) taskResourceAssignments Map()
TaskSetManager: Lost task 56.5 in stage 6.0 (TID 5185) on 192.168.11.43, executor 87: java.io.IOException (No space left on device)
```

and the executor pods are evicted with:

```
The node was low on resource: ephemeral-storage. Container spark-kubernetes-executor was using 768Ki, which exceeds its request of 0
```
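The ephemeral-storage eviction suggests that shuffle files are still landing on the container's writable layer rather than on the PVC. To confirm where /opt/shuffletest is actually mounted, we can inspect a running executor pod with something like the commands below (the pod name shuffle-test-exec-1 is hypothetical):

```sh
# If df reports the 25Gi PVC filesystem, spills go to the PVC; if it
# reports the node's root filesystem, writes count against ephemeral
# storage, which would match the eviction message above.
kubectl exec shuffle-test-exec-1 -- df -h /opt/shuffletest
kubectl exec shuffle-test-exec-1 -- ls -ld /opt/shuffletest

# Check that the volume and mount actually appear in the pod spec.
kubectl describe pod shuffle-test-exec-1 | grep -i -A2 shuffletest
```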