spark-operator
The driver in a Scheduled Spark Application keeps getting stuck in ContainerCreating state due to a missing ConfigMap
This looks similar to this issue that was closed:
- https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/714
But I'm seeing it happen every few weeks and it doesn't recover. The ConfigMap never gets created, and the driver stays in the ContainerCreating status and never gets rescheduled. So my pipeline fails.
Example of the event warning:
- Warning FailedMount 9m12s (x453 over 15h) kubelet MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-a49d1a8a47c1c0ae-conf-map" not found
I've seen it happen with both the spark-conf-volume-driver volume and the Prometheus metrics ConfigMap mounted at /etc/metrics/conf ('xxx-prom-conf-vol').
Does anyone have an idea why these ConfigMaps might not get created?
Also, as a workaround, does anyone know of a config option to time out the driver when it is stuck in the ContainerCreating status, so that the Scheduled Spark App will hopefully launch again and create all resources as expected next time?
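For reference, this is roughly the kind of external timeout I have in mind. It is only a minimal sketch, not an operator feature: it takes pod objects in the shape returned by `kubectl get pods -o json` and returns the names of driver pods that have been waiting in ContainerCreating for longer than a timeout. The function names, the 10-minute default, and the reliance on a `spark-role: driver` label are my own assumptions.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts):
    # Assumes timestamps look like "2021-03-01T12:00:00Z" (UTC, no fractions).
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_driver_pods(pods, now, timeout=timedelta(minutes=10)):
    """Return names of driver pods stuck in ContainerCreating longer than `timeout`.

    `pods` is a list of pod dicts, e.g. the "items" of `kubectl get pods -o json`.
    The `spark-role` label check is an assumption about how drivers are labelled.
    """
    stuck = []
    for pod in pods:
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        if meta.get("labels", {}).get("spark-role") != "driver":
            continue
        if status.get("phase") != "Pending":
            continue
        # A container that cannot mount its volumes is reported as waiting
        # with reason "ContainerCreating".
        waiting = any(
            cs.get("state", {}).get("waiting", {}).get("reason") == "ContainerCreating"
            for cs in status.get("containerStatuses", [])
        )
        age = now - parse_k8s_time(meta["creationTimestamp"])
        if waiting and age > timeout:
            stuck.append(meta["name"])
    return stuck
```

A cron job could pass the returned names to `kubectl delete pod`. I have not verified that the operator starts the next scheduled run cleanly after the driver is deleted this way.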
A bit more background: the Spark job runs on a schedule every 5 minutes, processes some data, and stops after a minute or two. It should then start up again at the next scheduled time. This normally works fine for several days before it fails. We've now seen it fail due to missing ConfigMaps several times, so we are starting to doubt that it is reliable.
This is a Spark-on-Kubernetes issue, described in "Unable to Mount ConfigMap in Driver Pod". The PR "add retry config when creating Kubernetes resources" tries to resolve it by retrying, but it has not been merged, and I don't think that PR can fix the issue anyway. I encountered the same situation: the console printed 'Killed' when I ran 'spark-submit --master <k8s-url>' in the Spark operator pod. The reason the ConfigMap is not created is that it is created after the driver pod, and the 'Killed' occurs after the driver pod is created but before the ConfigMap is.
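To check whether a stuck driver is in this state (pod created, ConfigMap never created), something like the following can help. It is a rough sketch under my own assumptions: `missing_configmaps` is a hypothetical helper, the pod dict is in the shape of `kubectl get pod <name> -o json`, and the list of existing ConfigMap names has to be collected separately from the same namespace.

```python
def missing_configmaps(pod, existing_configmaps):
    """Return (volume name, ConfigMap name) pairs the pod mounts but that do not exist.

    `pod` is a pod dict as printed by `kubectl get pod <name> -o json`;
    `existing_configmaps` is an iterable of ConfigMap names in the pod's namespace.
    """
    existing = set(existing_configmaps)
    missing = []
    for vol in pod.get("spec", {}).get("volumes", []):
        cm = vol.get("configMap")
        if not cm:
            continue  # not a ConfigMap-backed volume
        if cm.get("optional", False):
            continue  # optional ConfigMaps do not block container creation
        if cm["name"] not in existing:
            missing.append((vol["name"], cm["name"]))
    return missing
```

A non-empty result for a driver pod that has been Pending for a while matches the FailedMount warning from the original report.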
Still facing this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.