oshinko-s2i
ephemeral clusters not getting deleted for jobs
While testing the build and job workflows I've run into a situation where it appears that ephemeral clusters are not getting deleted, even with the delete-cluster option set to true.
steps to reproduce
oc new-project test
oc create -f https://radanalytics.io/resources.yaml
oc create -f pysparkbuild.json
oc create -f pysparkjob.json
oc new-app --template oshinko-pyspark-build -p GIT_URI=https://github.com/radanalyticsio/s2i-integration-test-apps
oc new-app --template oshinko-pyspark-job -p IMAGE=<Docker pull spec here>
observed result
The cluster created for the job is never cleaned up, and the log output does not recognize the cluster as ephemeral.
logs
18/01/04 16:07:04 INFO SparkContext: Invoking stop() from shutdown hook
18/01/04 16:07:04 INFO SparkUI: Stopped Spark web UI at http://172.17.0.2:4040
18/01/04 16:07:04 INFO StandaloneSchedulerBackend: Shutting down all executors
18/01/04 16:07:04 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/01/04 16:07:04 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/04 16:07:04 INFO MemoryStore: MemoryStore cleared
18/01/04 16:07:04 INFO BlockManager: BlockManager stopped
18/01/04 16:07:04 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/04 16:07:04 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/01/04 16:07:04 INFO SparkContext: Successfully stopped SparkContext
18/01/04 16:07:04 INFO ShutdownHookManager: Shutdown hook called
18/01/04 16:07:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-05af0c6c-9bb4-49a7-b62d-6a67b83fc749/pyspark-a5c3ba26-8a73-4e22-b314-602f07296267
18/01/04 16:07:04 INFO ShutdownHookManager: Deleting directory /tmp/spark-05af0c6c-9bb4-49a7-b62d-6a67b83fc749
Deleting cluster 'cluster-efdcc4'
cluster is not ephemeral
cluster not deleted 'cluster-efdcc4'
The pods are never deleted:
$ oc get pods
NAME                       READY     STATUS      RESTARTS   AGE
cluster-efdcc4-m-1-l7xds   1/1       Running     0          12m
cluster-efdcc4-w-1-v7kmn   1/1       Running     0          12m
pyspark-m6va-cb82v         0/1       Completed   0          12m
pyspark-y8bl-1-build       0/1       Completed   0          30m
expected result
All cluster pods should be deleted after the job has completed.
possible cause
I think the way the $ephemeral variable is calculated in this function in the common start script is what's causing the issue here; it probably needs to account for jobs differently than it does for deployments.
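For illustration only, here is a rough sketch of the shape of the problem; the function name, label, and lookups are assumptions made up for this example, not the actual contents of the start script:

# hypothetical sketch of a deploymentconfig-based ephemeral check (illustrative names)
is_ephemeral() {
    local cluster=$1
    local flag
    # ephemeral-ness is read from a label on a deploymentconfig...
    flag=$(oc get dc "$cluster" -o jsonpath='{.metadata.labels.ephemeral}' 2>/dev/null)
    # ...but a job-launched driver has no deploymentconfig carrying that label,
    # so the lookup comes back empty and the cluster is treated as non-ephemeral
    [ "$flag" = "true" ]
}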
This is a known limitation, since ephemeral-ness is tracked via labels on deploymentconfigs.
We need another solution for jobs.
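One possible direction, sketched under the assumption that the driver's job name is available to the start script at both create and delete time (the $APP_NAME and $CLUSTER_NAME variables and the ephemeral label are made up for this example):

# hypothetical alternative: record ephemeral-ness on the job object itself
# when the cluster is created for a job-launched driver...
oc label job "$APP_NAME" ephemeral=true --overwrite

# ...and have the delete path fall back to the job's label when there is
# no deploymentconfig to check
flag=$(oc get job "$APP_NAME" -o jsonpath='{.metadata.labels.ephemeral}' 2>/dev/null)
if [ "$flag" = "true" ]; then
    echo "Deleting cluster '$CLUSTER_NAME'"
    # ...proceed with the existing cluster teardown here
fi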