versatile-data-kit
Improve the robustness of data job deployments
Describe the bug Occasionally, data job deployments fail due to connectivity issues during the Kubernetes watch. This watch is initiated by the Control Service in order to be notified when the builder job completes. The error can be seen below.
Improve the robustness of the watch mechanism by introducing error handling with retries.
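A minimal sketch of what such retry handling could look like is shown below. The WatchRetry class name, the attempt count, and the backoff value are illustrative assumptions, not part of the Control Service codebase; only io.kubernetes.client.ApiException is taken from the stack trace below.

```java
import io.kubernetes.client.ApiException;

import java.util.concurrent.Callable;

/**
 * Sketch of a retry wrapper for watch initiation. Names and limits are
 * illustrative assumptions; maxAttempts is assumed to be at least 1.
 */
public final class WatchRetry {

    private WatchRetry() {
    }

    public static <T> T withRetries(Callable<T> initiateWatch, int maxAttempts, long backoffMillis)
            throws Exception {
        ApiException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return initiateWatch.call();
            } catch (ApiException e) {
                // Transient failures (such as the SSL handshake error seen in the logs)
                // are retried after a short pause; the last failure is rethrown once
                // all attempts are exhausted.
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis);
                }
            }
        }
        throw lastFailure;
    }
}
```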
Steps To Reproduce None
Expected behavior The data job deployment should not fail when the Control Service fails to initiate a watch of the Kubernetes builder job.
Additional context The error that happens during watch initialization is:
Nov 3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t at com.vmware.taurus.service.deploy.DeploymentService.updateDeployment(DeploymentService.java:98)
Nov 3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t at com.vmware.taurus.service.deploy.JobImageBuilder.buildImage(JobImageBuilder.java:160)
Nov 3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t at com.vmware.taurus.service.KubernetesService.watchJob(KubernetesService.java:718)
Nov 3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t at com.vmware.taurus.service.KubernetesService.watchJobInternal(KubernetesService.java:738)
Nov 3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t at io.kubernetes.client.util.Watch.createWatch(Watch.java:108)
Nov 3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t Caused by: io.kubernetes.client.ApiException: javax.net.ssl.SSLException: Couldn't kickstart handshaking
Hi @tpalashki ,
Thanks for reporting this.
Can you specify some steps to reproduce (best effort)? Under what circumstances (prerequisites) does the issue happen, to the best of your knowledge? And how frequently, in terms of number of deployments - is it 1 in 100 deployments or 1 in 10?
It is not possible to root-cause this particular error. It happens very rarely (I have seen it only once) and searching for the above error yields no results. However, createWatch can fail for many reasons, and we wouldn't want this to cause a failure of the deployment operation, all the more so given that the builder job completed successfully. My idea is to introduce some failure handling with retries should the watch fail.
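For illustration only, assuming a helper like the WatchRetry sketch above, the Watch.createWatch call at the failing site (KubernetesService.watchJobInternal) could be wrapped roughly as follows. The method name, the attempt count, the backoff, and the way the okhttp Call for listing the builder job is obtained are all assumptions for the sake of the example.

```java
import com.google.gson.reflect.TypeToken;
import com.squareup.okhttp.Call;
import io.kubernetes.client.ApiClient;
import io.kubernetes.client.models.V1Job;
import io.kubernetes.client.util.Watch;

public class WatchJobExample {

    // Hypothetical call-site change: instead of calling Watch.createWatch directly,
    // the watch initiation is passed to the retry helper sketched earlier.
    // listJobCall stands in for however the okhttp Call is built today.
    static Watch<V1Job> watchBuilderJob(ApiClient apiClient, Call listJobCall) throws Exception {
        return WatchRetry.withRetries(
                () -> Watch.createWatch(
                        apiClient,
                        listJobCall,
                        new TypeToken<Watch.Response<V1Job>>() {}.getType()),
                3,        // illustrative: up to three attempts
                5_000L);  // illustrative: five-second pause between attempts
    }
}
```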
Closed as unable to reproduce.