versatile-data-kit

Improve the robustness of data job deployments

Open tpalashki opened this issue 3 years ago • 2 comments

Describe the bug Occasionally, data job deployments fail due to connectivity issues during a Kubernetes watch. The Control Service initiates this watch in order to be notified when the builder job completes. The error is shown under Additional context below.

Improve the robustness of the watch mechanism by introducing error handling with retries.

Steps To Reproduce None

Expected behavior The data job deployment should not fail when the Control Service fails to initiate a watch of the Kubernetes builder job.

Additional context The error that happens during watch initialization is:

Nov  3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t 	at com.vmware.taurus.service.deploy.DeploymentService.updateDeployment(DeploymentService.java:98)
Nov  3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t 	at com.vmware.taurus.service.deploy.JobImageBuilder.buildImage(JobImageBuilder.java:160)
Nov  3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t 	at com.vmware.taurus.service.KubernetesService.watchJob(KubernetesService.java:718)
Nov  3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t 	at com.vmware.taurus.service.KubernetesService.watchJobInternal(KubernetesService.java:738)
Nov  3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t 	at io.kubernetes.client.util.Watch.createWatch(Watch.java:108)
Nov  3 08:35:45 tpcs-dep-774b4b4bdb-g9d9t 	Caused by: io.kubernetes.client.ApiException: javax.net.ssl.SSLException: Couldn't kickstart handshaking

tpalashki avatar Nov 04 '21 09:11 tpalashki

Hi @tpalashki ,

Thanks for reporting this.

Can you specify some steps to reproduce, on a best-effort basis? To the best of your knowledge, under what circumstances (prerequisites) does the issue happen? And how frequently, in terms of number of deployments: is it 1 in 100 deployments or 1 in 10?

antoniivanov avatar Nov 08 '21 13:11 antoniivanov

It is not possible to root-cause this particular error. It happens very rarely (I have seen it only once), and searching for the above error yields no results. However, createWatch can fail for many reasons, and we wouldn't want that to fail the deployment operation (all the more so because the builder job completed successfully). My idea was to introduce some failure handling with retries should the watch fail.
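To make the retry idea concrete, here is a minimal sketch of what such handling could look like. It is not the Control Service's actual code: the class name `WatchRetry`, the method `withRetries`, and its parameters are hypothetical, and it retries on any exception, whereas a real change would probably retry only on the `io.kubernetes.client.ApiException` seen in the log above. It deliberately has no Kubernetes dependency so that it can be run on its own.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an action with exponential backoff. */
public final class WatchRetry {

    private WatchRetry() {}

    /**
     * Calls the action up to maxAttempts times in total. After each failed
     * attempt except the last, sleeps and then doubles the delay.
     * Rethrows the last failure if every attempt fails.
     */
    public static <T> T withRetries(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread is being interrupted.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // Assumption: every other failure is treated as transient.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

With something like this, the code that creates the watch (the `Watch.createWatch` call reached from `watchJobInternal` in the trace above) could be passed as the `action`, so that a single failed handshake no longer fails the whole deployment. The attempt count and delay would need tuning against how long the builder job is expected to run.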

tpalashki avatar Nov 08 '21 14:11 tpalashki

Closed as unable to reproduce.

antoniivanov avatar Aug 31 '22 06:08 antoniivanov