cloud-ml-examples

Dask-Kubernetes Example Internal Server Error

Open isVoid opened this issue 2 years ago • 6 comments

When running the dask_cuML_Exploration notebook, I ran into the following error:

kubernetes_asyncio.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: <CIMultiDictProxy('Audit-Id': '5ed8d1bf-dd87-4a29-ae0d-fa38c6dc254f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'db4c51d8-2442-4906-93b6-7ed03022eb2e', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'aa178aa3-5472-418c-8f79-a301b839fda3', 'Date': 'Tue, 09 Aug 2022 01:17:22 GMT', 'Content-Length': '242')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"The POST operation against Pod could not be completed at this time, please try again.","reason":"ServerTimeout","details":{"name":"POST","kind":"Pod"},"code":500}

The client environment is set up with this Dockerfile: https://github.com/isVoid/cloud-ml-examples/blob/f2683711b7ff5d1c3ae15ba41b4e828eccf8b2a3/dask/kubernetes/Dockerfile

Pod specs for the worker and scheduler.

The cluster is set up on GCP and controlled from within the client's container.

isVoid avatar Aug 09 '22 01:08 isVoid

The ServerTimeout makes me wonder if this is one of those times where the GKE control plane becomes unavailable due to resizing. What happens if you try again later?

jacobtomlinson avatar Aug 09 '22 08:08 jacobtomlinson

Tried again 7 hours later and ran into the same issue.

isVoid avatar Aug 09 '22 16:08 isVoid

In a local cluster, I tried setting up the pods with the same specs described above. The error I ran into this time is that when dask-kubernetes scales up a worker pod that has a name defined, the control plane fails to scale up because it attempts to create more worker pods using the same name defined in worker-spec.yaml. I'm not sure if this is the same 500 error I ran into above, but hopefully it gives some insight.

import dask_kubernetes
dask_kubernetes.__version__
'2022.7.0'

Environment from: https://jacobtomlinson.dev/posts/2022/running-kubeflow-inside-kind-with-gpu-support/

isVoid avatar Aug 09 '22 18:08 isVoid

I think this line should be name: dask-cuda-worker-{uuid} to ensure the name is unique.

https://github.com/rapidsai/cloud-ml-examples/blob/f2683711b7ff5d1c3ae15ba41b4e828eccf8b2a3/dask/kubernetes/specs/worker-spec.yaml#L16

jacobtomlinson avatar Aug 10 '22 12:08 jacobtomlinson
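The idea behind a unique name can be sketched in plain Python. This is only an illustration of why a random suffix avoids collisions between worker pods; the helper `unique_pod_name` and the 8-character suffix length are assumptions made for this example and are not how dask-kubernetes fills in `{uuid}`.

```python
import uuid


def unique_pod_name(prefix: str = "dask-cuda-worker") -> str:
    """Return a pod name with a short random hex suffix (hypothetical helper)."""
    # uuid4().hex contains only the characters 0-9 and a-f, so the result
    # stays within the lowercase alphanumerics and '-' that pod names allow.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


# Each call yields a different name, so creating several workers
# does not reuse one fixed name from the spec.
names = [unique_pod_name() for _ in range(5)]
for name in names:
    print(name)
```

Kubernetes itself also offers `metadata.generateName`, which asks the API server to append a random suffix to the given prefix; whether that fits this particular worker spec is not established in this thread.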

Doesn't seem to parameterize?

kubernetes_asyncio.client.exceptions.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: <CIMultiDictProxy('Audit-Id': '01188928-aca4-4d94-9e8d-fc89df88e03f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7d3b09b3-048f-4edd-b335-46a87cfce5a2', 'X-Kubernetes-Pf-Prioritylevel-Uid': '13223eb9-a260-4d6f-b264-e6a6cede27c6', 'Date': 'Thu, 11 Aug 2022 22:00:17 GMT', 'Content-Length': '1537')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"gpu_worker\" is invalid: [metadata.name: Invalid value: \"gpu_worker\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.containers[0].name: Invalid value: \"dask-cuda-worker-{uuid}\": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')]","reason":"Invalid","details":{"name":"gpu_worker","kind":"Pod","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"gpu_worker\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","field":"metadata.name"},{"reason":"FieldValueInvalid","message":"Invalid value: \"dask-cuda-worker-{uuid}\": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')","field":"spec.containers[0].name"}]},"code":422}

isVoid avatar Aug 11 '22 22:08 isVoid
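The response body above is a Kubernetes `Status` object in JSON, and its `details.causes` list names each offending field. A minimal sketch of pulling those out, assuming the raw body is available as a string (the `body` literal below is an abbreviated copy of the response above with the long messages truncated, and `summarize_status` is a hypothetical helper):

```python
import json

# Abbreviated copy of the 422 body above; messages are truncated for readability.
body = r'''{"kind":"Status","status":"Failure","reason":"Invalid","code":422,
"details":{"name":"gpu_worker","kind":"Pod","causes":[
{"reason":"FieldValueInvalid","message":"Invalid value: \"gpu_worker\"","field":"metadata.name"},
{"reason":"FieldValueInvalid","message":"Invalid value: \"dask-cuda-worker-{uuid}\"","field":"spec.containers[0].name"}]}}'''


def summarize_status(raw: str) -> list[tuple[str, str]]:
    """Return (field, message) pairs from a Status body (hypothetical helper)."""
    status = json.loads(raw)
    # 'details' and 'causes' may be absent (as in the 500 response), so default to empty.
    causes = status.get("details", {}).get("causes", [])
    return [(cause.get("field"), cause.get("message")) for cause in causes]


for field, message in summarize_status(body):
    print(f"{field}: {message}")
```

Read this way, the response reports two separate problems: `metadata.name` is `gpu_worker`, and the `{uuid}` placeholder ended up, unsubstituted, in `spec.containers[0].name`.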

Looks like you have an underscore in the name which is not a valid character.

jacobtomlinson avatar Aug 12 '22 14:08 jacobtomlinson
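Names can be checked locally before submitting a spec. The sketch below reuses the two regular expressions quoted in the 422 response; the function names are hypothetical, and the length limits (253 characters for a DNS subdomain, 63 for a DNS label) come from the Kubernetes naming rules rather than from the error message.

```python
import re

# Patterns copied from the validation messages in the 422 response.
LABEL_RE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
SUBDOMAIN_RE = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_valid_pod_name(name: str) -> bool:
    """Check metadata.name against the RFC 1123 subdomain rule."""
    return len(name) <= 253 and SUBDOMAIN_RE.fullmatch(name) is not None


def is_valid_container_name(name: str) -> bool:
    """Check spec.containers[].name against the RFC 1123 label rule."""
    return len(name) <= 63 and LABEL_RE.fullmatch(name) is not None


print(is_valid_pod_name("gpu_worker"))                      # underscore is rejected
print(is_valid_pod_name("gpu-worker"))                      # hyphen is accepted
print(is_valid_container_name("dask-cuda-worker-{uuid}"))   # braces are rejected
```

Both values from the failing request are rejected by these checks: the underscore in `gpu_worker` and the literal braces in `dask-cuda-worker-{uuid}`.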