cloud-ml-examples
Dask-Kubernetes Example Internal Server Error
When running the dask_cuML_Exploration notebook, I ran into the following error:
kubernetes_asyncio.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: <CIMultiDictProxy('Audit-Id': '5ed8d1bf-dd87-4a29-ae0d-fa38c6dc254f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'db4c51d8-2442-4906-93b6-7ed03022eb2e', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'aa178aa3-5472-418c-8f79-a301b839fda3', 'Date': 'Tue, 09 Aug 2022 01:17:22 GMT', 'Content-Length': '242')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"The POST operation against Pod could not be completed at this time, please try again.","reason":"ServerTimeout","details":{"name":"POST","kind":"Pod"},"code":500}
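For anyone triaging a similar failure, the response body is a Kubernetes Status object, so the useful fields can be pulled out with plain JSON parsing. This is only a minimal sketch: the `summarize_status` helper is a hypothetical name used for illustration and is not part of kubernetes_asyncio or dask-kubernetes.

```python
import json

# Response body copied verbatim from the ApiException above.
body = (
    '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    '"message":"The POST operation against Pod could not be completed at this time, please try again.",'
    '"reason":"ServerTimeout","details":{"name":"POST","kind":"Pod"},"code":500}'
)


def summarize_status(raw: str) -> dict:
    """Hypothetical helper: keep only the fields that identify the failure."""
    status = json.loads(raw)
    return {
        "code": status.get("code"),
        "reason": status.get("reason"),
        "kind": status.get("details", {}).get("kind"),
    }


summary = summarize_status(body)
print(summary)
```

Here the reason is ServerTimeout on a Pod POST, which by itself does not say whether the request was malformed or the API server was simply unavailable.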
The client environment is set up with this Dockerfile: https://github.com/isVoid/cloud-ml-examples/blob/f2683711b7ff5d1c3ae15ba41b4e828eccf8b2a3/dask/kubernetes/Dockerfile
Pod specs for worker and scheduler.
The cluster is set up on GCP and controlled from within the client's container.
The ServerTimeout makes me wonder if this is one of those times where the GKE control plane goes unavailable due to resizing. What happens if you try again later?
Tried again 7 hours later and ran into the same issue.
In a local cluster, I tried setting up the pods with the same specs described above. The error I ran into this time is that when dask-kubernetes scales up a worker pod that has name defined, the control plane fails to scale up because it attempts to create more worker pods using the same name defined in worker-spec.yaml. Not sure if this is the same 500 error I ran into above, but hopefully this gives some insight.
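One way to sidestep the name collision, sketched under assumptions: drop the fixed `metadata.name` from the worker spec and set `metadata.generateName` instead, so the Kubernetes API server appends a unique suffix to each pod it creates. The `use_generate_name` helper, the prefix, and the example spec (including the image) are hypothetical and only mirror the shape of a pod spec; whether dask-kubernetes sets or overrides `generateName` itself has not been verified here.

```python
import copy


def use_generate_name(pod_spec: dict, prefix: str = "dask-cuda-worker-") -> dict:
    """Hypothetical helper: return a copy of the spec without a fixed pod name."""
    spec = copy.deepcopy(pod_spec)
    metadata = spec.setdefault("metadata", {})
    # A fixed name means every worker pod would be created with the same name.
    metadata.pop("name", None)
    # With generateName set and name absent, the API server picks a unique suffix.
    metadata["generateName"] = prefix
    return spec


# Illustrative spec only; the real one lives in worker-spec.yaml.
worker_spec = {
    "kind": "Pod",
    "metadata": {"name": "dask-cuda-worker", "labels": {"app": "dask"}},
    "spec": {"containers": [{"name": "dask-cuda-worker", "image": "example/image:tag"}]},
}

patched = use_generate_name(worker_spec)
print(patched["metadata"])
```

The container name under `spec.containers` is left untouched, since it only has to be unique within a single pod.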
import dask_kubernetes
dask_kubernetes.__version__
'2022.7.0'
Environment from: https://jacobtomlinson.dev/posts/2022/running-kubeflow-inside-kind-with-gpu-support/
I think this line should be name: dask-cuda-worker-{uuid} to ensure the name is unique.
https://github.com/rapidsai/cloud-ml-examples/blob/f2683711b7ff5d1c3ae15ba41b4e828eccf8b2a3/dask/kubernetes/specs/worker-spec.yaml#L16
Doesn't seem to parameterize?
kubernetes_asyncio.client.exceptions.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: <CIMultiDictProxy('Audit-Id': '01188928-aca4-4d94-9e8d-fc89df88e03f', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '7d3b09b3-048f-4edd-b335-46a87cfce5a2', 'X-Kubernetes-Pf-Prioritylevel-Uid': '13223eb9-a260-4d6f-b264-e6a6cede27c6', 'Date': 'Thu, 11 Aug 2022 22:00:17 GMT', 'Content-Length': '1537')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"gpu_worker\" is invalid: [metadata.name: Invalid value: \"gpu_worker\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.containers[0].name: Invalid value: \"dask-cuda-worker-{uuid}\": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')]","reason":"Invalid","details":{"name":"gpu_worker","kind":"Pod","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"gpu_worker\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","field":"metadata.name"},{"reason":"FieldValueInvalid","message":"Invalid value: \"dask-cuda-worker-{uuid}\": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')","field":"spec.containers[0].name"}]},"code":422}
Looks like you have an underscore in the name which is not a valid character.
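To catch both problems from the 422 before submitting the spec, the names can be checked locally against the regexes quoted in the error message. This is a rough sketch: the helper names are made up for illustration, and filling `{uuid}` by hand with `str.format` is an assumption about how one might work around the placeholder not being substituted.

```python
import re
import uuid

# Regexes copied from the 422 response body above.
LABEL_RE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
SUBDOMAIN_RE = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_valid_label(name: str) -> bool:
    """RFC 1123 label (e.g. a container name): at most 63 characters."""
    return len(name) <= 63 and LABEL_RE.fullmatch(name) is not None


def is_valid_subdomain(name: str) -> bool:
    """RFC 1123 subdomain (e.g. a pod name): at most 253 characters."""
    return len(name) <= 253 and SUBDOMAIN_RE.fullmatch(name) is not None


# The two values rejected in the error above.
print(is_valid_subdomain("gpu_worker"))            # underscore is not allowed
print(is_valid_label("dask-cuda-worker-{uuid}"))   # literal braces are not allowed

# Assumed workaround: fix the underscore and fill the placeholder manually.
container_name = "dask-cuda-worker-{uuid}".format(uuid=uuid.uuid4().hex[:8])
print(is_valid_subdomain("gpu-worker"), is_valid_label(container_name))
```

Note that the container name is rejected because the `{uuid}` placeholder reached the API server unsubstituted, which is a separate problem from the underscore in the pod name.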