kepler-model-server
Intermittent CI error: DNS fails to resolve the service
What happened?
CI fails intermittently with the errors below. Rerunning the job several times usually lets it pass.
error: connection error: Post "http://kepler-model-server.kepler.svc.cluster.local:8100/model": dial tcp: lookup kepler-model-server.kepler.svc.cluster.local on 10.96.0.10:53: no such host (http://kepler-model-server.kepler.svc.cluster.local:8100/model))
Error from server (InternalError): error when creating "tasks/train-task.yaml": Internal error occurred: failed calling webhook "webhook.pipeline.tekton.dev": failed to call webhook: Post "https://tekton-pipelines-webhook.tekton-pipelines.svc:443/defaulting?timeout=10s": dial tcp 10.96.111.114:443: connect: connection refused
What did you expect to happen?
CI should pass reliably without reruns. The root cause needs to be investigated and fixed.
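Both errors look like requests sent before the target service was reachable ("no such host" for the model server, "connection refused" for the Tekton webhook). Until the root cause is known, one possible mitigation is to wrap the affected CI steps in a bounded retry. The following is only a sketch: the `retry` helper is hypothetical and not part of the repository, and the attempt counts, delays, and deployment names in the commented usage lines are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): retry a command a bounded
# number of times, sleeping a fixed delay between attempts.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  # Re-run the command until it exits with status 0.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Possible usage in a CI step. Commented out so the sketch has no side
# effects; the deployment names and timings are assumptions.
# retry 12 10 kubectl -n tekton-pipelines wait --for=condition=Available deployment/tekton-pipelines-webhook --timeout=10s
# retry 12 10 kubectl -n kepler wait --for=condition=Available deployment/kepler-model-server --timeout=10s
# retry 5 10 kubectl apply -f tasks/train-task.yaml
```

A retry only hides the flakiness. It does not explain why the service name is not resolvable or why the webhook refuses connections at that point in the run, so the investigation is still needed.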
How can we reproduce it (as minimally and precisely as possible)?
Push a PR to trigger CI. The failure is intermittent, so several runs may be needed.
Anything else we need to know?
No response
Kepler image tag
Deployment
- [ ] Model server
- [ ] Estimator
- [ ] Online trainer
- [ ] Offline trainer
- [ ] Profiler
Kepler model server image tag if deployed
Kepler estimator image tag if deployed
Kepler online trainer image tag if deployed
Kepler offline trainer image tag if deployed
Kepler profiler image tag if deployed
Kubernetes version
$ kubectl version
# paste output here
Install tools
Kepler deployment config
For Kubernetes:
$ KEPLER_NAMESPACE=kepler
# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here
# provide kepler model server configmap if Kepler Model Server is deployed
$ kubectl get configmap kepler-model-server-cfm -n ${KEPLER_NAMESPACE}
# paste output here
# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
For standalone: