Failed to watch *v2beta1.MPIJob: failed to list *v2beta1.MPIJob: ...
Sometimes I encounter the following log in deploy/mpi-operator, and the operator does not respond to my apply request. Is this a bug or what?
cc @zw0610
Could you post the yaml file of the MPIJob that you posted? It seems like there is some issue with an EnvVar field in the MPIJob: could it be that the value of an environment variable was set as the boolean value true rather than the string value "true"?
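For example, the difference looks roughly like this (a minimal sketch, not taken from your job; the variable name is only an example):

```yaml
# Problematic form: a bare YAML boolean. It can end up stored in the
# cluster, but the operator cannot decode it into the typed MPIJob
# struct, because EnvVar.Value in Kubernetes is a string field.
env:
  - name: OMPI_ALLOW_RUN_AS_ROOT   # example variable only
    value: true
---
# Working form: quote the value so it is parsed as a string.
env:
  - name: OMPI_ALLOW_RUN_AS_ROOT
    value: "true"
```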
@zw0610 My yaml is as follows:
And when this happens, even if I rollout restart deploy/mpi-operator, it does not recover from this condition; it only recovers automatically a few minutes later.
Could you post all EnvVar parts in the MPIJob rather than the cropped one?
This is already the full yaml content (no missing lines/parts). I did not write any EnvVar myself in this yaml file.
Could you get it from the APIServer via kubectl instead of the source file you posted? There might be some legacy jobs or a mutating webhook that makes the job in the cluster different from the source file.
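For example (the job name simple-mpi and the default namespace are guesses based on the pod names; substitute the real ones):

```sh
# Fetch the MPIJob as stored by the APIServer rather than the local file.
# "simple-mpi" and "default" are assumptions; adjust name/namespace as needed.
kubectl get mpijob simple-mpi -n default -o yaml
```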
This is the result of kubectl get pod simple-mpi-worker-1 -o yaml:
And this is the result of kubectl get pod simple-mpi-launcher-xxxx -o yaml:
I'm afraid I cannot link this mpijob to the error message on the top unless there are other mpijobs.
Is this the only MPIJob in the cluster?
You wouldn't have pods for this MPIJob, because it fails to parse.
Is this the only MPIJob in the cluster?
No. But even if someone else's MPIJob caused this, why does it make the whole mpi-operator stop working? I would not expect it to stop working just because someone else applied a wrong yaml. Am I missing something?
It's how the k8s go client works: it needs to fetch all MPIJobs in the cluster. If one of them doesn't parse, it fails.
I'm not sure if there is a way to skip the failing one. A quick google search didn't give me positive results.
In general, we would avoid these problems by having a validation webhook. Unfortunately, we haven't set up one yet.
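For reference, registering such a webhook would look roughly like this. This is only a sketch: the names, namespace, and path are hypothetical, and mpi-operator does not ship a webhook server yet:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: mpijob-validator                # hypothetical name
webhooks:
  - name: mpijobs.kubeflow.org          # hypothetical webhook identifier
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                 # reject objects the webhook cannot validate
    rules:
      - apiGroups: ["kubeflow.org"]
        apiVersions: ["v2beta1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["mpijobs"]
    clientConfig:
      service:
        name: mpi-operator-webhook      # hypothetical service that would back it
        namespace: mpi-operator
        path: /validate-mpijob
```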
OK. But how can I get the full yaml content of the offending MPIJob to trace it? From the log, I have no idea which yaml (or whose) caused it. If I can see its full content, I can easily find the creator.
I would suggest using kubectl get -A mpijobs -o yaml and searching for OMPI_ALLOW_RUN_AS_ROOT.
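Something like this should surface the offending entry (the grep flags are just one way to add context around the match):

```sh
# Dump every MPIJob in every namespace and locate the suspect variable;
# kubectl prints the raw objects, so this works even when the typed
# client in the operator cannot decode them.
kubectl get mpijobs -A -o yaml | grep -n -B 5 OMPI_ALLOW_RUN_AS_ROOT
```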
Understood. Thanks!