mpi-operator Failed to watch *v2beta1.MPIJob: failed to list *v2beta1.MPIJob: ...

Sometimes I encounter the following log in deploy/mpi-operator, and the operator does not respond to my apply request.

Is this a bug or what?

Aug 23 '22 04:08 thuzhf

cc @zw0610

Aug 23 '22 04:08 gaocegege

Could you post the YAML file of the MPIJob that you applied? It looks like there may be an issue with the EnvVar field in the MPIJob: the value of an environment variable cannot be the boolean true; it has to be the string "true".

Aug 23 '22 04:08 zw0610

@zw0610 My yaml is as follows:

Aug 23 '22 04:08 thuzhf

When this happens, the operator does not recover even if I rollout restart deploy/mpi-operator. It only recovers on its own a few minutes later.

Aug 23 '22 04:08 thuzhf

Could you post all the EnvVar sections of the MPIJob rather than a cropped excerpt?

Aug 23 '22 04:08 zw0610

This is already the full YAML content (no missing lines or parts). I did not set any EnvVar myself in this YAML file.

Aug 23 '22 04:08 thuzhf

Could you fetch the job from the API server via kubectl instead of posting the source file? A legacy job or a mutating webhook might make the job in the cluster differ from the source file.

Aug 23 '22 04:08 zw0610

This is the result of kubectl get pod simple-mpi-worker-1 -o yaml:

Aug 23 '22 05:08 thuzhf

And this is the result of kubectl get pod simple-mpi-launcher-xxxx -o yaml:

Aug 23 '22 05:08 thuzhf

I'm afraid I cannot link this MPIJob to the error message at the top unless there are other MPIJobs.

Aug 23 '22 06:08 zw0610

Is this the only MPIJob in the cluster?

You wouldn't have pods for this MPIJob, because it fails to parse.

Aug 23 '22 13:08 alculquicondor

Is this the only MPIJob in the cluster?

No. But even if someone else's MPIJob caused this, why does it stop the whole mpi-operator from working? I would not expect the operator to break because someone else applied an invalid YAML. Am I missing something?

Aug 23 '22 13:08 thuzhf

It's how the k8s go client works: it needs to fetch all MPIJobs in the cluster. If one of them doesn't parse, it fails.

I'm not sure if there is a way to skip the failing one. A quick Google search didn't turn up anything useful.
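To make the all-or-nothing behaviour concrete, here is a rough Python imitation of strictly typed list decoding. It is only an analogy and not the Go client's code: the decoder functions, job names, and error text are hypothetical, and the boolean env value follows the guess made earlier in the thread.

```python
# Illustration only: imitates all-or-nothing typed decoding of a list response.
# This is NOT the Kubernetes Go client; every name and value here is hypothetical.


def decode_env_var(raw: dict) -> dict:
    """Strictly decode one env var; the value must be a string."""
    value = raw.get("value", "")
    if not isinstance(value, str):
        raise TypeError(
            f"env var {raw.get('name')!r}: cannot decode "
            f"{type(value).__name__} into string"
        )
    return {"name": raw["name"], "value": value}


def decode_job(raw: dict) -> dict:
    """Decode one job, including all of its env vars."""
    return {
        "name": raw["name"],
        "env": [decode_env_var(entry) for entry in raw.get("env", [])],
    }


def decode_job_list(raw_items: list) -> list:
    """Decode every job; a single bad item raises and nothing is returned."""
    return [decode_job(item) for item in raw_items]


items = [
    # Valid: the value is the string "true".
    {"name": "good-job", "env": [{"name": "OMPI_ALLOW_RUN_AS_ROOT", "value": "true"}]},
    # Invalid: the value is the boolean True.
    {"name": "bad-job", "env": [{"name": "OMPI_ALLOW_RUN_AS_ROOT", "value": True}]},
]

try:
    decode_job_list(items)
    list_error = None
except TypeError as err:
    # The good job is lost too, because the whole list failed to decode.
    list_error = str(err)

print(list_error)
```

In this sketch the valid job never reaches the caller, which matches the symptom in this thread: one unparseable MPIJob keeps the operator from seeing any of them.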

Aug 23 '22 13:08 alculquicondor

In general, we would avoid these problems by having a validation webhook. Unfortunately, we haven't set up one yet.

Aug 23 '22 13:08 alculquicondor

OK. But how can I find the full YAML of the offending MPIJob so that I can trace it? The log gives no indication of which YAML, or whose, caused the error. With the full content, I could easily find its creator.

Aug 23 '22 13:08 thuzhf

I would suggest running kubectl get -A mpijobs -o yaml and searching the output for OMPI_ALLOW_RUN_AS_ROOT.
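If searching by eye is tedious, a small script can do the same search over the JSON form of that output (kubectl get -A mpijobs -o json). This is a minimal sketch that assumes the culprit is an env entry whose value is not a string; the sample data, namespaces, and job names are made up, and the function names are hypothetical.

```python
# Sketch: find env entries with non-string values in an MPIJob list.
# The sample below is invented; real input would come from
# `kubectl get -A mpijobs -o json`.


def find_non_string_env(node, path=""):
    """Yield (path, env_name, value) for each env entry whose value is not a string."""
    if isinstance(node, dict):
        for key, child in node.items():
            child_path = f"{path}.{key}" if path else key
            if key == "env" and isinstance(child, list):
                for entry in child:
                    if (
                        isinstance(entry, dict)
                        and "value" in entry
                        and not isinstance(entry["value"], str)
                    ):
                        yield child_path, entry.get("name"), entry["value"]
            # Keep walking so nested containers are covered as well.
            yield from find_non_string_env(child, child_path)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_non_string_env(child, f"{path}[{index}]")


def report(job_list):
    """Return one 'namespace/name: path ENV=value' line per offending env entry."""
    lines = []
    for job in job_list.get("items", []):
        meta = job.get("metadata", {})
        for path, env_name, value in find_non_string_env(job.get("spec", {}), "spec"):
            lines.append(
                f"{meta.get('namespace')}/{meta.get('name')}: {path} {env_name}={value!r}"
            )
    return lines


def launcher_spec(env_value):
    """Build a minimal, hypothetical MPIJob spec with one launcher env var."""
    container = {
        "name": "launcher",
        "env": [{"name": "OMPI_ALLOW_RUN_AS_ROOT", "value": env_value}],
    }
    return {"mpiReplicaSpecs": {"Launcher": {"template": {"spec": {"containers": [container]}}}}}


sample = {
    "items": [
        {"metadata": {"namespace": "team-a", "name": "good-job"}, "spec": launcher_spec("true")},
        {"metadata": {"namespace": "team-b", "name": "bad-job"}, "spec": launcher_spec(True)},
    ]
}

findings = report(sample)
for line in findings:
    print(line)
```

To run this against a cluster, replace sample with json.load(sys.stdin) (adding import json and import sys) and pipe the kubectl output into the script. The namespace and name in each reported line should be enough to find the creator of the job.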

Aug 23 '22 14:08 alculquicondor

Understood. Thanks!

Aug 23 '22 14:08 thuzhf
