
Failed to watch *v2beta1.MPIJob: failed to list *v2beta1.MPIJob: ...

Open thuzhf opened this issue 2 years ago • 17 comments

Sometimes I encounter the following log in deploy/mpi-operator, and the operator does not respond to my apply request. image

Is this a bug or what?

thuzhf avatar Aug 23 '22 04:08 thuzhf

cc @zw0610

gaocegege avatar Aug 23 '22 04:08 gaocegege

Could you post the YAML of the MPIJob you submitted? This looks like an issue with the EnvVar field in the MPIJob: the value of an environment variable must be the string "true", not the boolean true.

zw0610 avatar Aug 23 '22 04:08 zw0610

@zw0610 My yaml is as follows: image

thuzhf avatar Aug 23 '22 04:08 thuzhf

> Sometimes I encounter the following log in deploy/mpi-operator, and the operator does not respond to my apply request. image
>
> Is this a bug or what?

When this happens, even a rollout restart of deploy/mpi-operator does not fix it. A few minutes later, it recovers on its own.

thuzhf avatar Aug 23 '22 04:08 thuzhf

> @zw0610 My yaml is as follows: image

Could you post all the EnvVar parts of the MPIJob rather than the cropped version?

zw0610 avatar Aug 23 '22 04:08 zw0610

> > @zw0610 My yaml is as follows: image
>
> could you post all EnvVar parts in the MPIJob rather than the cropped one?

This is already the full YAML (no lines or parts are missing). I did not write any EnvVar myself in this file.

thuzhf avatar Aug 23 '22 04:08 thuzhf

Could you fetch the job from the API server via kubectl instead of posting the source file? There might be legacy jobs or a mutating webhook that makes the job in the cluster differ from the source file.

zw0610 avatar Aug 23 '22 04:08 zw0610

This is the result of kubectl get pod simple-mpi-worker-1 -o yaml: image

thuzhf avatar Aug 23 '22 05:08 thuzhf

And this is the result of kubectl get pod simple-mpi-launcher-xxxx -o yaml: image

thuzhf avatar Aug 23 '22 05:08 thuzhf

I'm afraid I cannot link this MPIJob to the error message at the top unless there are other MPIJobs in the cluster.

zw0610 avatar Aug 23 '22 06:08 zw0610

Is this the only MPIJob in the cluster?

You wouldn't have pods for this MPIJob, because it fails to parse.

alculquicondor avatar Aug 23 '22 13:08 alculquicondor

> Is this the only MPIJob in the cluster?

No. But even if someone else's MPIJob caused this, why does it stop the whole mpi-operator from working? I would not expect the operator to break because someone else applied an invalid YAML. Am I missing something?

thuzhf avatar Aug 23 '22 13:08 thuzhf

It's how the k8s Go client works: it needs to fetch all MPIJobs in the cluster. If one of them doesn't parse, the whole list call fails.

I'm not sure if there is a way to skip the failing one. A quick Google search didn't turn up anything.

alculquicondor avatar Aug 23 '22 13:08 alculquicondor
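To illustrate the all-or-nothing behaviour described above, here is a minimal Python sketch. It is a simulation under stated assumptions, not client-go code: `EnvVar`, `decode_env`, and `decode_job_list` are hypothetical names, and each job is reduced to just its `env` list. It only shows why one object with a boolean where a string is expected can make the decode of the entire list fail.

```python
from dataclasses import dataclass


@dataclass
class EnvVar:
    # Mirrors the idea that an env var value is typed as a string.
    name: str
    value: str


def decode_env(raw):
    """Strictly decode one env entry; reject non-string values (hypothetical helper)."""
    value = raw.get("value", "")
    if not isinstance(value, str):
        raise TypeError(
            f"cannot decode {type(value).__name__} into EnvVar.value of type str"
        )
    return EnvVar(raw["name"], value)


def decode_job_list(raw_items):
    """Decode every job in the list; the first bad item aborts the whole call."""
    return [[decode_env(e) for e in item["env"]] for item in raw_items]


# Hypothetical sample data: one valid job and one with a boolean value.
GOOD = {"env": [{"name": "OMPI_ALLOW_RUN_AS_ROOT", "value": "true"}]}
BAD = {"env": [{"name": "OMPI_ALLOW_RUN_AS_ROOT", "value": True}]}
```

Calling `decode_job_list([GOOD])` succeeds, while `decode_job_list([GOOD, BAD])` raises, so the valid job is never returned either. That is the situation reported in this thread: one unparseable MPIJob blocks the list for everyone.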

In general, we would avoid these problems by having a validation webhook. Unfortunately, we haven't set up one yet.

alculquicondor avatar Aug 23 '22 13:08 alculquicondor

OK. But how can I see the full YAML of the offending job to trace it? From the log, I have no idea which YAML, or whose, caused this error. If I could see its full content, I could easily find who created it.

thuzhf avatar Aug 23 '22 13:08 thuzhf

I would suggest running kubectl get -A mpijobs -o yaml and searching for OMPI_ALLOW_RUN_AS_ROOT.

alculquicondor avatar Aug 23 '22 14:08 alculquicondor
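The search suggested above can also be automated. The following is a rough sketch, assuming the output of `kubectl get -A mpijobs -o json` has been parsed into a dict. It walks each job's spec generically, looks for any `env` list, and reports entries whose `value` is not a string. The function names and the sample data (namespaces, job names) are hypothetical.

```python
import json


def _iter_env(node):
    """Recursively yield every dict found inside a list stored under an 'env' key."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "env" and isinstance(child, list):
                for entry in child:
                    if isinstance(entry, dict):
                        yield entry
            else:
                yield from _iter_env(child)
    elif isinstance(node, list):
        for child in node:
            yield from _iter_env(child)


def find_bad_env(doc):
    """Yield (namespace, job name, env name, value) for non-string env values."""
    for item in doc.get("items", [doc]):
        meta = item.get("metadata", {})
        for env in _iter_env(item.get("spec", {})):
            value = env.get("value")
            if value is not None and not isinstance(value, str):
                yield (meta.get("namespace"), meta.get("name"), env.get("name"), value)


# Hypothetical sample shaped like a list of MPIJobs; real output has more fields.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"namespace": "team-a", "name": "ok-job"},
   "spec": {"template": {"spec": {"containers": [
     {"env": [{"name": "OMPI_ALLOW_RUN_AS_ROOT", "value": "true"}]}]}}}},
  {"metadata": {"namespace": "team-b", "name": "broken-job"},
   "spec": {"template": {"spec": {"containers": [
     {"env": [{"name": "OMPI_ALLOW_RUN_AS_ROOT", "value": true}]}]}}}}
]}
""")
```

On real data, the parsed kubectl output would be passed in place of `SAMPLE`. The namespace and name in each result point to the job, and from there to its creator.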

Understood. Thanks!

thuzhf avatar Aug 23 '22 14:08 thuzhf