manifests: tensorboards-controller crash-loops with OOMKilled

Hello there,

We have encountered a problem with the default resource settings of the tensorboards controller.

We deployed Kubeflow following this section about 60 days ago, starting from v1.3, and later upgraded to v1.4.1.

We did not monitor the status of every service, but as far as I can remember, all pods were in the Running state when we first deployed Kubeflow v1.3.

We now notice that the tensorboard-controller pod keeps crashing with OOMKilled.

We increased the memory request to 40Mi and the limit to 60Mi, but the error persisted.

After raising both to 100Mi, the error disappeared.
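
For reference, here is a minimal sketch of how such a memory override could be expressed as a strategic-merge patch. The container name `manager` and the helper `memory_patch` are assumptions for illustration and are not confirmed anywhere in this thread; check the actual Deployment before using it.

```python
import json


def memory_patch(container: str, request: str, limit: str) -> dict:
    """Build a strategic-merge patch overriding one container's memory settings."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are matched by name when the patch is merged.
                            "name": container,
                            "resources": {
                                "requests": {"memory": request},
                                "limits": {"memory": limit},
                            },
                        }
                    ]
                }
            }
        }
    }


# "manager" is an assumed container name, not taken from the thread.
patch = memory_patch("manager", "100Mi", "100Mi")
print(json.dumps(patch, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl patch deployment <name> -n <namespace> --patch-file <file>`, where the Deployment name and namespace depend on the installation.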

Is the default resource limit too small?

Jan 05 '22 07:01 edwardzjl

Hi, yesterday I had the same issue when I upgraded Kubeflow from 1.4.0 to 1.4.1. In my case, tensorboards-controller and training-operator kept crashing with OOMKilled. Setting the memory limit to 100Mi fixed the issue.

On another cluster I tried a fresh install of Kubeflow 1.4.1 and did not observe this issue.

Jan 12 '22 08:01 veiii

/priority p2 /kind question /area installation @kubeflow/wg-notebooks-leads please review

Jan 13 '22 14:01 jbottum

@jbottum: The label(s) area/installation cannot be applied, because the repository doesn't have them.

In response to this:

/priority p2 /kind question /area installation @kubeflow/wg-notebooks-leads please review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jan 13 '22 14:01 google-oss-prow[bot]

Thank you @veiii for your comments. Increasing the memory to 100Mi for tensorboard-controller and training-operator solved the restart issue for us in our Kubeflow 1.4 deployment.

Jan 14 '22 01:01 jaiganeshp

Indeed, most of the controllers have very strict memory limits.

We could also use bigger defaults, such as 100Mi for requests and something like 1Gi for limits.
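
To put the values mentioned in this thread side by side, here is a small sketch that converts Kubernetes binary-suffix memory quantities to bytes. The helper `to_bytes` is hypothetical and only handles the `Ki`, `Mi`, and `Gi` suffixes with integer values; it is not a full quantity parser.

```python
# Binary suffixes used by Kubernetes memory quantities (subset only).
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '100Mi' to bytes; plain integers pass through."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


# Values tried or proposed in this thread.
for q in ("40Mi", "60Mi", "100Mi", "1Gi"):
    print(f"{q:>6} = {to_bytes(q)} bytes")
```

With these numbers, a 1Gi limit leaves roughly ten times the headroom of the 100Mi value that stopped the restarts.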

Jan 24 '22 14:01 kimwnasptd

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

Apr 28 '22 16:04 stale[bot]

/close

this belongs to Kubeflow/kubeflow

Jan 11 '24 17:01 juliusvonkohout

@juliusvonkohout: Closing this issue.

In response to this:

/close

this belongs to Kubeflow/kubeflow

Jan 11 '24 17:01 google-oss-prow[bot]