Tensorboards-controller crash loop due to OOMKilled
Hello there,
We have encountered a problem with the default resource limits of the tensorboards controller.
We deployed Kubeflow following this section about 60 days ago, starting from v1.3 and later upgrading to v1.4.1.
We did not monitor the status of every service, but as far as I can remember, all pods were in the Running state when we first deployed Kubeflow v1.3.
We have now noticed that the tensorboard-controller pod keeps crashing with OOMKilled.
We raised the memory requests to 40Mi and the limits to 60Mi, but the error persisted.
After raising both to 100Mi, the error disappeared.
Is the default resource limit too small?
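For anyone who wants to try the same workaround, here is a minimal sketch of the patch body that raises both the memory request and limit to 100Mi. It only builds the JSON; the container name "manager" is an assumption and should be checked against the actual Deployment before use.

```python
import json


def memory_patch(container, request, limit):
    """Build a strategic merge patch body that sets the memory
    request and limit for one named container of a Deployment."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            "resources": {
                                "requests": {"memory": request},
                                "limits": {"memory": limit},
                            },
                        }
                    ]
                }
            }
        }
    }


# "manager" is an assumed container name, not confirmed in this thread.
patch = memory_patch("manager", "100Mi", "100Mi")
print(json.dumps(patch))
```

The printed JSON could then be passed to `kubectl patch deployment <deployment-name> -n <namespace> --patch '<json>'`, with the deployment name and namespace filled in from your own cluster.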
Hi, I hit the same issue yesterday when I upgraded Kubeflow from 1.4.0 to 1.4.1. In my case tensorboards-controller and training-operator kept crashing with OOMKilled. Setting the memory limit to 100Mi fixed the issue.
On another cluster I tried a fresh install of Kubeflow 1.4.1 and did not observe this issue.
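To check which controllers are affected on a given cluster, something like the following sketch may help. It scans output shaped like `kubectl get pods -o json` for containers whose last termination reason was OOMKilled. The sample data is hand-written for illustration and its pod and container names are hypothetical.

```python
def oom_killed_containers(pod_list):
    """Return (pod, container, restart_count) tuples for containers
    whose previous termination reason was OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # lastState.terminated is only present after a restart.
            terminated = cs.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((pod_name, cs["name"], cs.get("restartCount", 0)))
    return hits


# Hand-written sample; all names and counts below are hypothetical.
sample = {
    "items": [
        {
            "metadata": {"name": "tensorboard-controller-example"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "manager",
                        "restartCount": 12,
                        "lastState": {
                            "terminated": {"reason": "OOMKilled", "exitCode": 137}
                        },
                    }
                ]
            },
        },
        {
            "metadata": {"name": "healthy-example"},
            "status": {
                "containerStatuses": [
                    {"name": "main", "restartCount": 0, "lastState": {}}
                ]
            },
        },
    ]
}

print(oom_killed_containers(sample))
```

On a real cluster the input would come from `json.load` on the saved output of `kubectl get pods -n <namespace> -o json`.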
/priority p2 /kind question /area installation @kubeflow/wg-notebooks-leads please review
@jbottum: The label(s) area/installation cannot be applied, because the repository doesn't have them.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you @veiii for your comments. Increasing the memory to 100Mi for tensorboard-controller and training-operator solved the restart issue for us in our Kubeflow 1.4 deployment.
Indeed, most of the controllers have very strict memory limits.
We could also use somewhat bigger defaults, like 100Mi for requests and something like 1Gi for limits.
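As a rough illustration of what those suggested defaults amount to, the sketch below converts the two quantities to bytes and checks that the request fits under the limit. The parser is a simplified assumption that handles only the binary suffixes (Ki, Mi, Gi, Ti) and plain byte counts, not the full Kubernetes quantity syntax.

```python
# Binary suffixes for memory quantities; a deliberately small subset.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_bytes(quantity):
    """Convert a quantity such as "100Mi" or "1Gi" to bytes.
    Raises ValueError for anything this simplified parser cannot read."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


# The defaults proposed above: 100Mi for requests, 1Gi for limits.
resources = {"requests": {"memory": "100Mi"}, "limits": {"memory": "1Gi"}}

req = memory_bytes(resources["requests"]["memory"])
lim = memory_bytes(resources["limits"]["memory"])
print(req, lim, req <= lim)
```

With these values the limit leaves roughly ten times the requested memory as headroom, compared with the 100Mi/100Mi workaround where request and limit are equal.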
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.
@juliusvonkohout: Closing this issue.
In response to this:
/close
This belongs to Kubeflow/kubeflow.