Culling prevents re-starting culled Notebooks
The issue
With Kubeflow 1.6, we've repeatedly observed Notebooks wouldn't start anymore after they've been culled. Apparently, the notebook controller writes the last-activity annotation before culling the Notebook, however, doesn't remove this annotation before start. This causes the Notebook to be culled again before is has a chance to start.
So: Culling is currently broken and not usable.
How to reproduce
- Clone Kubeflow's
v1.6-branch - Enable culling by appending the following patch to
example/kustomization.yaml:
patchesStrategicMerge:
- |
apiVersion: apps/v1
kind: Deployment
metadata:
name: notebook-controller-deployment
namespace: kubeflow
spec:
template:
metadata:
annotations:
sidecar.istio.io/inject: "true"
spec:
containers:
- name: manager
imagePullPolicy: IfNotPresent
env:
- name: ADD_FSGROUP
value: "false"
- name: ENABLE_CULLING
value: "true"
- name: CULL_IDLE_TIME
value: "360"
- name: IDLENESS_CHECK_PERIOD
value: "10"
- Start a Notebook using any name or image
- Simulate culling by stopping the notebook using Kubeflow's UI. Then run:
kubectl -n kubeflow-user-example-com annotate notebook --overwrite --all notebooks.kubeflow.org/last-activity=2022-01-01T00:00:00Z. Alternatively, reduceCULL_IDLE_TIMEto some low value like "3". - The notebook won't start anymore. You'll see the following notebook controller logs:
1.6631558760852754e+09 INFO controllers.Notebook Updating StatefulSet {"notebook": "kubeflow-user-example-com/test123", "namespace": "kubeflow-user-example-com", "name": "test123"}
1.6631558760989938e+09 INFO controllers.Notebook Pod not found... {"notebook": "kubeflow-user-example-com/test123"}
1.6631558760990713e+09 INFO controllers.Notebook Initializing Notebook CR Status {"notebook": "kubeflow-user-example-com/test123"}
1.663155876099081e+09 INFO controllers.Notebook Calculating Notebook's containerState {"notebook": "kubeflow-user-example-com/test123"}
1.6631558760990841e+09 ERROR controllers.Notebook Could not find container with the same name as Notebook in containerStates of Pod. Will not update notebook's status.containerState {"notebook": "kubeflow-user-example-com/test123"}
github.com/kubeflow/kubeflow/components/notebook-controller/controllers.updateNotebookStatus
/workspace/notebook-controller/controllers/notebook_controller.go:289
github.com/kubeflow/kubeflow/components/notebook-controller/controllers.(*NotebookReconciler).Reconcile
/workspace/notebook-controller/controllers/notebook_controller.go:221
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.663155876099124e+09 INFO controllers.Notebook Calculating Notebook's Conditions {"notebook": "kubeflow-user-example-com/test123"}
1.663155876099128e+09 INFO controllers.Notebook Updating Notebook CR Status {"notebook": "kubeflow-user-example-com/test123", "status": {"conditions":[],"readyReplicas":0,"containerState":{}}}
1.6631558761115203e+09 INFO culler Updating the last-activity annotation. {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761115892e+09 INFO culler last-activity annotation exists. Checking /api/kernels {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761161928e+09 INFO culler Warning: GET to http://test123.kubeflow-user-example-com.svc.cluster.local/notebook/kubeflow-user-example-com/test123/api/kernels: 503
1.6631558761163704e+09 INFO culler Could not GET the kernels status. Will not update last-activity. {"notebook": "kubeflow-user-example-com/test123"}
1.663155876116424e+09 INFO controllers.Notebook Notebook kubeflow-user-example-com/test123 needs culling. Setting annotations {"notebook": "kubeflow-user-example-com/test123"}
1.663155876127934e+09 INFO controllers.Notebook Updating StatefulSet {"notebook": "kubeflow-user-example-com/test123", "namespace": "kubeflow-user-example-com", "name": "test123"}
1.6631558761406817e+09 INFO controllers.Notebook Pod not found... {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761407669e+09 INFO controllers.Notebook Initializing Notebook CR Status {"notebook": "kubeflow-user-example-com/test123"}
1.66315587614078e+09 INFO controllers.Notebook Calculating Notebook's containerState {"notebook": "kubeflow-user-example-com/test123"}
1.663155876140785e+09 ERROR controllers.Notebook Could not find container with the same name as Notebook in containerStates of Pod. Will not update notebook's status.containerState {"notebook": "kubeflow-user-example-com/test123"}
github.com/kubeflow/kubeflow/components/notebook-controller/controllers.updateNotebookStatus
/workspace/notebook-controller/controllers/notebook_controller.go:289
github.com/kubeflow/kubeflow/components/notebook-controller/controllers.(*NotebookReconciler).Reconcile
/workspace/notebook-controller/controllers/notebook_controller.go:221
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.6631558761408286e+09 INFO controllers.Notebook Calculating Notebook's Conditions {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761408334e+09 INFO controllers.Notebook Updating Notebook CR Status {"notebook": "kubeflow-user-example-com/test123", "status": {"conditions":[],"readyReplicas":0,"containerState":{}}}
1.6631558761530833e+09 INFO culler Updating the last-activity annotation. {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761531422e+09 INFO culler last-activity annotation exists. Checking /api/kernels {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761573527e+09 INFO culler Warning: GET to http://test123.kubeflow-user-example-com.svc.cluster.local/notebook/kubeflow-user-example-com/test123/api/kernels: 503
1.6631558761574843e+09 INFO culler Could not GET the kernels status. Will not update last-activity. {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761575234e+09 INFO culler Notebook is already stopping {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761578014e+09 INFO controllers.Notebook Updating StatefulSet {"notebook": "kubeflow-user-example-com/test123", "namespace": "kubeflow-user-example-com", "name": "test123"}
1.66315587616976e+09 INFO controllers.Notebook Pod not found... {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761698267e+09 INFO controllers.Notebook Initializing Notebook CR Status {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761698391e+09 INFO controllers.Notebook Calculating Notebook's containerState {"notebook": "kubeflow-user-example-com/test123"}
@kimwnasptd @StefanoFioravanzo is this something for Kubeflow 1.6.1 ?
Unfortunately this is indeed a bug with the current release, and it's reproducible for everyone.
The culprit is the fix we send for setting the status properly in the Notebook Controller https://github.com/kubeflow/kubeflow/pull/6628. Specifically, the problematic lines are the following: https://github.com/kubeflow/kubeflow/blob/8210e0625127510f7ef93b9c355b0335269949ac/components/notebook-controller/controllers/notebook_controller.go#L211-L218
We are not calculating correctly the podFound variable, since a Pod might not be found yet the code will set the var as true. Which results in the culling annotation to not be removed when there is no Pod.
From Notebooks WG, this will need a KF 1.6.1 since it is a serious regression that we let slip in.
I propose the following plan:
- We send a minimal PR to get culling to work again, and cherry-pick to the release branch and have a new release
- Send a subsequent PR in
masterand introduce more robust integration tests to try and catch these culling scenarios in submitted PRs
The lesson learnt, for me at least, is to include only fixes that solve p0 issues during feature freeze, and aim for the minimum changes possible.
cc @annajung