kubeflow icon indicating copy to clipboard operation
kubeflow copied to clipboard

Culling prevents re-starting culled Notebooks

Open TobiasGoerke opened this issue 3 years ago • 3 comments

The issue

With Kubeflow 1.6, we've repeatedly observed Notebooks wouldn't start anymore after they've been culled. Apparently, the notebook controller writes the last-activity annotation before culling the Notebook, however, doesn't remove this annotation before start. This causes the Notebook to be culled again before is has a chance to start.

So: Culling is currently broken and not usable.

How to reproduce

  1. Clone Kubeflow's v1.6-branch
  2. Enable culling by appending the following patch to example/kustomization.yaml:
patchesStrategicMerge:
  - |
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: notebook-controller-deployment
      namespace: kubeflow
    spec:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "true"
        spec:
          containers:
          - name: manager
            imagePullPolicy: IfNotPresent
            env:
              - name: ADD_FSGROUP
                value: "false"
              - name: ENABLE_CULLING
                value: "true"
              - name: CULL_IDLE_TIME
                value: "360"
              - name: IDLENESS_CHECK_PERIOD
                value: "10"
  1. Start a Notebook using any name or image
  2. Simulate culling by stopping the notebook using Kubeflow's UI. Then run: kubectl -n kubeflow-user-example-com annotate notebook --overwrite --all notebooks.kubeflow.org/last-activity=2022-01-01T00:00:00Z. Alternatively, reduce CULL_IDLE_TIME to some low value like "3".
  3. The notebook won't start anymore. You'll see the following notebook controller logs:
1.6631558760852754e+09  INFO    controllers.Notebook    Updating StatefulSet    {"notebook": "kubeflow-user-example-com/test123", "namespace": "kubeflow-user-example-com", "name": "test123"}
1.6631558760989938e+09  INFO    controllers.Notebook    Pod not found...        {"notebook": "kubeflow-user-example-com/test123"}
1.6631558760990713e+09  INFO    controllers.Notebook    Initializing Notebook CR Status {"notebook": "kubeflow-user-example-com/test123"}
1.663155876099081e+09   INFO    controllers.Notebook    Calculating Notebook's  containerState  {"notebook": "kubeflow-user-example-com/test123"}
1.6631558760990841e+09  ERROR   controllers.Notebook    Could not find container with the same name as Notebook in containerStates of Pod. Will not update notebook's status.containerState     {"notebook": "kubeflow-user-example-com/test123"}
github.com/kubeflow/kubeflow/components/notebook-controller/controllers.updateNotebookStatus
        /workspace/notebook-controller/controllers/notebook_controller.go:289
github.com/kubeflow/kubeflow/components/notebook-controller/controllers.(*NotebookReconciler).Reconcile
        /workspace/notebook-controller/controllers/notebook_controller.go:221
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.663155876099124e+09   INFO    controllers.Notebook    Calculating Notebook's Conditions       {"notebook": "kubeflow-user-example-com/test123"}
1.663155876099128e+09   INFO    controllers.Notebook    Updating Notebook CR Status     {"notebook": "kubeflow-user-example-com/test123", "status": {"conditions":[],"readyReplicas":0,"containerState":{}}}
1.6631558761115203e+09  INFO    culler  Updating the last-activity annotation.  {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761115892e+09  INFO    culler  last-activity annotation exists. Checking /api/kernels  {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761161928e+09  INFO    culler  Warning: GET to http://test123.kubeflow-user-example-com.svc.cluster.local/notebook/kubeflow-user-example-com/test123/api/kernels: 503
1.6631558761163704e+09  INFO    culler  Could not GET the kernels status. Will not update last-activity.        {"notebook": "kubeflow-user-example-com/test123"}
1.663155876116424e+09   INFO    controllers.Notebook    Notebook kubeflow-user-example-com/test123 needs culling. Setting annotations   {"notebook": "kubeflow-user-example-com/test123"}
1.663155876127934e+09   INFO    controllers.Notebook    Updating StatefulSet    {"notebook": "kubeflow-user-example-com/test123", "namespace": "kubeflow-user-example-com", "name": "test123"}
1.6631558761406817e+09  INFO    controllers.Notebook    Pod not found...        {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761407669e+09  INFO    controllers.Notebook    Initializing Notebook CR Status {"notebook": "kubeflow-user-example-com/test123"}
1.66315587614078e+09    INFO    controllers.Notebook    Calculating Notebook's  containerState  {"notebook": "kubeflow-user-example-com/test123"}
1.663155876140785e+09   ERROR   controllers.Notebook    Could not find container with the same name as Notebook in containerStates of Pod. Will not update notebook's status.containerState     {"notebook": "kubeflow-user-example-com/test123"}
github.com/kubeflow/kubeflow/components/notebook-controller/controllers.updateNotebookStatus
        /workspace/notebook-controller/controllers/notebook_controller.go:289
github.com/kubeflow/kubeflow/components/notebook-controller/controllers.(*NotebookReconciler).Reconcile
        /workspace/notebook-controller/controllers/notebook_controller.go:221
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.6631558761408286e+09  INFO    controllers.Notebook    Calculating Notebook's Conditions       {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761408334e+09  INFO    controllers.Notebook    Updating Notebook CR Status     {"notebook": "kubeflow-user-example-com/test123", "status": {"conditions":[],"readyReplicas":0,"containerState":{}}}
1.6631558761530833e+09  INFO    culler  Updating the last-activity annotation.  {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761531422e+09  INFO    culler  last-activity annotation exists. Checking /api/kernels  {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761573527e+09  INFO    culler  Warning: GET to http://test123.kubeflow-user-example-com.svc.cluster.local/notebook/kubeflow-user-example-com/test123/api/kernels: 503
1.6631558761574843e+09  INFO    culler  Could not GET the kernels status. Will not update last-activity.        {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761575234e+09  INFO    culler  Notebook is already stopping    {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761578014e+09  INFO    controllers.Notebook    Updating StatefulSet    {"notebook": "kubeflow-user-example-com/test123", "namespace": "kubeflow-user-example-com", "name": "test123"}
1.66315587616976e+09    INFO    controllers.Notebook    Pod not found...        {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761698267e+09  INFO    controllers.Notebook    Initializing Notebook CR Status {"notebook": "kubeflow-user-example-com/test123"}
1.6631558761698391e+09  INFO    controllers.Notebook    Calculating Notebook's  containerState  {"notebook": "kubeflow-user-example-com/test123"}

TobiasGoerke avatar Sep 14 '22 12:09 TobiasGoerke

@kimwnasptd @StefanoFioravanzo is this something for Kubeflow 1.6.1 ?

juliusvonkohout avatar Sep 14 '22 12:09 juliusvonkohout

Unfortunately this is indeed a bug with the current release, and it's reproducible for everyone.

The culprit is the fix we send for setting the status properly in the Notebook Controller https://github.com/kubeflow/kubeflow/pull/6628. Specifically, the problematic lines are the following: https://github.com/kubeflow/kubeflow/blob/8210e0625127510f7ef93b9c355b0335269949ac/components/notebook-controller/controllers/notebook_controller.go#L211-L218

We are not calculating correctly the podFound variable, since a Pod might not be found yet the code will set the var as true. Which results in the culling annotation to not be removed when there is no Pod.

kimwnasptd avatar Sep 14 '22 15:09 kimwnasptd

From Notebooks WG, this will need a KF 1.6.1 since it is a serious regression that we let slip in.

I propose the following plan:

  1. We send a minimal PR to get culling to work again, and cherry-pick to the release branch and have a new release
  2. Send a subsequent PR in master and introduce more robust integration tests to try and catch these culling scenarios in submitted PRs

The lesson learnt, for me at least, is to include only fixes that solve p0 issues during feature freeze, and aim for the minimum changes possible.

cc @annajung

kimwnasptd avatar Sep 14 '22 16:09 kimwnasptd