Flaky test: [It] should delete job when expired time is up

Open tenzen-y opened this issue 2 years ago • 4 comments

------------------------------
• [FAILED] [0.017 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528

  Timeline >>
  STEP: preparing cases succeeded job with TTL 3s @ 05/31/23 14:16:41.447
  STEP: creating a TFJob @ 05/31/23 14:16:41.447
  STEP: getting a created TFJob @ 05/31/23 14:16:41.451
  STEP: prepare pod @ 05/31/23 14:16:41.451
  STEP: update job replica statuses @ 05/31/23 14:16:41.451
  STEP: update job status @ 05/31/23 14:16:41.451
  STEP: updating job status... @ 05/31/23 14:16:41.451
  2023-05-31T14:16:41Z	DEBUG	events	TFJob default/test-bof-0 successfully completed.	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "TFJobSucceeded"}
  2023-05-31T14:16:41Z	DEBUG	events	Created pod: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "SuccessfulCreatePod"}
  2023-05-31T14:16:41Z	DEBUG	events	Created service: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "SuccessfulCreateService"}
  [FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 05/31/23 14:16:41.464
  << Timeline

  [FAILED] Expected success, but got an error:
      <*errors.StatusError | 0xc001988960>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again",
              Reason: "Conflict",
              Details: {
                  Name: "test-bof-0",
                  Group: "kubeflow.org",
                  Kind: "tfjobs",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
      Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-0": the object has been modified; please apply your changes to the latest version and try again
  In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 05/31/23 14:16:41.464
------------------------------

https://github.com/kubeflow/training-operator/actions/runs/5133950363/jobs/9237255986#step:4:208

tenzen-y avatar May 31 '23 17:05 tenzen-y

Similar flaky test:

------------------------------
• [FAILED] [3.037 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528

  Timeline >>
  STEP: preparing cases succeeded job with TTL 3s @ 07/03/23 15:40:21.929
  STEP: creating a TFJob @ 07/03/23 15:40:21.929
  STEP: getting a created TFJob @ 07/03/23 15:40:21.933
  STEP: prepare pod @ 07/03/23 15:40:21.933
  STEP: update job replica statuses @ 07/03/23 15:40:21.933
  STEP: update job status @ 07/03/23 15:40:21.933
  STEP: updating job status... @ 07/03/23 15:40:21.933
  2023-07-03T15:40:21Z	DEBUG	events	TFJob default/test-bof-0 successfully completed.	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "TFJobSucceeded"}
  2023-07-03T15:40:21Z	DEBUG	events	Created pod: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "SuccessfulCreatePod"}
  2023-07-03T15:40:21Z	DEBUG	events	Created service: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "SuccessfulCreateService"}
  STEP: waiting for updating replicaStatus for workers @ 07/03/23 15:40:21.943
  2023-07-03T15:40:21Z	ERROR	Reconciler error	{"controller": "tfjob-controller", "object": {"name":"test-bof-0","namespace":"default"}, "namespace": "default", "name": "test-bof-0", "reconcileID": "ba1c3f09-182a-4c0e-a33f-38290f7a64db", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
  2023-07-03T15:40:22Z	ERROR	Reconciler error	{"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-vbllp"}, "namespace": "tfjob-ns-vbllp", "name": "test-tfjob", "reconcileID": "981bba90-db95-4994-8374-7299bdf7d9dd", "error": "unable to create services: services \"test-tfjob-chief-0\" is forbidden: unable to create new content in namespace tfjob-ns-vbllp because it is being terminated"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
  2023-07-03T15:40:22Z	DEBUG	events	Error creating: services "test-tfjob-chief-0" is forbidden: unable to create new content in namespace tfjob-ns-vbllp because it is being terminated	{"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-vbllp","name":"test-tfjob","uid":"9a2f3de8-890c-4071-a0ca-40a13fed22e8","apiVersion":"kubeflow.org/v1","resourceVersion":"372"}, "reason": "FailedCreateService"}
  2023-07-03T15:40:23Z	ERROR	Reconciler error	{"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-97b2d"}, "namespace": "tfjob-ns-97b2d", "name": "test-tfjob", "reconcileID": "9b5e3b4d-1154-4962-8a4d-a787579c87c0", "error": "pods \"test-tfjob-worker-0\" is forbidden: unable to create new content in namespace tfjob-ns-97b2d because it is being terminated"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
  	/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
  2023-07-03T15:40:23Z	DEBUG	events	Error creating: pods "test-tfjob-worker-0" is forbidden: unable to create new content in namespace tfjob-ns-97b2d because it is being terminated	{"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-97b2d","name":"test-tfjob","uid":"d9e1b604-4d09-4118-bc2c-70107156a8a5","apiVersion":"kubeflow.org/v1","resourceVersion":"376"}, "reason": "FailedCreatePod"}
  2023-07-03T15:40:24Z	DEBUG	events	Deleted job: test-bof-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"755"}, "reason": "SuccessfulDeleteJob"}
  2023-07-03T15:40:24Z	INFO	TFJob.kubeflow.org "test-bof-0" not found	{"tfjob": {"name":"test-bof-0","namespace":"default"}, "unable to fetch TFJob": "default/test-bof-0"}
  2023-07-03T15:40:24Z	INFO	TFJob.kubeflow.org "test-bof-0" not found	{"tfjob": {"name":"test-bof-0","namespace":"default"}, "unable to fetch TFJob": "default/test-bof-0"}
  STEP: preparing cases failed job with TTL 3s @ 07/03/23 15:40:24.944
  STEP: creating a TFJob @ 07/03/23 15:40:24.944
  STEP: getting a created TFJob @ 07/03/23 15:40:24.949
  STEP: prepare pod @ 07/03/23 15:40:24.949
  STEP: update job replica statuses @ 07/03/23 15:40:24.949
  STEP: update job status @ 07/03/23 15:40:24.949
  STEP: updating job status... @ 07/03/23 15:40:24.949
  2023-07-03T15:40:24Z	DEBUG	events	TFJob default/test-bof-1 has failed because 1 Worker replica(s) failed.	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "TFJobFailed"}
  2023-07-03T15:40:24Z	DEBUG	events	Created pod: test-bof-1-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "SuccessfulCreatePod"}
  2023-07-03T15:40:24Z	DEBUG	events	Created service: test-bof-1-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "SuccessfulCreateService"}
  [FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 07/03/23 15:40:24.966
  << Timeline

  [FAILED] Expected success, but got an error:
      <*errors.StatusError | 0xc0004e2f00>: 
      Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-1": the object has been modified; please apply your changes to the latest version and try again
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-1\": the object has been modified; please apply your changes to the latest version and try again",
              Reason: "Conflict",
              Details: {
                  Name: "test-bof-1",
                  Group: "kubeflow.org",
                  Kind: "tfjobs",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
  In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 07/03/23 15:40:24.966
------------------------------

https://github.com/kubeflow/training-operator/actions/runs/5446286573/jobs/9906794119?pr=1843#step:4:480

tenzen-y avatar Jul 03 '23 15:07 tenzen-y

Similar flaky test:

------------------------------
• [FAILED] [0.022 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:525

  Timeline >>
  STEP: preparing cases succeeded job with TTL 3s @ 07/04/23 22:10:21.22
  STEP: creating a TFJob @ 07/04/23 22:10:21.22
  STEP: getting a created TFJob @ 07/04/23 22:10:21.225
  STEP: prepare pod @ 07/04/23 22:10:21.225
  STEP: update job replica statuses @ 07/04/23 22:10:21.225
  STEP: update job status @ 07/04/23 22:10:21.225
  STEP: updating job status... @ 07/04/23 22:10:21.225
  2023-07-04T22:10:21Z	DEBUG	events	TFJob default/test-bof-0 successfully completed.	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "TFJobSucceeded"}
  2023-07-04T22:10:21Z	DEBUG	events	Created pod: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "SuccessfulCreatePod"}
  2023-07-04T22:10:21Z	DEBUG	events	Created service: test-bof-0-worker-0	{"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "SuccessfulCreateService"}
  [FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:576 @ 07/04/23 22:10:21.241
  << Timeline

  [FAILED] Expected success, but got an error:
      <*errors.StatusError | 0xc0001546e0>: 
      Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-0": the object has been modified; please apply your changes to the latest version and try again
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again",
              Reason: "Conflict",
              Details: {
                  Name: "test-bof-0",
                  Group: "kubeflow.org",
                  Kind: "tfjobs",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 409,
          },
      }
  In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:576 @ 07/04/23 22:10:21.241

https://github.com/kubeflow/training-operator/actions/runs/5458679683/jobs/9934001719?pr=1849#step:4:793

tenzen-y avatar Jul 04 '23 22:07 tenzen-y

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 03 '23 00:10 github-actions[bot]

/lifecycle frozen

tenzen-y avatar Oct 03 '23 04:10 tenzen-y