Flaky test: [It] should delete job when expired time is up
------------------------------
• [FAILED] [0.017 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528
Timeline >>
STEP: preparing cases succeeded job with TTL 3s @ 05/31/23 14:16:41.447
STEP: creating a TFJob @ 05/31/23 14:16:41.447
STEP: getting a created TFJob @ 05/31/23 14:16:41.451
STEP: prepare pod @ 05/31/23 14:16:41.451
STEP: update job replica statuses @ 05/31/23 14:16:41.451
STEP: update job status @ 05/31/23 14:16:41.451
STEP: updating job status... @ 05/31/23 14:16:41.451
2023-05-31T14:16:41Z DEBUG events TFJob default/test-bof-0 successfully completed. {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "TFJobSucceeded"}
2023-05-31T14:16:41Z DEBUG events Created pod: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "SuccessfulCreatePod"}
2023-05-31T14:16:41Z DEBUG events Created service: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"54b81f4e-3d97-4817-9089-cb45cbbbc57d","apiVersion":"kubeflow.org/v1","resourceVersion":"540"}, "reason": "SuccessfulCreateService"}
[FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 05/31/23 14:16:41.464
<< Timeline
[FAILED] Expected success, but got an error:
<*errors.StatusError | 0xc001988960>: {
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again",
Reason: "Conflict",
Details: {
Name: "test-bof-0",
Group: "kubeflow.org",
Kind: "tfjobs",
UID: "",
Causes: nil,
RetryAfterSeconds: 0,
},
Code: 409,
},
}
Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-0": the object has been modified; please apply your changes to the latest version and try again
In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 05/31/23 14:16:41.464
------------------------------
https://github.com/kubeflow/training-operator/actions/runs/5133950363/jobs/9237255986#step:4:208
Similar flaky test:
------------------------------
• [FAILED] [3.037 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:528
Timeline >>
STEP: preparing cases succeeded job with TTL 3s @ 07/03/23 15:40:21.929
STEP: creating a TFJob @ 07/03/23 15:40:21.929
STEP: getting a created TFJob @ 07/03/23 15:40:21.933
STEP: prepare pod @ 07/03/23 15:40:21.933
STEP: update job replica statuses @ 07/03/23 15:40:21.933
STEP: update job status @ 07/03/23 15:40:21.933
STEP: updating job status... @ 07/03/23 15:40:21.933
2023-07-03T15:40:21Z DEBUG events TFJob default/test-bof-0 successfully completed. {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "TFJobSucceeded"}
2023-07-03T15:40:21Z DEBUG events Created pod: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "SuccessfulCreatePod"}
2023-07-03T15:40:21Z DEBUG events Created service: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"751"}, "reason": "SuccessfulCreateService"}
STEP: waiting for updating replicaStatus for workers @ 07/03/23 15:40:21.943
2023-07-03T15:40:21Z ERROR Reconciler error {"controller": "tfjob-controller", "object": {"name":"test-bof-0","namespace":"default"}, "namespace": "default", "name": "test-bof-0", "reconcileID": "ba1c3f09-182a-4c0e-a33f-38290f7a64db", "error": "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2023-07-03T15:40:22Z ERROR Reconciler error {"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-vbllp"}, "namespace": "tfjob-ns-vbllp", "name": "test-tfjob", "reconcileID": "981bba90-db95-4994-8374-7299bdf7d9dd", "error": "unable to create services: services \"test-tfjob-chief-0\" is forbidden: unable to create new content in namespace tfjob-ns-vbllp because it is being terminated"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2023-07-03T15:40:22Z DEBUG events Error creating: services "test-tfjob-chief-0" is forbidden: unable to create new content in namespace tfjob-ns-vbllp because it is being terminated {"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-vbllp","name":"test-tfjob","uid":"9a2f3de8-890c-4071-a0ca-40a13fed22e8","apiVersion":"kubeflow.org/v1","resourceVersion":"372"}, "reason": "FailedCreateService"}
2023-07-03T15:40:23Z ERROR Reconciler error {"controller": "tfjob-controller", "object": {"name":"test-tfjob","namespace":"tfjob-ns-97b2d"}, "namespace": "tfjob-ns-97b2d", "name": "test-tfjob", "reconcileID": "9b5e3b4d-1154-4962-8a4d-a787579c87c0", "error": "pods \"test-tfjob-worker-0\" is forbidden: unable to create new content in namespace tfjob-ns-97b2d because it is being terminated"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/home/runner/work/training-operator/training-operator/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2023-07-03T15:40:23Z DEBUG events Error creating: pods "test-tfjob-worker-0" is forbidden: unable to create new content in namespace tfjob-ns-97b2d because it is being terminated {"type": "Warning", "object": {"kind":"TFJob","namespace":"tfjob-ns-97b2d","name":"test-tfjob","uid":"d9e1b604-4d09-4118-bc2c-70107156a8a5","apiVersion":"kubeflow.org/v1","resourceVersion":"376"}, "reason": "FailedCreatePod"}
2023-07-03T15:40:24Z DEBUG events Deleted job: test-bof-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"9ae58d27-8c03-483c-bcb9-075186e331c5","apiVersion":"kubeflow.org/v1","resourceVersion":"755"}, "reason": "SuccessfulDeleteJob"}
2023-07-03T15:40:24Z INFO TFJob.kubeflow.org "test-bof-0" not found {"tfjob": {"name":"test-bof-0","namespace":"default"}, "unable to fetch TFJob": "default/test-bof-0"}
2023-07-03T15:40:24Z INFO TFJob.kubeflow.org "test-bof-0" not found {"tfjob": {"name":"test-bof-0","namespace":"default"}, "unable to fetch TFJob": "default/test-bof-0"}
STEP: preparing cases failed job with TTL 3s @ 07/03/23 15:40:24.944
STEP: creating a TFJob @ 07/03/23 15:40:24.944
STEP: getting a created TFJob @ 07/03/23 15:40:24.949
STEP: prepare pod @ 07/03/23 15:40:24.949
STEP: update job replica statuses @ 07/03/23 15:40:24.949
STEP: update job status @ 07/03/23 15:40:24.949
STEP: updating job status... @ 07/03/23 15:40:24.949
2023-07-03T15:40:24Z DEBUG events TFJob default/test-bof-1 has failed because 1 Worker replica(s) failed. {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "TFJobFailed"}
2023-07-03T15:40:24Z DEBUG events Created pod: test-bof-1-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "SuccessfulCreatePod"}
2023-07-03T15:40:24Z DEBUG events Created service: test-bof-1-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-1","uid":"1e0d4ced-bbc1-497c-b47b-cc3abc6633ce","apiVersion":"kubeflow.org/v1","resourceVersion":"760"}, "reason": "SuccessfulCreateService"}
[FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 07/03/23 15:40:24.966
<< Timeline
[FAILED] Expected success, but got an error:
<*errors.StatusError | 0xc0004e2f00>:
Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-1": the object has been modified; please apply your changes to the latest version and try again
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-1\": the object has been modified; please apply your changes to the latest version and try again",
Reason: "Conflict",
Details: {
Name: "test-bof-1",
Group: "kubeflow.org",
Kind: "tfjobs",
UID: "",
Causes: nil,
RetryAfterSeconds: 0,
},
Code: 409,
},
}
In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:579 @ 07/03/23 15:40:24.966
------------------------------
https://github.com/kubeflow/training-operator/actions/runs/5446286573/jobs/9906794119?pr=1843#step:4:480
Similar flaky test:
------------------------------
• [FAILED] [0.022 seconds]
TFJob controller Test TTL Seconds After Finished [It] should delete job when expired time is up
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:525
Timeline >>
STEP: preparing cases succeeded job with TTL 3s @ 07/04/23 22:10:21.22
STEP: creating a TFJob @ 07/04/23 22:10:21.22
STEP: getting a created TFJob @ 07/04/23 22:10:21.225
STEP: prepare pod @ 07/04/23 22:10:21.225
STEP: update job replica statuses @ 07/04/23 22:10:21.225
STEP: update job status @ 07/04/23 22:10:21.225
STEP: updating job status... @ 07/04/23 22:10:21.225
2023-07-04T22:10:21Z DEBUG events TFJob default/test-bof-0 successfully completed. {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "TFJobSucceeded"}
2023-07-04T22:10:21Z DEBUG events Created pod: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "SuccessfulCreatePod"}
2023-07-04T22:10:21Z DEBUG events Created service: test-bof-0-worker-0 {"type": "Normal", "object": {"kind":"TFJob","namespace":"default","name":"test-bof-0","uid":"80e11775-0312-4aa7-a6f3-6c25927dc4d6","apiVersion":"kubeflow.org/v1","resourceVersion":"1143"}, "reason": "SuccessfulCreateService"}
[FAILED] in [It] - /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:576 @ 07/04/23 22:10:21.241
<< Timeline
[FAILED] Expected success, but got an error:
<*errors.StatusError | 0xc0001546e0>:
Operation cannot be fulfilled on tfjobs.kubeflow.org "test-bof-0": the object has been modified; please apply your changes to the latest version and try again
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Operation cannot be fulfilled on tfjobs.kubeflow.org \"test-bof-0\": the object has been modified; please apply your changes to the latest version and try again",
Reason: "Conflict",
Details: {
Name: "test-bof-0",
Group: "kubeflow.org",
Kind: "tfjobs",
UID: "",
Causes: nil,
RetryAfterSeconds: 0,
},
Code: 409,
},
}
In [It] at: /home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/job_test.go:576 @ 07/04/23 22:10:21.241
https://github.com/kubeflow/training-operator/actions/runs/5458679683/jobs/9934001719?pr=1849#step:4:793
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen