Optimize jobflow controller to reduce invalid reconciles

Open calvin0327 opened this issue 1 year ago • 3 comments

I found some error messages while using the jobflow feature. I created a JobFlow resource based on these examples: https://github.com/volcano-sh/volcano/blob/master/example/jobflow/JobFlow.yaml https://github.com/volcano-sh/volcano/blob/master/example/jobflow/JobTemplate.yaml

Here are the controller manager logs:

[root@master01 ~]# kubectl logs -n volcano-system volcano-controllers-744bc4796d-jbncj | grep ^E
E0425 10:34:49.690189       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:34:49.707411       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:34:50.321009       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:34:51.395417       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:04.721574       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:04.736015       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:05.568771       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:05.581852       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:20.711708       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:21.731150       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:34.692296       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-b, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-b> is not ready
E0425 10:35:34.695945       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-b, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-b> is not ready
E0425 10:35:34.698687       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-c, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-c> is not ready
E0425 10:35:34.701790       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-c, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-c> is not ready
E0425 10:35:34.707817       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-d, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-d> is not ready
E0425 10:35:34.712693       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-d, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-d> is not ready
E0425 10:35:34.714371       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-e, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-e> is not ready
E0425 10:35:34.715187       1 jobflow_controller_action.go:300] Failed to delete job of JobFlow default/test: jobs.batch.volcano.sh "test-a" not found
E0425 10:35:34.715210       1 jobflow_controller_action.go:46] Failed to delete jobs of JobFlow default/test: jobs.batch.volcano.sh "test-a" not found
E0425 10:35:34.717377       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-e, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-e> is not ready
E0425 10:35:34.723456       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-a, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-a> is not ready
E0425 10:35:34.728548       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-a, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: failed to find job <default/test-a>

This PR addresses only the jobflow controller errors (the jobflow_controller_action.go lines in the logs above).

calvin0327 avatar Apr 26 '24 08:04 calvin0327

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

To complete the pull request process, please assign shinytang6. You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

volcano-sh-bot avatar Apr 26 '24 08:04 volcano-sh-bot

/auto-cc

calvin0327 avatar Apr 26 '24 09:04 calvin0327

@lowang-bh @hwdef PTAL

calvin0327 avatar Apr 28 '24 01:04 calvin0327