feat: add volcano jobs phase metric
fix: #2493
I added metrics for all the job phases exposed in the job API.
const (
	// Pending is the phase that job is pending in the queue, waiting for scheduling decision
	Pending JobPhase = "Pending"
	// Aborting is the phase that job is aborted, waiting for releasing pods
	Aborting JobPhase = "Aborting"
	// Aborted is the phase that job is aborted by user or error handling
	Aborted JobPhase = "Aborted"
	// Running is the phase that minimal available tasks of Job are running
	Running JobPhase = "Running"
	// Restarting is the phase that the Job is restarted, waiting for pod releasing and recreating
	Restarting JobPhase = "Restarting"
	// Completing is the phase that required tasks of job are completed, job starts to clean up
	Completing JobPhase = "Completing"
	// Completed is the phase that all tasks of Job are completed
	Completed JobPhase = "Completed"
	// Terminating is the phase that the Job is terminated, waiting for releasing pods
	Terminating JobPhase = "Terminating"
	// Terminated is the phase that the job finished unexpectedly, e.g. due to events
	Terminated JobPhase = "Terminated"
	// Failed is the phase that the job failed to restart and reached the maximum number of retries
	Failed JobPhase = "Failed"
)
The metric is recorded when jobInformer.Informer() receives an event:
cc.jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    cc.addJob,
	UpdateFunc: cc.updateJob,
	DeleteFunc: cc.deleteJob,
})
Recording the metrics does not take the cache lock, which may produce slightly inaccurate data but improves performance.
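For reference, a minimal sketch of this kind of handler-driven recording, assuming a Prometheus GaugeVec; the metric name job_phase_count and the helper updateJobPhaseMetric are illustrative only, not the PR's actual code:

package jobcontroller

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// jobPhaseCount tracks how many jobs are currently in each phase.
var jobPhaseCount = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Subsystem: "volcano",
	Name:      "job_phase_count",
	Help:      "Number of volcano jobs in each phase.",
}, []string{"phase"})

// updateJobPhaseMetric moves one job from its old phase bucket to its new one.
// It is called from the informer's update handler, so it does not take the
// controller cache lock.
func updateJobPhaseMetric(oldPhase, newPhase string) {
	if oldPhase == newPhase {
		return
	}
	if oldPhase != "" {
		jobPhaseCount.WithLabelValues(oldPhase).Dec()
	}
	jobPhaseCount.WithLabelValues(newPhase).Inc()
}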
But if it traverses all jobs on every update, I think there might be some trouble. Should we switch to incremental updates, or collect the metrics on a fixed schedule?
This implementation is too complex; we just need to add metrics at the state transition points.
In practical scenarios we only care about the final states, success and failure; we don't need the other states because they are not final.
You just need to find the two places where the controller last modifies the status and add the metrics there:
status.State.Phase = vcbatch.Completed
status.State.Phase = vcbatch.Failed
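A rough sketch of that change, assuming a Prometheus counter registered in a metrics package (the counter name job_terminal_phase_count and the helper RecordTerminalPhase are made-up names for illustration):

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// jobTerminalPhaseCount counts jobs that reached a terminal phase,
// labeled by the phase ("Completed" or "Failed").
var jobTerminalPhaseCount = promauto.NewCounterVec(prometheus.CounterOpts{
	Subsystem: "volcano",
	Name:      "job_terminal_phase_count",
	Help:      "Number of volcano jobs that reached a terminal phase.",
}, []string{"phase"})

// RecordTerminalPhase would be called right after the controller sets the phase, e.g.:
//
//	status.State.Phase = vcbatch.Completed
//	metrics.RecordTerminalPhase(string(vcbatch.Completed))
func RecordTerminalPhase(phase string) {
	jobTerminalPhaseCount.WithLabelValues(phase).Inc()
}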
/assign
ok
I tested the metrics, and they were in line with expectations.
squash the commit to one
If I squash the commits into one, it seems like I would need to re-create the PR, or is there another way?
Why would we need to resubmit a PR when using git rebase or git squash? That only rewrites your local git history and has nothing to do with the PR itself.
OK, I have squashed the commits into one.
Thanks for your contribution! This changes a lot, because the vc controller now also introduces metrics.
What I want to say is: can we just put all the metrics in the vc scheduler? Putting metrics into both the scheduler and the controller is not a good approach.
Squashed to one commit. If you want to merge in the master branch, please use rebase, which will make it easier for reviewers to review.
/reopen
@Monokaix: Failed to re-open PR: state cannot be changed. There are no new commits on the Prepmachine4:master branch.
My initial implementation was like that, but it was a bit complicated. That's why it's implemented in the controller instead. So what am I supposed to do? @googs1025
I think the key point of this discussion is: which component should be responsible for changing the status? Since the controller is the component that controls the job status, it should be the one emitting this metric. The scheduler itself doesn't care about the status of jobs or other workloads, which is why I believe it's not suitable to place it in the scheduler. Additionally, each component in Kubernetes has its own metrics.
What I want to say is: can we just put all the metrics in the vc scheduler? Putting metrics into both the scheduler and the controller is not a good approach.
Agree. The controller is only responsible for Volcano jobs. The scheduler is responsible for all kinds of workloads, and it updates the job status as represented in the podgroup.
Since the controller is the component that controls the job status, it should be the one emitting this metric.
It can only cover vc jobs.
The scheduler itself doesn't care about the status of jobs or other workloads.
The scheduler can skip jobs that are already running.
ok, I will change the code again.
Okay, I'm just putting my two cents in here. My reasoning is solely based on the approaches taken by the main Kubernetes repository or other SIGs' controllers. However, other approvers might have different viewpoints. It might not just be about architecture design, as introducing a new data source for the controller could involve adapting the existing monitoring systems, which might be another reason why we wouldn't want to add this to the controller. I'm sorry for any inconvenience this may cause you.
It doesn't matter; I'm also familiar with the controller component now, and maybe that will be helpful for future contributions. Anyway, thanks for your advice.
Could we push forward with this PR? I have a PR depending on these metrics, so I hope this PR can be merged ASAP.
Well, I was busy a few days ago. I will finish this PR asap
The scheduler opens a session every second, and the podgroup is deleted once it becomes completed, so the scheduler cannot observe whether the job succeeded or failed.
Do I need to pull jobs directly from the apiserver to collect the metrics? I don't think that's a reasonable implementation; please give me some advice.
func (ju *jobUpdater) updateJob(index int) {
	job := ju.jobQueue[index]
	ssn := ju.ssn

	job.PodGroup.Status = jobStatus(ssn, job)
	oldStatus, found := ssn.podGroupStatus[job.UID]
	updatePG := !found || isPodGroupStatusUpdated(job.PodGroup.Status, oldStatus)
	klog.Infof("oldStatus: %v, newStatus: %v, updatePG: %v", oldStatus, job.PodGroup.Status, updatePG)
	if _, err := ssn.cache.UpdateJobStatus(job, updatePG); err != nil {
		klog.Errorf("Failed to update job <%s/%s>: %v",
			job.Namespace, job.Name, err)
	}
}
I think we don't need to record the metrics here in the scheduler @Monokaix? We currently intend to record these metrics in the vc-controller, see #3751. At the moment only the capacity and proportion plugins expose the podgroup pending/inqueue/unknown/running count metrics, and those counts would need to be persisted to the queue status, which we are no longer going to do. I also don't think we should record the failed and completed metrics separately in the scheduler.
More context: we have discussed this in https://github.com/volcano-sh/volcano/pull/3650#issuecomment-2275601180.
@Monokaix What do you think about this?
I went back to the earlier approach of collecting the metrics in the controller, so please review the code again.
Hi, thanks for your contributions! Please fix the CI