
feat: add volcano jobs phase metric

Prepmachine4 opened this pull request 1 year ago • 51 comments

fix: #2493

I added metrics for all the job phases defined in the job API:

const (
	// Pending is the phase that job is pending in the queue, waiting for scheduling decision
	Pending JobPhase = "Pending"
	// Aborting is the phase that job is aborted, waiting for releasing pods
	Aborting JobPhase = "Aborting"
	// Aborted is the phase that job is aborted by user or error handling
	Aborted JobPhase = "Aborted"
	// Running is the phase that minimal available tasks of Job are running
	Running JobPhase = "Running"
	// Restarting is the phase that the Job is restarted, waiting for pod releasing and recreating
	Restarting JobPhase = "Restarting"
	// Completing is the phase that required tasks of job are completed, job starts to clean up
	Completing JobPhase = "Completing"
	// Completed is the phase that all tasks of Job are completed
	Completed JobPhase = "Completed"
	// Terminating is the phase that the Job is terminated, waiting for releasing pods
	Terminating JobPhase = "Terminating"
	// Terminated is the phase that the job finished unexpectedly, e.g. due to events
	Terminated JobPhase = "Terminated"
	// Failed is the phase that the job failed to restart after reaching the maximum number of retries.
	Failed JobPhase = "Failed"
)
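
For example, a per-phase gauge could be defined with prometheus/client_golang. This is a minimal sketch; the metric name volcano_job_phase_count and the phase label are illustrative, not necessarily what this PR uses:

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// jobPhaseCount tracks how many jobs are currently in each phase.
// The metric name and label are hypothetical placeholders.
var jobPhaseCount = promauto.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "volcano_job_phase_count",
		Help: "Number of volcano jobs in each phase.",
	},
	[]string{"phase"},
)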

The metrics are recorded when jobInformer.Informer() receives an event:

cc.jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    cc.addJob,
	UpdateFunc: cc.updateJob,
	DeleteFunc: cc.deleteJob,
})
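
The handlers could then adjust the gauge as jobs appear and change phase. A sketch reusing the hypothetical jobPhaseCount above; type assertions are simplified and the existing enqueue logic is elided:

import (
	vcbatch "volcano.sh/apis/pkg/apis/batch/v1alpha1"
)

// addJob increments the gauge for the phase a new job arrives in.
func (cc *jobcontroller) addJob(obj interface{}) {
	job, ok := obj.(*vcbatch.Job)
	if !ok {
		return
	}
	jobPhaseCount.WithLabelValues(string(job.Status.State.Phase)).Inc()
	// ... existing enqueue logic ...
}

// updateJob moves the count from the old phase to the new one.
func (cc *jobcontroller) updateJob(oldObj, newObj interface{}) {
	oldJob, ok1 := oldObj.(*vcbatch.Job)
	newJob, ok2 := newObj.(*vcbatch.Job)
	if !ok1 || !ok2 {
		return
	}
	if oldJob.Status.State.Phase != newJob.Status.State.Phase {
		jobPhaseCount.WithLabelValues(string(oldJob.Status.State.Phase)).Dec()
		jobPhaseCount.WithLabelValues(string(newJob.Status.State.Phase)).Inc()
	}
	// ... existing enqueue logic ...
}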

Recording the metrics does not take the cache lock, which may produce slightly inaccurate data but improves performance.

However, if it traverses all jobs on every update, I think there could be trouble. Should we switch to incremental updates, or to collecting the metrics on a schedule at fixed intervals?
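
A scheduled variant would instead rebuild the gauge from the informer's local cache at a fixed interval. A sketch, assuming a cc.jobLister accessor and a 30-second period (both illustrative):

import (
	"time"

	"k8s.io/apimachinery/pkg/labels"
)

// runMetricsLoop periodically recomputes the per-phase counts from the
// local informer cache instead of reacting to every single event.
func (cc *jobcontroller) runMetricsLoop(stopCh <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			jobs, err := cc.jobLister.List(labels.Everything())
			if err != nil {
				continue
			}
			jobPhaseCount.Reset()
			for _, job := range jobs {
				jobPhaseCount.WithLabelValues(string(job.Status.State.Phase)).Inc()
			}
		case <-stopCh:
			return
		}
	}
}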

Prepmachine4 avatar Aug 03 '24 09:08 Prepmachine4

This implementation is too complex; we just need to add metrics at the state transition points.

googs1025 avatar Aug 03 '24 10:08 googs1025

In practical scenarios we only care about the terminal states, success and failure; the other states are transient, so we don't need metrics for them.

googs1025 avatar Aug 03 '24 10:08 googs1025

You just need to find the two places where the controller last modifies the status and make the changes there: status.State.Phase = vcbatch.Completed and status.State.Phase = vcbatch.Failed.
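
For example, something like this at the two terminal transitions (the counter names are illustrative, not existing volcano metrics):

// On the success path:
status.State.Phase = vcbatch.Completed
jobCompletedTotal.Inc() // hypothetical prometheus.Counter

// On the failure path:
status.State.Phase = vcbatch.Failed
jobFailedTotal.Inc() // hypothetical prometheus.Counter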

googs1025 avatar Aug 03 '24 10:08 googs1025

/assign

googs1025 avatar Aug 03 '24 10:08 googs1025

You just need to find the two places where the controller last modifies the status and make the changes there: status.State.Phase = vcbatch.Completed and status.State.Phase = vcbatch.Failed.

ok

Prepmachine4 avatar Aug 03 '24 11:08 Prepmachine4

[screenshots: exported job phase metrics]

I tested the metrics, and they were in line with expectations.

Prepmachine4 avatar Aug 04 '24 15:08 Prepmachine4

squash the commits into one

googs1025 avatar Aug 04 '24 23:08 googs1025

squash the commits into one

If I squash the commits into one, it seems like I need to re-create the PR, or is there another way?

Prepmachine4 avatar Aug 05 '24 06:08 Prepmachine4

squash the commits into one

If I squash the commits into one, it seems like I need to re-create the PR, or is there another way?

Why would you need to resubmit the PR when using git rebase or git squash? That just tidies up your local git log; it has nothing to do with the PR itself.
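
For reference, a typical squash flow on the PR branch looks like this (N and the branch name are placeholders):

git rebase -i HEAD~N                    # mark all but the first commit as "squash"
git push --force origin <your-branch>   # the open PR picks up the rewritten branch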

googs1025 avatar Aug 05 '24 06:08 googs1025

squash the commits into one

If I squash the commits into one, it seems like I need to re-create the PR, or is there another way?

Why would you need to resubmit the PR when using git rebase or git squash? That just tidies up your local git log; it has nothing to do with the PR itself.

OK, I have squashed the commits into one.

Prepmachine4 avatar Aug 05 '24 06:08 Prepmachine4

Thanks for your contribution! This changes a lot, because the vc controller now also introduces metrics.

Monokaix avatar Aug 08 '24 03:08 Monokaix

Thanks for your contribution! This changes a lot, because the vc controller now also introduces metrics.

What I want to say is: can we just put all the metrics in the vc scheduler? Putting metrics into both the scheduler and the controller is not a good approach.

Monokaix avatar Aug 08 '24 07:08 Monokaix

squash the commits into one

If I squash the commits into one, it seems like I need to re-create the PR, or is there another way?

Why would you need to resubmit the PR when using git rebase or git squash? That just tidies up your local git log; it has nothing to do with the PR itself.

OK, I have squashed the commits into one.

Squashed to one commit. If you want to bring in changes from the master branch, please use rebase instead of merge; it makes the review easier.

googs1025 avatar Aug 08 '24 08:08 googs1025

/reopen

Monokaix avatar Aug 08 '24 09:08 Monokaix

@Monokaix: Failed to re-open PR: state cannot be changed. There are no new commits on the Prepmachine4:master branch.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

volcano-sh-bot avatar Aug 08 '24 09:08 volcano-sh-bot

Thanks for your contribution! This changes a lot, because the vc controller now also introduces metrics.

What I want to say is: can we just put all the metrics in the vc scheduler? Putting metrics into both the scheduler and the controller is not a good approach.

My initial implementation was like this, but it was a bit complicated. That's why it's implemented in the controller instead. So what am I supposed to do? @googs1025

Prepmachine4 avatar Aug 08 '24 09:08 Prepmachine4

I think the key point of this discussion is: which component should be responsible for changing the status? Since the controller is the component that controls job status, it should be the one emitting this metric. The scheduler itself doesn't care about the status of jobs or other workloads, which is why I believe the scheduler is not a suitable place for it. Additionally, each component in Kubernetes has its own metrics.

googs1025 avatar Aug 08 '24 11:08 googs1025

What I want to say is: can we just put all the metrics in the vc scheduler? Putting metrics into both the scheduler and the controller is not a good approach.

Agreed. The controller is only responsible for volcano jobs. The scheduler is responsible for all kinds of workloads, and it updates the job status as represented in the podgroup.

lowang-bh avatar Aug 08 '24 13:08 lowang-bh

Since the controller is the component that controls job status, it should be the one emitting this metric.

It can only cover vc jobs.

The scheduler itself doesn't care about the status of jobs or other workloads

The scheduler can skip jobs that are already running.

lowang-bh avatar Aug 08 '24 13:08 lowang-bh

ok, I will change the code again.

Prepmachine4 avatar Aug 08 '24 15:08 Prepmachine4

ok, I will change the code again.

Okay, I'm just putting my two cents in here. My reasoning is solely based on the approaches taken by the main Kubernetes repository or other SIGs' controllers. However, other approvers might have different viewpoints. It might not just be about architecture design, as introducing a new data source for the controller could involve adapting the existing monitoring systems, which might be another reason why we wouldn't want to add this to the controller. I'm sorry for any inconvenience this may cause you.

googs1025 avatar Aug 09 '24 02:08 googs1025

ok, I will change the code again.

Okay, I'm just putting my two cents in here. My reasoning is solely based on the approaches taken by the main Kubernetes repository or other SIGs' controllers. However, other approvers might have different viewpoints. It might not just be about architecture design, as introducing a new data source for the controller could involve adapting the existing monitoring systems, which might be another reason why we wouldn't want to add this to the controller. I'm sorry for any inconvenience this may cause you.

It doesn't matter; I'm now also familiar with the controller component, and maybe that will be helpful for future contributions. Anyway, thanks for your advice.

Prepmachine4 avatar Aug 09 '24 03:08 Prepmachine4

Could we push forward with this PR? I have a PR depending on these metrics, so I hope this PR can be merged ASAP.

JesseStutler avatar Sep 26 '24 04:09 JesseStutler

Could we push forward with this PR? I have a PR depending on these metrics, so I hope this PR can be merged ASAP.

Well, I was busy the last few days. I will finish this PR ASAP.

Prepmachine4 avatar Sep 27 '24 03:09 Prepmachine4

The scheduler opens a session every second, and the podgroup is deleted once it becomes completed, making it impossible for the scheduler to tell whether the job succeeded or failed.

Do I need to pull jobs directly from the apiserver to collect metrics? I don't think that is a reasonable implementation; please give me some advice.

func (ju *jobUpdater) updateJob(index int) {
	job := ju.jobQueue[index]
	ssn := ju.ssn

	// Recompute the podgroup status for this job from the current session.
	job.PodGroup.Status = jobStatus(ssn, job)
	// Only push an update to the apiserver when the status actually changed
	// since the session opened.
	oldStatus, found := ssn.podGroupStatus[job.UID]
	updatePG := !found || isPodGroupStatusUpdated(job.PodGroup.Status, oldStatus)
	klog.Infof("oldStatus: %v, newStatus: %v, updatePG: %v", oldStatus, job.PodGroup.Status, updatePG)
	if _, err := ssn.cache.UpdateJobStatus(job, updatePG); err != nil {
		klog.Errorf("Failed to update job <%s/%s>: %v",
			job.Namespace, job.Name, err)
	}
}

Prepmachine4 avatar Oct 08 '24 08:10 Prepmachine4

I think we don't need to record these metrics in the scheduler, @Monokaix? We currently intend to record them in vc-controller; see #3751. Right now only the capacity and proportion plugins expose the podgroup pending/inqueue/unknown/running count metrics, and those counts would need to be persisted to the queue status, which we are no longer going to do. I also don't think we should record the failed and completed metrics separately in the scheduler.

JesseStutler avatar Oct 08 '24 12:10 JesseStutler

I think we don't need to record these metrics in the scheduler, @Monokaix? We currently intend to record them in vc-controller; see #3751. Right now only the capacity and proportion plugins expose the podgroup pending/inqueue/unknown/running count metrics, and those counts would need to be persisted to the queue status, which we are no longer going to do. I also don't think we should record the failed and completed metrics separately in the scheduler.

More context: we discussed this at https://github.com/volcano-sh/volcano/pull/3650#issuecomment-2275601180

googs1025 avatar Oct 08 '24 12:10 googs1025

@Monokaix What do you think about this?

Prepmachine4 avatar Oct 11 '24 03:10 Prepmachine4

I went back to my earlier approach of collecting the metrics in the controller, so please review the code again.

Prepmachine4 avatar Oct 11 '24 08:10 Prepmachine4

Hi, thanks for your contributions! Please fix the CI.

hwdef avatar Oct 11 '24 15:10 hwdef