feat: add volcano jobs phase metric

Open Prepmachine4 opened this issue 6 months ago • 51 comments

fix: #2493

This adds a metric for all job phases defined in the job APIs:

const (
	// Pending is the phase that job is pending in the queue, waiting for scheduling decision
	Pending JobPhase = "Pending"
	// Aborting is the phase that job is aborted, waiting for releasing pods
	Aborting JobPhase = "Aborting"
	// Aborted is the phase that job is aborted by user or error handling
	Aborted JobPhase = "Aborted"
	// Running is the phase that minimal available tasks of Job are running
	Running JobPhase = "Running"
	// Restarting is the phase that the Job is restarted, waiting for pod releasing and recreating
	Restarting JobPhase = "Restarting"
	// Completing is the phase that required tasks of job are completed, job starts to clean up
	Completing JobPhase = "Completing"
	// Completed is the phase that all tasks of Job are completed
	Completed JobPhase = "Completed"
	// Terminating is the phase that the Job is terminated, waiting for releasing pods
	Terminating JobPhase = "Terminating"
	// Terminated is the phase that the job is finished unexpectedly, e.g. events
	Terminated JobPhase = "Terminated"
	// Failed is the phase that the job has failed after restarts reached the maximum number of retries.
	Failed JobPhase = "Failed"
)

The metric is recorded whenever jobInformer.Informer() receives an event:

cc.jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    cc.addJob,
	UpdateFunc: cc.updateJob,
	DeleteFunc: cc.deleteJob,
})

Recording the metric does not take the cache lock. This may occasionally produce inaccurate data, but it avoids the locking overhead and improves performance.

However, if every update traverses all jobs, I think that could cause trouble as the number of jobs grows. Should we switch to incremental updates, or recompute the metrics on a fixed interval?
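As a rough illustration of the two options, the sketch below keeps a per-phase count that is adjusted by a delta on each add, update, and delete event, and compares it with a full recount over all jobs (the kind of traversal a periodic refresh would do). All names here (`phaseGauge`, `onAdd`, `onUpdate`, `onDelete`, `recount`) are hypothetical and not taken from this PR. The sketch is single-threaded and only declares three phases for brevity.

```go
package main

import "fmt"

// JobPhase mirrors the string type used by the job API constants.
type JobPhase string

const (
	Pending   JobPhase = "Pending"
	Running   JobPhase = "Running"
	Completed JobPhase = "Completed"
)

// phaseGauge is a hypothetical per-phase job count.
type phaseGauge map[JobPhase]int

// onAdd is O(1): a new job only touches the counter of its own phase.
func (g phaseGauge) onAdd(p JobPhase) { g[p]++ }

// onUpdate moves one job from its old phase to its new phase.
func (g phaseGauge) onUpdate(oldP, newP JobPhase) {
	if oldP == newP {
		return // no transition, nothing to record
	}
	g[oldP]--
	g[newP]++
}

// onDelete removes one job from the phase it was last seen in.
func (g phaseGauge) onDelete(p JobPhase) { g[p]-- }

// recount is the full-traversal alternative: O(number of jobs) per call.
func recount(jobs map[string]JobPhase) phaseGauge {
	g := phaseGauge{}
	for _, p := range jobs {
		g[p]++
	}
	return g
}

func main() {
	jobs := map[string]JobPhase{} // stand-in for the job cache
	g := phaseGauge{}

	// Two jobs are created.
	jobs["job-a"] = Pending
	g.onAdd(Pending)
	jobs["job-b"] = Pending
	g.onAdd(Pending)

	// job-a starts running, then completes.
	jobs["job-a"] = Running
	g.onUpdate(Pending, Running)
	jobs["job-a"] = Completed
	g.onUpdate(Running, Completed)

	// job-b is deleted while still pending.
	delete(jobs, "job-b")
	g.onDelete(Pending)

	full := recount(jobs)
	for _, p := range []JobPhase{Pending, Running, Completed} {
		fmt.Printf("%s incremental=%d full=%d\n", p, g[p], full[p])
	}
}
```

The incremental path does constant work per event but drifts permanently if an event is missed or applied twice. A periodic recount costs a full traversal but corrects itself on every pass, so the two can also be combined.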

Prepmachine4, Aug 03 '24 09:08