Bug in volcano scheduler metrics for queue pod group count

Open hy00nc opened this issue 2 years ago • 7 comments

What happened:

When only one job (pod group) is running on a queue and that job is deleted, the metrics do not get updated. That is, queue_pod_group_running_count is 1 while the job is running, and if the job is deleted for some reason, it remains 1 instead of being updated to 0. The same applies to other metrics such as queue_pod_group_pending_count.

What you expected to happen:

I expect queue_pod_group_running_count to become 0, since the running job has been deleted.

How to reproduce it (as minimally and precisely as possible):

Remove all existing VolcanoJobs or pod groups, if any, then follow the steps below:

  1. Create a VolcanoJob that sleeps indefinitely -> check that queue_pod_group_running_count has been updated to 1.
  2. Delete that VolcanoJob -> check whether queue_pod_group_running_count is still 1.

Environment:

  • Volcano Version: v1.8.0

hy00nc avatar Jan 08 '24 02:01 hy00nc

It seems the metrics update is not triggered when there are zero jobs in here. Perhaps we need to add code that sets a default value of zero for all the queue-related metrics and then triggers the metrics update if there are any jobs.

hy00nc avatar Jan 08 '24 02:01 hy00nc
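
A minimal sketch of the suspected failure mode, for illustration only. It is not Volcano's code: `job`, `gauges`, and `updateFromJobs` are hypothetical stand-ins, and a plain map takes the place of the real metric gauges. The assumption is that per-queue counts are derived from the jobs in the session, so a queue with no jobs is never visited and its last value stays in place.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a pod group and the queue it belongs to.
type job struct {
	queue   string
	running bool
}

// gauges maps a queue name to the last reported running pod group count.
type gauges map[string]int

// updateFromJobs derives per-queue counts from the jobs alone, so a queue
// with no jobs never appears in counts and its gauge is never written.
func updateFromJobs(g gauges, jobs []job) {
	counts := map[string]int{}
	for _, j := range jobs {
		n := counts[j.queue]
		if j.running {
			n++
		}
		counts[j.queue] = n
	}
	for q, n := range counts {
		g[q] = n
	}
}

func main() {
	g := gauges{}

	// One running job in the queue: the gauge is set to 1.
	updateFromJobs(g, []job{{queue: "default", running: true}})
	fmt.Println(g["default"]) // prints 1

	// The job is deleted: the queue is not visited, so the gauge is stale.
	updateFromJobs(g, nil)
	fmt.Println(g["default"]) // prints 1
}
```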

It seems the metrics update is not triggered when there are zero jobs

Yeah, the current code cannot trigger a metrics update if there are no jobs in the queue.

lowang-bh avatar Jan 08 '24 04:01 lowang-bh

It seems the metrics update is not triggered when there are zero jobs

Yeah, the current code cannot trigger a metrics update if there are no jobs in the queue.

Maybe we can move this code to OnSessionClose to cover all queues' metrics: https://github.com/volcano-sh/volcano/blob/67cabf78a6a50751287eecedd8e050c5977ebb40/pkg/scheduler/plugins/proportion/proportion.go#L168-L178

Monokaix avatar Jan 08 '24 06:01 Monokaix

It will still have the same problem, because queues that have no jobs in them will not be visited by for _, attr := range pp.queueOpts

lowang-bh avatar Jan 09 '24 01:01 lowang-bh

It will still have the same problem, because queues that have no jobs in them will not be visited by for _, attr := range pp.queueOpts

If there is no job in the queue, it will be zero, right?

Monokaix avatar Jan 09 '24 02:01 Monokaix

It will still have the same problem, because queues that have no jobs in them will not be visited by for _, attr := range pp.queueOpts

If there is no job in the queue, it will be zero, right?

No, those queues aren't included in pp.queueOpts.

lowang-bh avatar Jan 10 '24 04:01 lowang-bh

It will still have the same problem, because queues that have no jobs in them will not be visited by for _, attr := range pp.queueOpts

If there is no job in the queue, it will be zero, right?

No, those queues aren't included in pp.queueOpts.

I mean that a queue not being present in pp.queueOpts indicates that its job count is zero, so we can set its metrics to zero directly.

Monokaix avatar Jan 10 '24 09:01 Monokaix
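
The last suggestion can be sketched as follows. This is only an illustration under stated assumptions, not a patch for Volcano: `job`, `gauges`, and `updateAllQueues` are hypothetical names, a plain map stands in for the real gauges, and it is assumed that the full list of queue names is available wherever the metrics are updated. Every known queue starts at zero, and only queues that have jobs are raised above it.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a pod group and the queue it belongs to.
type job struct {
	queue   string
	running bool
}

// gauges maps a queue name to the last reported running pod group count.
type gauges map[string]int

// updateAllQueues writes a value for every known queue. A queue without
// jobs keeps the default of zero, so its gauge is reset instead of skipped.
func updateAllQueues(g gauges, allQueues []string, jobs []job) {
	counts := make(map[string]int, len(allQueues))
	for _, q := range allQueues {
		counts[q] = 0 // default for queues that have no jobs
	}
	for _, j := range jobs {
		if j.running {
			counts[j.queue]++
		}
	}
	for q, n := range counts {
		g[q] = n
	}
}

func main() {
	g := gauges{}
	queues := []string{"default"}

	// One running job in the queue: the gauge is set to 1.
	updateAllQueues(g, queues, []job{{queue: "default", running: true}})
	fmt.Println(g["default"]) // prints 1

	// The job is deleted: the queue is still visited and reset to 0.
	updateAllQueues(g, queues, nil)
	fmt.Println(g["default"]) // prints 0
}
```

Deleted queues would need separate handling (for example, removing their label values from the gauge), which this sketch does not cover.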