
fix scheduler memory leak

Open Wang-Kai opened this issue 1 year ago • 24 comments

Volcano has been fully deployed in Bilibili's transcoding scenario. As we migrated more tasks to Volcano, we also discovered more issues, and memory leakage is one of the most important. As other users have discovered, the memory of the Volcano scheduler gradually increases until it triggers an OOM (Out Of Memory) restart.


During the lifecycle of a Pod, there will be multiple update events. Currently, each pod update triggers the destruction and recreation of tasks, and we have found this to be the root cause of the problem. After we switched to reusing task objects, the scheduler's memory stabilized, even after running for a long time.

This fix is also logical: excessively frequent allocation of new memory can lead to memory fragmentation, and the garbage collector will also be under pressure. Sometimes the Go runtime doesn't even have time to merge small memory blocks that are no longer in use.
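As a rough illustration of the before/after behaviour described above, the sketch below contrasts the two update paths. The package, cache map, and field set are simplified assumptions for illustration, not Volcano's actual scheduler cache API:

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// TaskInfo is a simplified stand-in for the scheduler's task object.
type TaskInfo struct {
	UID    types.UID
	Status v1.PodPhase
	Pod    *v1.Pod
}

// updateByRecreate models the original behaviour: every pod update event
// throws the old TaskInfo away and allocates a fresh one, leaving garbage
// for the GC on each of the many updates in a pod's lifecycle.
func updateByRecreate(cache map[types.UID]*TaskInfo, pod *v1.Pod) {
	delete(cache, pod.UID)
	cache[pod.UID] = &TaskInfo{UID: pod.UID, Status: pod.Status.Phase, Pod: pod}
}

// updateInPlace models the fix: reuse the existing TaskInfo and only
// refresh its fields, so the hot update path performs no new allocation.
func updateInPlace(cache map[types.UID]*TaskInfo, pod *v1.Pod) {
	if ti, ok := cache[pod.UID]; ok {
		ti.Status = pod.Status.Phase
		ti.Pod = pod
		return
	}
	cache[pod.UID] = &TaskInfo{UID: pod.UID, Status: pod.Status.Phase, Pod: pod}
}
```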

Wang-Kai avatar Jul 15 '24 04:07 Wang-Kai

/cc

Hello, I would like to confirm whether my understanding is correct. The main question is: when updating, instead of deleting the original in-memory object and adding a new one, should we update the original object in place and add it back? Is that the intended meaning?

googs1025 avatar Jul 16 '24 00:07 googs1025

/cc

Hello, I would like to confirm whether my understanding is correct. The main question is: when updating, instead of deleting the original in-memory object and adding a new one, should we update the original object in place and add it back? Is that the intended meaning?

I am honored to discuss this issue with you.

Yes, that is my meaning. To be precise, we should reuse the existing TaskInfo struct rather than discarding it and creating a new one. When the cluster is large enough, there are many pods and task states change rapidly, so this involves a significant amount of memory allocation and deallocation.

Wang-Kai avatar Jul 16 '24 03:07 Wang-Kai

/cc Hello, I would like to confirm whether my understanding is correct. The main question is: when updating, instead of deleting the original in-memory object and adding a new one, should we update the original object in place and add it back? Is that the intended meaning?

I am honored to discuss this issue with you.

Yes, that is my meaning. To be precise, we should reuse the existing TaskInfo struct rather than discarding it and creating a new one. When the cluster is large enough, there are many pods and task states change rapidly, so this involves a significant amount of memory allocation and deallocation.

How about using sync.Pool, and then checking whether the metrics change?

lowang-bh avatar Jul 16 '24 13:07 lowang-bh

About sync.Pool, can you describe it in more detail? @lowang-bh

Wang-Kai avatar Jul 17 '24 04:07 Wang-Kai

func (ti *TaskInfo) Clone() *TaskInfo { ... Pod: ti.Pod, ... }

I see that in this function the pod is not deep-copied, which may prevent the garbage collector from collecting it when the pod is deleted while scheduling is in progress.
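For illustration, a minimal self-contained sketch of the pattern being pointed out, using a simplified stand-in TaskInfo rather than Volcano's real struct: the Pod pointer is copied as-is, so every clone keeps the same *v1.Pod reachable.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// TaskInfo is a simplified stand-in used only to show the shallow copy.
type TaskInfo struct {
	Name string
	Pod  *v1.Pod
}

// Clone copies the Pod pointer rather than the Pod itself, so the clone and
// the original share one *v1.Pod. That shared object stays reachable (and
// cannot be garbage-collected) for as long as any clone still references it,
// even after the pod has been deleted from the cluster.
func (ti *TaskInfo) Clone() *TaskInfo {
	return &TaskInfo{
		Name: ti.Name,
		Pod:  ti.Pod, // shallow; ti.Pod.DeepCopy() would create an independent copy
	}
}
```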

molei20021 avatar Jul 19 '24 08:07 molei20021

About sync.Pool, can you describe it in more detail? @lowang-bh

Since JobInfo and TaskInfo are repeatedly created and deleted, we can use a sync.Pool to store unused objects and take one from the pool when needed instead of creating a new one, so as to reduce GC pressure.
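As a rough sketch of that idea (with a simplified stand-in TaskInfo, not Volcano's actual code), a sync.Pool lets the scheduler recycle task objects instead of allocating a new one on every update:

```go
package sketch

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// TaskInfo is a simplified stand-in for the scheduler's task object.
type TaskInfo struct {
	Name string
	Pod  *v1.Pod
}

// taskInfoPool keeps recently released TaskInfo objects so they can be
// reused instead of re-allocated, reducing pressure on the Go GC.
var taskInfoPool = sync.Pool{
	New: func() interface{} { return &TaskInfo{} },
}

// getTaskInfo takes a TaskInfo from the pool and initializes it from a pod.
func getTaskInfo(pod *v1.Pod) *TaskInfo {
	ti := taskInfoPool.Get().(*TaskInfo)
	ti.Name = pod.Name
	ti.Pod = pod
	return ti
}

// putTaskInfo resets a TaskInfo that is no longer needed and returns it to
// the pool; the caller must not use ti after this call.
func putTaskInfo(ti *TaskInfo) {
	*ti = TaskInfo{} // clear fields so the pool does not pin the old pod
	taskInfoPool.Put(ti)
}
```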

lowang-bh avatar Jul 21 '24 07:07 lowang-bh

func (ti *TaskInfo) Clone() *TaskInfo { ... Pod: ti.Pod, ... }

I see that in this function the pod is not deep-copied, which may prevent the garbage collector from collecting it when the pod is deleted while scheduling is in progress.

But the pod will eventually be GC'd after a scheduling cycle ends, right? I think it's not the main problem here :)

Monokaix avatar Aug 09 '24 09:08 Monokaix

I'm still a little confused: if this is the root cause, then once we stop submitting jobs the GC should not be under much pressure, and the memory should eventually be reclaimed. Have you tested this case?

Monokaix avatar Aug 09 '24 09:08 Monokaix

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Feb 01 '25 01:02 stale[bot]

Still need to be considered

hwdef avatar Feb 06 '25 03:02 hwdef

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Apr 25 '25 23:04 stale[bot]

still need

hwdef avatar May 04 '25 18:05 hwdef

I also found the scheduler memory increase problem in our large-scale cluster.

[screenshot: heap_inuse]

fengruotj avatar Oct 16 '25 08:10 fengruotj

/cc

JesseStutler avatar Oct 16 '25 08:10 JesseStutler

In our case, the scheduler's go_memstats_alloc_bytes metric (number of bytes allocated and still in use) has been rising.

fengruotj avatar Oct 16 '25 09:10 fengruotj

What if we follow the pattern of NewTaskInfo and update only the fields that need updating? We could extract a common function.

Thanks @JesseStutler

In practice, almost any field may change. Therefore, in Bilibili, we use TaskInfo.UpdateByPod(*v1.Pod) to update any field that might change. However, this approach has some drawbacks. For example, when a new field is added to Task, we also need to update the UpdateByPod function. Unfortunately, I can’t think of a better solution for now.
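For readers following along, a hedged sketch of what such an UpdateByPod method could look like. The struct and field set below are simplified assumptions, not the real Volcano TaskInfo, which carries many more fields; that is exactly the maintenance burden mentioned above:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// TaskInfo is a simplified placeholder for the scheduler's task object.
type TaskInfo struct {
	Name      string
	Namespace string
	Status    v1.PodPhase
	Pod       *v1.Pod
}

// UpdateByPod refreshes an existing TaskInfo in place from the latest pod
// object, instead of discarding it and building a new TaskInfo. The caveat
// discussed above applies: every field later added to TaskInfo must also be
// refreshed here, or the cached task silently goes stale.
func (ti *TaskInfo) UpdateByPod(pod *v1.Pod) {
	ti.Name = pod.Name
	ti.Namespace = pod.Namespace
	ti.Status = pod.Status.Phase
	ti.Pod = pod
}
```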

Wang-Kai avatar Nov 04 '25 12:11 Wang-Kai

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign thor-wl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

volcano-sh-bot avatar Nov 04 '25 13:11 volcano-sh-bot


CC @JesseStutler

Hey, could you take a look at the CI process and review my code when you have time?

Wang-Kai avatar Nov 05 '25 11:11 Wang-Kai

/gemini review

JesseStutler avatar Nov 06 '25 08:11 JesseStutler

CC @JesseStutler

Hey, could you take a look at the CI process and review my code when you have time?

@Wang-Kai I think it's because the current logic is not correct, since other PRs can pass the CIs. In my local environment, I built the components with your code and it also fails the e2e testing.

JesseStutler avatar Nov 06 '25 08:11 JesseStutler

What if we follow the pattern of NewTaskInfo and update only the fields that need updating? We could extract a common function.

Thanks @JesseStutler

In practice, almost any field may change. Therefore, in Bilibili, we use TaskInfo.UpdateByPod(*v1.Pod) to update any field that might change. However, this approach has some drawbacks. For example, when a new field is added to Task, we also need to update the UpdateByPod function. Unfortunately, I can’t think of a better solution for now.

Correct, so we need to add some comments to tell other contributors that they need to modify both UpdateByPod and NewTaskInfo, but I think it's worth it because pod updates are frequent and they do put pressure on the Go GC.

What do you think? @kingeasternsun @hwdef @lowang-bh @hzxuzhonghu @wangyang0616

JesseStutler avatar Nov 06 '25 09:11 JesseStutler

What if we follow the pattern of NewTaskInfo and update only the fields that need updating? We could extract a common function.

Thanks @JesseStutler In practice, almost any field may change. Therefore, in Bilibili, we use TaskInfo.UpdateByPod(*v1.Pod) to update any field that might change. However, this approach has some drawbacks. For example, when a new field is added to Task, we also need to update the UpdateByPod function. Unfortunately, I can’t think of a better solution for now.

Correct, so we need to add some comments to tell other contributors that they need to modify both UpdateByPod and NewTaskInfo, but I think it's worth it because pod updates are frequent and they do put pressure on the Go GC.

What do you think? @kingeasternsun @hwdef @lowang-bh @hzxuzhonghu @wangyang0616

I agree, it's worth it. We could further refactor the code by extracting a common internal function, updateTaskInfo(), which can be reused by both UpdateByPod and NewTaskInfo.
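A hedged sketch of that refactor, again with a simplified stand-in TaskInfo rather than the real struct: both the constructor and the in-place update delegate to one shared helper, so a new field only has to be handled in one place:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// TaskInfo is a simplified placeholder for the scheduler's task object.
type TaskInfo struct {
	Name      string
	Namespace string
	Status    v1.PodPhase
	Pod       *v1.Pod
}

// updateTaskInfo is the single place where fields are filled in from a pod.
// Adding a new field to TaskInfo only requires touching this helper.
func updateTaskInfo(ti *TaskInfo, pod *v1.Pod) {
	ti.Name = pod.Name
	ti.Namespace = pod.Namespace
	ti.Status = pod.Status.Phase
	ti.Pod = pod
}

// NewTaskInfo builds a fresh TaskInfo for a pod seen for the first time.
func NewTaskInfo(pod *v1.Pod) *TaskInfo {
	ti := &TaskInfo{}
	updateTaskInfo(ti, pod)
	return ti
}

// UpdateByPod refreshes an existing TaskInfo in place on pod update events.
func (ti *TaskInfo) UpdateByPod(pod *v1.Pod) {
	updateTaskInfo(ti, pod)
}
```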

kingeasternsun avatar Nov 07 '25 01:11 kingeasternsun

I also found the scheduler memory increase problem in our large-scale cluster.

[screenshot: heap_inuse]

In the end, we solved this problem: volcano-scheduler memory leak problem

fengruotj avatar Nov 24 '25 02:11 fengruotj