fix scheduler memory leak
Volcano has been fully deployed in Bilibili's transcoding scenario. As we migrated more tasks to Volcano, we discovered more issues, and a memory leak is one of the most significant: as other users have also reported, the Volcano scheduler's memory usage keeps growing until it triggers an OOM (Out Of Memory) restart.
A Pod goes through many update events during its lifecycle. Currently, every pod update destroys the corresponding task object and recreates it, and we found this to be the root cause of the problem. After we switched to reusing the existing task objects, the scheduler's memory stayed stable even after running for a long time.
The fix also makes sense in theory: allocating new memory at such a high rate causes fragmentation and puts pressure on the garbage collector, and the Go runtime may not even keep up with reclaiming the small memory blocks that are no longer in use.
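To make the change concrete, here is a minimal sketch of the before/after pattern, using a simplified task type and a hypothetical updatePod handler rather than the actual scheduler code: instead of deleting the cached task and allocating a new one on every pod update, the existing object is refreshed in place.

```go
package cache

import v1 "k8s.io/api/core/v1"

// taskInfo is a simplified stand-in for the scheduler's task object.
type taskInfo struct {
	Name   string
	Status v1.PodPhase
	Pod    *v1.Pod
}

type schedulerCache struct {
	tasks map[string]*taskInfo // keyed by pod UID
}

// Old behaviour: every pod update drops the cached task and allocates a new one.
func (c *schedulerCache) updatePodOld(oldPod, newPod *v1.Pod) {
	delete(c.tasks, string(oldPod.UID))
	c.tasks[string(newPod.UID)] = &taskInfo{ // fresh allocation on every update event
		Name:   newPod.Name,
		Status: newPod.Status.Phase,
		Pod:    newPod,
	}
}

// New behaviour: reuse the existing task and only refresh the pod-derived fields.
func (c *schedulerCache) updatePod(oldPod, newPod *v1.Pod) {
	if ti, ok := c.tasks[string(oldPod.UID)]; ok {
		ti.Name = newPod.Name
		ti.Status = newPod.Status.Phase
		ti.Pod = newPod
		return
	}
	// First time we see this pod: fall back to creating the task.
	c.tasks[string(newPod.UID)] = &taskInfo{
		Name:   newPod.Name,
		Status: newPod.Status.Phase,
		Pod:    newPod,
	}
}
```

With reuse, the allocation rate no longer scales with the pod update rate, which is what removes the GC churn described above.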
/cc
Hello, I would like to confirm whether my understanding is correct. The main question is: when updating, instead of deleting the original in-memory object and adding a new one, should we update the original object and then add it back? Is that the intended meaning?
I am honored to discuss this issue with you.
Yes, that is what I mean. To be precise, we should reuse the existing TaskInfo struct rather than discarding it and creating a new one, because when the cluster is large enough there are many pods and task states change rapidly, which involves a significant amount of memory allocation and deallocation.
How about using sync.Pool, and then checking whether the metrics change?
About sync.Pool, can you describe it in more detail? @lowang-bh
In func (ti *TaskInfo) Clone() *TaskInfo, the Pod field is copied as Pod: ti.Pod.
I see that in this function the pod is not deep-copied, which may prevent the garbage collector from collecting the pod when it is deleted while scheduling is still in progress.
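For readers following along, the quoted code is roughly the shallow-copy behaviour sketched below (a simplified TaskInfo, not the real struct). Because the clone shares the same *v1.Pod pointer, the Pod object stays reachable for as long as any clone, for example one held by a scheduling session snapshot, is still referenced.

```go
package api

import v1 "k8s.io/api/core/v1"

// TaskInfo here is illustrative only; the real struct has many more fields.
type TaskInfo struct {
	Name   string
	Status v1.PodPhase
	Pod    *v1.Pod
}

// Clone copies the Pod pointer rather than deep-copying it, so the clone and the
// original share one *v1.Pod. The Pod can only be garbage-collected once every
// TaskInfo (and clone) referencing it is gone.
func (ti *TaskInfo) Clone() *TaskInfo {
	return &TaskInfo{
		Name:   ti.Name,
		Status: ti.Status,
		Pod:    ti.Pod, // shallow copy: shared pointer, not ti.Pod.DeepCopy()
	}
}
```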
Since JobInfo and TaskInfo are repeatedly created and deleted, we can use a sync.Pool to store unused objects and take one from the pool when needed instead of creating a new one, so as to reduce GC pressure.
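For reference, a minimal sketch of how a sync.Pool could be applied here; getTaskInfo and putTaskInfo are hypothetical names, and the real TaskInfo has many more fields:

```go
package api

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// TaskInfo is simplified for illustration.
type TaskInfo struct {
	Name   string
	Status v1.PodPhase
	Pod    *v1.Pod
}

// taskInfoPool recycles TaskInfo objects so we do not allocate a fresh one for
// every pod event, which reduces the amount of garbage the GC has to scan.
var taskInfoPool = sync.Pool{
	New: func() any { return &TaskInfo{} },
}

// getTaskInfo takes a recycled TaskInfo from the pool (or a new one) and fills
// it from the pod.
func getTaskInfo(pod *v1.Pod) *TaskInfo {
	ti := taskInfoPool.Get().(*TaskInfo)
	ti.Name = pod.Name
	ti.Status = pod.Status.Phase
	ti.Pod = pod
	return ti
}

// putTaskInfo clears the fields (so the pooled object does not keep the old Pod
// reachable) and returns the TaskInfo to the pool.
func putTaskInfo(ti *TaskInfo) {
	*ti = TaskInfo{}
	taskInfoPool.Put(ti)
}
```

One caveat: this only helps if objects are actually returned to the pool between scheduling cycles, and they must be cleared before being returned so they do not pin old Pods in memory.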
But the pod will eventually be GCed after the scheduling cycle ends, right? I think it's not the main problem here :)
I'm still a little confused: if this is the root cause, then once we stop submitting jobs the GC should not be under much pressure and the memory should eventually be reclaimed. Have you tested this case?
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Still need to be considered
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
still need
I also ran into the scheduler memory growth problem in our large-scale cluster.
/cc
In our case, the scheduler's go_memstats_alloc_bytes metric (number of bytes allocated and still in use) has kept rising.
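For anyone reproducing this, go_memstats_alloc_bytes from the Prometheus Go collector should correspond to runtime.MemStats.HeapAlloc (heap bytes allocated and still in use), so the same trend can be watched from inside a test process with a small sketch like this:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Print the in-use heap size periodically; in a leaking process this number
// keeps climbing even when the workload is idle.
func main() {
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		fmt.Printf("heap in use: %d MiB\n", m.HeapAlloc/(1024*1024))
		time.Sleep(30 * time.Second)
	}
}
```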
What if we follow the pattern of NewTaskInfo and update only the fields that need updating? We could extract a common function.
Thanks @JesseStutler
In practice, almost any field may change. Therefore, in Bilibili, we use TaskInfo.UpdateByPod(*v1.Pod) to update any field that might change. However, this approach has some drawbacks. For example, when a new field is added to Task, we also need to update the UpdateByPod function. Unfortunately, I can’t think of a better solution for now.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign thor-wl for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
CC @JesseStutler
Hey, could you take a look at the CI process and review my code when you have time?
/gemini review
@Wang-Kai I think it's because the current logic is not correct, since other PRs can pass CI. In my local environment, I built the components with your code and they also fail the e2e tests.
Correct, so we need to add some comments telling other contributors that they need to modify both UpdateByPod and NewTaskInfo, but I think it's worth it, because pod updates are frequent and they do put pressure on the Go GC.
What do you think? @kingeasternsun @hwdef @lowang-bh @hzxuzhonghu @wangyang0616
I agree, it's worth it. We could further refactor the code by extracting a common internal function, updateTaskInfo(), which can be reused by both UpdateByPod and NewTaskInfo.
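As a rough sketch of that refactor (simplified fields; UpdateByPod and updateTaskInfo are the proposed additions, not existing upstream functions), both the constructor and the in-place update would share one helper, so a newly added field only has to be handled in one place:

```go
package api

import v1 "k8s.io/api/core/v1"

// TaskInfo is simplified for illustration; the real struct has many more fields.
type TaskInfo struct {
	Name      string
	Namespace string
	Status    v1.PodPhase
	Pod       *v1.Pod
}

// updateTaskInfo is the shared helper: it (re)derives every pod-dependent field.
func updateTaskInfo(ti *TaskInfo, pod *v1.Pod) {
	ti.Name = pod.Name
	ti.Namespace = pod.Namespace
	ti.Status = pod.Status.Phase
	ti.Pod = pod
}

// NewTaskInfo allocates a task for a pod seen for the first time.
func NewTaskInfo(pod *v1.Pod) *TaskInfo {
	ti := &TaskInfo{}
	updateTaskInfo(ti, pod)
	return ti
}

// UpdateByPod refreshes an existing task in place on a pod update event,
// avoiding a new allocation per update.
func (ti *TaskInfo) UpdateByPod(pod *v1.Pod) {
	updateTaskInfo(ti, pod)
}
```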
In the end, we solved this problem: volcano-scheduler memory leak problem