
Volcano Scheduler Memory leak

Open tianshimoyi opened this issue 2 years ago • 13 comments

What happened:

The memory usage of the Volcano scheduler keeps increasing steadily. After a restart, memory usage drops sharply.

Monitoring before restart

Screenshot 2023-11-16 10 26 47

Monitoring after restart

Screenshot 2023-11-16 10 29 27

Heap memory usage before restart

Screenshot 2023-11-16 10 30 35

Heap memory usage after restart

Screenshot 2023-11-16 10 31 15

tianshimoyi avatar Nov 16 '23 02:11 tianshimoyi

Hi, have you set a memory limit in the container's resource limits, and what log level have you set?

Monokaix avatar Nov 16 '23 12:11 Monokaix

@Monokaix Yes, I set a memory limit on the container. I previously had the log level set to 4. I also suspected that the log level was too high, so I have now changed it to 0 to verify.

tianshimoyi avatar Nov 16 '23 13:11 tianshimoyi

@Monokaix After changing the log level, memory is still growing. The log output rate is not high either.

Screenshot 2023-11-16 21 17 28

Screenshot 2023-11-16 21 18 30

tianshimoyi avatar Nov 16 '23 13:11 tianshimoyi

Can you wait until memory usage hits the limit and check whether memory reclamation happens?

Monokaix avatar Nov 17 '23 01:11 Monokaix

@Monokaix I'll give it a try. I also found a problem. The scheduler has a nodeSelector option, but pods on nodes outside that selection are not filtered out, so the scheduler keeps pod information for every node in the entire cluster in memory, which wastes memory.
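
To illustrate the kind of filtering described above, here is a minimal sketch. It is not Volcano's code: `Pod`, `shouldCache`, and `filterPods` are hypothetical, simplified names, and the choice to keep unscheduled pods is an assumption made for the example.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
// It is not the real Kubernetes or Volcano type.
type Pod struct {
	Name     string
	NodeName string // empty while the pod has not been scheduled yet
}

// shouldCache reports whether a pod belongs in a cache that is restricted
// to the selected nodes. Unscheduled pods are kept (an assumption of this
// sketch) because they may still need to be placed.
func shouldCache(p Pod, selected map[string]bool) bool {
	if p.NodeName == "" {
		return true
	}
	return selected[p.NodeName]
}

// filterPods returns only the pods that pass shouldCache.
func filterPods(pods []Pod, selected map[string]bool) []Pod {
	var kept []Pod
	for _, p := range pods {
		if shouldCache(p, selected) {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	selected := map[string]bool{"node-a": true}
	pods := []Pod{
		{Name: "p1", NodeName: "node-a"},
		{Name: "p2", NodeName: "node-b"}, // on a node outside the selection
		{Name: "p3"},                     // not scheduled yet
	}
	for _, p := range filterPods(pods, selected) {
		fmt.Println(p.Name)
	}
}
```

With a filter like this, the cache size would track the selected nodes rather than the whole cluster.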

tianshimoyi avatar Nov 17 '23 05:11 tianshimoyi

> @Monokaix I'll give it a try. I also found a problem. The scheduler has a nodeSelector option, but pods on nodes outside that selection are not filtered out, so the scheduler keeps pod information for every node in the entire cluster in memory, which wastes memory.

That's right, but I don't think it's the main cause of the high memory usage :). Heavy logging and delayed memory reclamation may cause this behavior, and memory can be reclaimed when usage is about to hit the memory limit.
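
For context on the reclamation point: by default the Go garbage collector is paced by heap growth (GOGC), not by the container's memory limit. Since Go 1.19 a soft limit can be set with the GOMEMLIMIT environment variable or `debug.SetMemoryLimit`. The sketch below is a generic Go example, not something the thread confirms Volcano does. The 900 MiB value and the helper names `setSoftLimit` and `readHeap` are assumptions.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapSummary is a hypothetical helper type for this sketch.
type heapSummary struct {
	InUse    uint64 // bytes in in-use heap spans
	Idle     uint64 // bytes in idle (unused) heap spans
	Released uint64 // bytes of heap memory returned to the OS
}

// readHeap takes a snapshot of the runtime's heap statistics.
func readHeap() heapSummary {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapSummary{InUse: m.HeapInuse, Idle: m.HeapIdle, Released: m.HeapReleased}
}

// setSoftLimit sets the runtime's soft memory limit and returns the limit
// now in effect (a negative argument only queries the current value).
func setSoftLimit(bytes int64) int64 {
	debug.SetMemoryLimit(bytes)
	return debug.SetMemoryLimit(-1)
}

func main() {
	// Assumed value: a soft limit somewhat below a hypothetical 1 GiB
	// container limit, so the GC works harder before the kernel OOM-kills.
	fmt.Println("soft limit (bytes):", setSoftLimit(900<<20))

	// Force a GC and return as much memory to the OS as possible, then
	// compare in-use heap against idle and released memory.
	debug.FreeOSMemory()
	h := readHeap()
	fmt.Printf("in use: %d, idle: %d, released: %d\n", h.InUse, h.Idle, h.Released)
}
```

If in-use heap keeps growing even after a forced GC, that points to live objects being retained rather than to delayed reclamation.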

Monokaix avatar Nov 17 '23 07:11 Monokaix

@Monokaix Thank you very much. I will lower the limit and observe it for a while.

tianshimoyi avatar Nov 17 '23 08:11 tianshimoyi

@Monokaix I reduced the memory limit, but approaching the limit (OOM) did not trigger GC, and memory usage continued to increase.

Screenshot 2023-11-17 18 16 56

Screenshot 2023-11-17 18 17 23

tianshimoyi avatar Nov 17 '23 10:11 tianshimoyi

You can use kill -12 <volcano-scheduler PID> to dump the cache info. See https://github.com/volcano-sh/volcano/pull/3088

Monokaix avatar Nov 21 '23 02:11 Monokaix

> @Monokaix Thank you very much. I will lower the limit and observe it for a while.

How many resources does your cluster have, such as nodes and pods? We should not set the memory limit too low. Meet the basic memory needs first, then lower the memory limit a bit.

Monokaix avatar Nov 21 '23 02:11 Monokaix

Does https://github.com/volcano-sh/volcano/pull/3435 fix your problem?

Monokaix avatar May 14 '24 06:05 Monokaix