[Bug]: ReplayBuffer cannot detect the real memory constraints in Kubernetes.

Open LarryLiZimo opened this issue 2 months ago • 6 comments

🐛 Bug

In stable_baselines3/common/buffers.py, around line 200, psutil is used to determine the available memory, but it fails to detect the memory that is actually available inside a container.

# in the `__init__` of `ReplayBuffer`
if psutil is not None: 
    mem_available = psutil.virtual_memory().available

My server has over 700 G of memory but only 64 G available in my k8s container. I ran a training script, which is estimated to require 30 G memory. As a consequence, it was terminated silently. If Stable-Baselines3 had warned about this earlier, other users could be spared this problem.

To Reproduce

No response

Relevant log output / Error message


System Info

No response

Checklist

  • [x] My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • [x] I have checked that there is no similar issue in the repo
  • [x] I have read the documentation
  • [x] I have provided a minimal and working example to reproduce the bug
  • [x] I've used the markdown code blocks for both code and stack traces.

LarryLiZimo avatar Nov 01 '25 10:11 LarryLiZimo

Hello,

but only 64 G available in my k8s container. I ran a training script, which is estimated to require 30 G memory.

if there is 64GB available and you are using 30GB, it should be fine, no?

araffin avatar Nov 01 '25 10:11 araffin

To clarify, another script running in the same container already takes 40 G, so less than 30 G is actually free.

LarryLiZimo avatar Nov 01 '25 10:11 LarryLiZimo

It is ok to decrease the buffer_size but not being notified of potential "Out of Memory" is bad.
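To make the arithmetic in this thread concrete, here is a minimal sketch of the kind of check that could surface the problem before training starts. It is not Stable-Baselines3 code: the function name `buffer_exceeds_memory` and its parameters are hypothetical, and the 700 G host figure is used only for illustration.

```python
import warnings
from typing import Optional

GIB = 1024**3


def buffer_exceeds_memory(
    required_bytes: int,
    host_available_bytes: int,
    container_available_bytes: Optional[int] = None,
) -> bool:
    """Warn and return True if the buffer would not fit in the effective free memory.

    Hypothetical helper: the effective free memory is the smaller of what the
    host reports and what the container limit leaves (when that is known).
    """
    effective = host_available_bytes
    if container_available_bytes is not None:
        effective = min(host_available_bytes, container_available_bytes)
    if required_bytes > effective:
        warnings.warn(
            f"Replay buffer needs {required_bytes / GIB:.1f} GiB but only "
            f"{effective / GIB:.1f} GiB appears to be available."
        )
        return True
    return False


# Figures from this thread: 64 G container limit, 40 G already in use, 30 G needed.
container_free = 64 * GIB - 40 * GIB  # 24 G left inside the container
# Looking only at the host (illustrative 700 G) reports no problem;
# taking the container into account does.
host_only = buffer_exceeds_memory(30 * GIB, 700 * GIB)
with_container = buffer_exceeds_memory(30 * GIB, 700 * GIB, container_free)
```

With only the host figure the check passes, which matches the silent termination described above; once the 24 G left in the container is taken into account, the 30 G buffer no longer fits and a warning is raised.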

LarryLiZimo avatar Nov 01 '25 10:11 LarryLiZimo

Do you have a proposed fix?

araffin avatar Nov 01 '25 10:11 araffin

By reading "/sys/fs/cgroup/memory.max" and "/sys/fs/cgroup/memory.current", I can manually check the amount of free memory. On older Linux systems (cgroup v1), read "/sys/fs/cgroup/memory/memory.limit_in_bytes" instead. This method works fine on my Linux machine.
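A rough sketch of that manual check is below. The helper names and the `root` parameter are hypothetical, and the cgroup v1 usage file (`memory/memory.usage_in_bytes`) is an assumption that is not mentioned above. Other container runtimes may mount cgroups elsewhere, so this is a starting point rather than a portable solution.

```python
from pathlib import Path
from typing import Optional

# Default mount point used by the paths quoted in this comment.
DEFAULT_CGROUP_ROOT = Path("/sys/fs/cgroup")


def _read_int(path: Path) -> Optional[int]:
    """Return the integer stored in `path`, or None if it is missing or not a number.

    cgroup v2 writes the literal string "max" when no limit is set,
    which is treated here as "unknown".
    """
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    return int(text) if text.isdigit() else None


def cgroup_available_memory(root: Path = DEFAULT_CGROUP_ROOT) -> Optional[int]:
    """Return limit minus current usage in bytes, or None if it cannot be determined."""
    candidates = (
        # cgroup v2 files
        (root / "memory.max", root / "memory.current"),
        # cgroup v1 files (the usage file name is an assumption, see above)
        (root / "memory" / "memory.limit_in_bytes",
         root / "memory" / "memory.usage_in_bytes"),
    )
    for limit_path, usage_path in candidates:
        limit = _read_int(limit_path)
        usage = _read_int(usage_path)
        if limit is not None and usage is not None:
            return max(limit - usage, 0)
    return None
```

On cgroup v1 an unlimited container reports a very large limit instead of "max", so a caller would probably want to take the minimum of this value and `psutil.virtual_memory().available` rather than trust it alone.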

I am not experienced enough to offer a solution that works on all containers and systems.

LarryLiZimo avatar Nov 01 '25 10:11 LarryLiZimo

Maybe I'll ask one of my experienced co-workers to improve ReplayBuffer and open a PR.

LarryLiZimo avatar Nov 01 '25 11:11 LarryLiZimo