[Bug]: ReplayBuffer cannot detect the real memory constraints in Kubernetes.
🐛 Bug
In stable_baselines3/common/buffers.py, around line 200, ReplayBuffer uses psutil to determine the available memory. Inside a container, this does not reflect the memory that is actually available to the process.
# in the `__init__` of `ReplayBuffer`
if psutil is not None:
    mem_available = psutil.virtual_memory().available
My server has over 700 GB of memory, but only 64 GB is available in my k8s container. I ran a training script that is estimated to require 30 GB of memory. As a consequence, it was terminated silently. If Stable-Baselines3 warned about this earlier, other users could be spared this potential problem.
To Reproduce
No response
Relevant log output / Error message
System Info
No response
Checklist
- [x] My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
- [x] I have checked that there is no similar issue in the repo
- [x] I have read the documentation
- [x] I have provided a minimal and working example to reproduce the bug
- [x] I've used the markdown code blocks for both code and stack traces.
Hello,
but only 64 G available in my k8s container. I ran a training script, which is estimated to require 30 G memory.
If there is 64 GB available and you are using 30 GB, it should be fine, no?
To clarify, another running script already takes 40 GB.
Decreasing buffer_size is fine, but not being warned about a potential "Out of Memory" is bad.
Do you have a proposed fix?
By reading "/sys/fs/cgroup/memory.max" and "/sys/fs/cgroup/memory.current", I can manually check the amount of free memory. On older Linux systems that still use cgroup v1, read "/sys/fs/cgroup/memory/memory.limit_in_bytes" instead. This method works fine on my Linux machine.
I am not expert enough to offer a solution that works fine on all containers and systems.
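For illustration, the manual check described above could be sketched roughly as follows. This is not Stable-Baselines3 code: the helper names `cgroup_available_memory` and `effective_available_memory` and the `root` parameter are made up for this example, and the cgroup v1 usage file `memory.usage_in_bytes` is an addition not mentioned above. It assumes the container mounts its own cgroup at `/sys/fs/cgroup`, which may not hold on every setup.

```python
from pathlib import Path
from typing import Optional


def _read_int(path: Path) -> Optional[int]:
    """Return the integer stored in `path`, or None if it is missing or not a number."""
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    # cgroup v2 writes the literal string "max" when no limit is set.
    return int(text) if text.isdigit() else None


def cgroup_available_memory(root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return (limit - usage) in bytes from cgroup files, or None if no limit is found."""
    base = Path(root)
    # cgroup v2 layout
    limit = _read_int(base / "memory.max")
    usage = _read_int(base / "memory.current")
    if limit is None or usage is None:
        # cgroup v1 layout (older systems)
        limit = _read_int(base / "memory" / "memory.limit_in_bytes")
        usage = _read_int(base / "memory" / "memory.usage_in_bytes")
    if limit is None or usage is None:
        return None
    return max(limit - usage, 0)


def effective_available_memory(host_available: int, root: str = "/sys/fs/cgroup") -> int:
    """Combine a host-level figure (e.g. from psutil) with the cgroup figure, if any."""
    cgroup_available = cgroup_available_memory(root)
    if cgroup_available is None:
        return host_available
    # The tighter of the two bounds is the one that matters.
    return min(host_available, cgroup_available)
```

The value passed as `host_available` would be whatever `psutil.virtual_memory().available` returns. Taking the minimum also covers cgroup v1 without a limit, where the limit file holds a very large number instead of "max".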
Maybe I'll ask one of my experienced co-workers to improve ReplayBuffer and make a PR.