[RLlib] PolicyServerInput memory leak

Open MattiasDC opened this issue 2 years ago • 2 comments

What happened + What you expected to happen

PolicyServerInput keeps storing samples in its samples_queue, regardless of the rate at which samples are generated. If samples come in faster than they are consumed, memory usage grows without bound. No warnings were reported while memory was increasing, and it took me several hours of debugging to track down the culprit.

At the very least, I would expect a logged warning when the samples_queue reaches very high memory usage. In my case it reached 500GB in under 24 hours.

This is a well-known producer-consumer problem, and one solution would be to implement some kind of back-pressure mechanism. Alternatively, the queue could be trimmed whenever adding new samples would push it past a 'max-size' parameter. As a workaround, I had to reduce the number of clients generating samples.
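To make the 'max-size' idea concrete, here is a minimal sketch using only the Python standard library. It is not the RLlib implementation or the fix from the PR mentioned later in this thread. The class name `DropOldestQueue` and its `dropped` counter are hypothetical, chosen purely for illustration.

```python
import queue
import threading


class DropOldestQueue:
    """Bounded FIFO that evicts the oldest item when full (hypothetical helper)."""

    def __init__(self, maxsize):
        if maxsize <= 0:
            raise ValueError("maxsize must be positive")
        self._queue = queue.Queue(maxsize=maxsize)
        self._lock = threading.Lock()  # serializes producers while they evict
        self.dropped = 0  # number of items discarded so far

    def put(self, item):
        with self._lock:
            while True:
                try:
                    self._queue.put_nowait(item)
                    return
                except queue.Full:
                    try:
                        # Make room by discarding the oldest queued item.
                        self._queue.get_nowait()
                        self.dropped += 1
                    except queue.Empty:
                        # A consumer drained the queue in the meantime; retry.
                        pass

    def get(self, timeout=None):
        return self._queue.get(timeout=timeout)

    def qsize(self):
        return self._queue.qsize()
```

If dropping samples is not acceptable, the back-pressure variant is simpler: a plain `queue.Queue(maxsize=N)` makes `put()` block (optionally with a `timeout`) until the consumer catches up, which slows the producers down instead of discarding data.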

Versions / Dependencies

Ray: 2.0.0
Python: 3.8.10
Ubuntu: 20.04.05

Reproduction script

I would expect you can reproduce this if you start from the Cartpole Server example. Create a big network and set num_sgd_iter to a high number, so that samples are consumed slowly. Create multiple clients. Track the size of self.samples_queue in the PolicyServerInput. If it increases consistently, you will eventually run out of memory.
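Tracking the queue size could look roughly like the sketch below. It assumes that samples_queue exposes a `qsize()` method, as a standard `queue.Queue` does. The helper names `check_queue_size` and `is_consistently_growing`, the logger name, and the thresholds are all made up for illustration and are not part of RLlib.

```python
import logging
import queue

logger = logging.getLogger("samples_queue_monitor")  # hypothetical logger name


def check_queue_size(q, warn_threshold):
    """Return the current queue size and log a warning above warn_threshold."""
    size = q.qsize()
    if size > warn_threshold:
        logger.warning(
            "samples queue holds %d items (threshold %d)", size, warn_threshold
        )
    return size


def is_consistently_growing(sizes, window=5):
    """True if the last `window` readings are strictly increasing."""
    recent = sizes[-window:]
    return len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:])
    )


# Example: a stand-in queue that is only ever filled, never drained.
demo_queue = queue.Queue()
history = []
for sample in range(6):
    demo_queue.put(sample)
    history.append(check_queue_size(demo_queue, warn_threshold=3))
```

Calling `check_queue_size` periodically on the real queue, for instance from a background thread, and feeding the readings into `is_consistently_growing` would flag the leak long before memory runs out.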

Issue Severity

Medium: It is a significant difficulty but I can work around it.

MattiasDC avatar Nov 09 '22 10:11 MattiasDC

@MattiasDC : PR in review :)

sven1977 avatar Jan 03 '23 13:01 sven1977

@MattiasDC : PR in review :)

Thanks for taking the time to fix this!

MattiasDC avatar Jan 03 '23 23:01 MattiasDC

Sorry to necro an old thread, but I manually implemented the PR and it fixed the issue for me without a problem. Is there anything we can do to get it pulled into main?

DenysAshikhin avatar May 03 '23 21:05 DenysAshikhin