Memory leak when ray.remote tasks are repeatedly called in simulation
Describe the bug
The ray.remote tasks in ray_client_proxy.py are called repeatedly when running a simulation. By default, a worker that finishes a task goes into Ray::IDLE mode and stays resident on the GPU, so the occupied GPU memory keeps increasing until a CUDA out-of-memory error occurs.
My solution is to pass the max_calls parameter when declaring the remote function, e.g. ray.remote(max_calls=1), based on the Ray documentation here.
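
For illustration, here is a minimal sketch of the change. The function name and arguments are placeholders standing in for the remote function in ray_client_proxy.py, not a verbatim copy of Flower's internals:

```python
import ray

# max_calls=1 tells Ray to shut the worker process down after each task,
# instead of leaving it in Ray::IDLE mode with its CUDA context (and any
# cached allocations) still resident on the GPU.
@ray.remote(max_calls=1)
def launch_and_fit(client_fn, cid, fit_ins):
    client = client_fn(cid)     # build the client for this round
    return client.fit(fit_ins)  # worker exits after returning, freeing its GPU memory
```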
Steps/Code to Reproduce
Run the simulation_pytorch example with python main.py, and monitor GPU memory usage with watch -n 0.1 nvidia-smi.
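
To see the pattern in isolation, the following is a condensed, self-contained sketch of what each simulation round does (the resource fraction, tensor sizes, and names are illustrative, and this is not Flower's actual code):

```python
import ray
import torch

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.25)  # illustrative: four "clients" share one GPU
def client_update(cid):
    model = torch.nn.Linear(1000, 1000).cuda()
    x = torch.randn(64, 1000, device="cuda")
    return model(x).sum().item()

for rnd in range(10):
    # After each round, the workers that ran these tasks sit in Ray::IDLE
    # mode with their CUDA contexts still allocated, which is the memory
    # growth visible in nvidia-smi.
    results = ray.get([client_update.remote(cid) for cid in range(4)])
    print(f"round {rnd}: {results}")
```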
Expected Results
Clients release their occupied GPU memory after every communication round, so that other clients can use it in the next round.
Actual Results
All clients that have been executed keep their GPU memory, which eventually causes a CUDA OOM error.