Memory leak when ray.remote tasks are repeatedly called in simulation
Describe the bug
The ray.remote tasks in ray_client_proxy.py are called repeatedly when running a simulation. By default, a worker that finishes a task goes into Ray::IDLE mode and stays resident on the GPU, so the occupied GPU memory keeps increasing until a CUDA out-of-memory error occurs.
My solution is to pass the max_calls parameter when declaring the remote function, e.g. ray.remote(max_calls=1), based on the Ray documentation here.
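
For illustration, here is a minimal sketch of the change. The function name and arguments are placeholders standing in for the remote function in ray_client_proxy.py, not a verbatim copy of Flower's internals:

```python
import ray

# max_calls=1 tells Ray to shut the worker process down after each task,
# instead of leaving it in Ray::IDLE mode with its CUDA context (and any
# cached allocations) still resident on the GPU.
@ray.remote(max_calls=1)
def launch_and_fit(client_fn, cid, fit_ins):
    client = client_fn(cid)     # build the client for this round
    return client.fit(fit_ins)  # worker exits after returning, freeing its GPU memory
```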
Steps/Code to Reproduce
Run the simulation_pytorch example with python main.py, and monitor GPU memory usage with watch -n 0.1 nvidia-smi.
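
To see the pattern in isolation, the following is a condensed, self-contained sketch of what each simulation round does (the resource fraction, tensor sizes, and names are illustrative, and this is not Flower's actual code):

```python
import ray
import torch

ray.init(num_gpus=1)

@ray.remote(num_gpus=0.25)  # illustrative: four "clients" share one GPU
def client_update(cid):
    model = torch.nn.Linear(1000, 1000).cuda()
    x = torch.randn(64, 1000, device="cuda")
    return model(x).sum().item()

for rnd in range(10):
    # After each round, the workers that ran these tasks sit in Ray::IDLE
    # mode with their CUDA contexts still allocated, which is the memory
    # growth visible in nvidia-smi.
    results = ray.get([client_update.remote(cid) for cid in range(4)])
    print(f"round {rnd}: {results}")
```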
Expected Results
Clients release their occupied GPU memory after every communication round, so that other clients can use it in the next round.
Actual Results
All clients that have been executed keep their GPU memory, which eventually causes a CUDA OOM error.