
Better to manually clean up GPU memory when loading motions

Open luoye2333 opened this issue 1 year ago • 5 comments

I often hit CUDA out-of-memory errors during model evaluation (which is called periodically, every 1500 training iterations).

In motion_lib_real.py line 199 we load the motions into memory, transfer them to GPU tensors, and assign them to class attributes (e.g. self.gts). Perhaps the tensors previously held in self.gts are not freed automatically.

self.gts = torch.cat([m.global_translation for m in motions], dim=0).float().to(self._device)
self.grs = torch.cat([m.global_rotation for m in motions], dim=0).float().to(self._device)
self.lrs = torch.cat([m.local_rotation for m in motions], dim=0).float().to(self._device)
self.grvs = torch.cat([m.global_root_velocity for m in motions], dim=0).float().to(self._device)
self.gravs = torch.cat([m.global_root_angular_velocity for m in motions], dim=0).float().to(self._device)
self.gavs = torch.cat([m.global_angular_velocity for m in motions], dim=0).float().to(self._device)
self.gvs = torch.cat([m.global_velocity for m in motions], dim=0).float().to(self._device)
self.dvs = torch.cat([m.dof_vels for m in motions], dim=0).float().to(self._device)

So it is better to manually clear the old tensors before loading:

# Drop references to the previously loaded motion tensors, then release the cached blocks
# (needs "import gc" at the top of motion_lib_real.py if it is not already imported).
self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs = None, None, None, None, None, None, None, None
gc.collect(); torch.cuda.empty_cache()

Do the same at line 208:

self.gts_t, self.grs_t, self.gvs_t, self.gavs_t = None, None, None, None
gc.collect(); torch.cuda.empty_cache()

and at line 214:

self.dof_pos = None
gc.collect(); torch.cuda.empty_cache()

This lets me train on a single RTX 4090. But I'm not sure this is the root cause; it is weird that the memory is not cleaned up automatically after assigning new data to the old attributes.
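One likely explanation: in self.gts = torch.cat(...).to(self._device) the new tensor is fully allocated before the old self.gts reference is dropped, so at that moment both copies live on the GPU, and even after the old one is freed the PyTorch caching allocator keeps the blocks reserved, so nvidia-smi does not show a drop until torch.cuda.empty_cache(). A standalone sketch (hypothetical Holder class, made-up tensor sizes, not PHC code) that shows the roughly 2x peak:

import torch

class Holder:
    pass

h = Holder()
h.gts = torch.randn(512, 1024, 1024, device="cuda")  # ~2 GiB of float32

torch.cuda.reset_peak_memory_stats()
# Plain reassignment: the new tensor is built while h.gts still points at the old one,
# so the peak is roughly twice the tensor size.
h.gts = torch.randn(512, 1024, 1024, device="cuda")
print(torch.cuda.max_memory_allocated() / 2**30, "GiB peak (plain reassignment)")

torch.cuda.reset_peak_memory_stats()
# Dropping the reference first keeps the peak at roughly one tensor size.
h.gts = None
h.gts = torch.randn(512, 1024, 1024, device="cuda")
print(torch.cuda.max_memory_allocated() / 2**30, "GiB peak (clear first)")

# Even after tensors are freed, the caching allocator keeps the blocks reserved;
# nvidia-smi only reflects the drop after:
torch.cuda.empty_cache()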

luoye2333 avatar Nov 13 '24 06:11 luoye2333

Found a strange thing: there is already cleanup code in motion_lib_real.py line 77, but it is commented out.

# if "gts" in self.__dict__:
#     del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
#     del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
#     if "gts_t" in self.__dict__:
#         self.gts_t, self.grs_t, self.gvs_t
#     if flags.real_traj:
#         del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs

Change it to this:

if "gts" in self.__dict__:
    del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
    del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
if "gts_t" in self.__dict__:
    del self.gts_t, self.grs_t, self.gvs_t, self.gavs_t
if "dof_pos" in self.__dict__:
    del self.dof_pos
if flags.real_traj:
    del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs

luoye2333 avatar Nov 13 '24 08:11 luoye2333

GPU memory usage can be cut down further by clearing variables after the last evaluation step. The tensors in env._motion_eval_lib are not cleared when _motion_lib is switched back to _motion_train_lib after evaluation finishes, and we don't need them during training. They will be reloaded anyway at the next evaluation, another 1500 training epochs later.

In phc/learning/im_amp.py line 227:

humanoid_env._motion_eval_lib.clear_cache() # add this
humanoid_env._motion_lib = humanoid_env._motion_train_lib

In phc/utils/motion_lib_real.py, add this function:

def clear_cache(self):
    # Drop references to all motion tensors loaded by this lib so they can be freed.
    if "gts" in self.__dict__:
        del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
        del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
    if "gts_t" in self.__dict__:
        del self.gts_t, self.grs_t, self.gvs_t, self.gavs_t
    if "dof_pos" in self.__dict__:
        del self.dof_pos
    if flags.real_traj:
        del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs
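
Pairing clear_cache() with gc.collect() and torch.cuda.empty_cache() at the call site also makes the freed memory visible in nvidia-smi, since del only drops the Python references and the caching allocator keeps the blocks reserved. A sketch of that call site (assuming gc and torch are already imported in im_amp.py):

humanoid_env._motion_eval_lib.clear_cache()  # drop references to the eval motion tensors
gc.collect()                                 # collect any lingering references
torch.cuda.empty_cache()                     # return the freed blocks to the driver
humanoid_env._motion_lib = humanoid_env._motion_train_lib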

It would also be possible to clear _motion_train_lib when entering evaluation (around line 178 in im_amp.py), but memory usage seems fine without that.

A typical GPU memory usage timeline with num_envs=2048: ~5 GB allocated by the Gym simulation, ~12.5 GB to load the training variables, and another ~5 GB (at peak) to load the evaluation variables. After evaluation it comes back down to 5 + 12.5 GB. (See the attached gpu0_memory_usage plot.)
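
For reference, such a timeline can be captured by polling the used memory of GPU 0 once per second, e.g. with nvidia-smi --query-gpu=memory.used --format=csv -l 1, or with a small script like the sketch below (assumes the pynvml package; this is just one possible way, not necessarily the script behind the plot above). Note that this reports the total memory used on the device, the same number nvidia-smi shows, not just PyTorch's allocated tensors.

import csv, time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
with open("gpu0_memory_usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time_s", "used_mib"])
    t0 = time.time()
    try:
        while True:  # stop with Ctrl+C
            used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
            writer.writerow([round(time.time() - t0, 1), used // 2**20])
            f.flush()
            time.sleep(1.0)
    except KeyboardInterrupt:
        pass
pynvml.nvmlShutdown()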

luoye2333 avatar Nov 18 '24 05:11 luoye2333

Thanks for pointing this out! Indeed I had this cleanup code in older versions, but I removed it because I ran into issues with it when using MuJoCo for visualization. Essentially, the motion states would get deleted when I tried to interact with the UI (to request the next motion frame).

Feel free to create a pull request for this. Thanks! @luoye2333

ZhengyiLuo avatar Dec 10 '24 01:12 ZhengyiLuo

"I also have a single RTX 4090, and after training for 1500 episodes, it gets randomly killed during evaluation. Although there is no 'out of memory' message in the terminal, I think we might be facing the same issue. Thank you very much for your suggestion."

onlyloveyanzi avatar Aug 10 '25 09:08 onlyloveyanzi

(quoting luoye2333's comment above about clear_cache() and the GPU memory usage timeline)

Hello, how do you monitor real-time memory usage? Do you have any related code? I encounter crashes every time I evaluate at 7500 steps, but when I check with nvidia-smi, there is still a lot of GPU memory available.

onlyloveyanzi avatar Aug 11 '25 06:08 onlyloveyanzi