RelationalGraphLearning

memory leak for multi-human policies

Open · huiwenzhang opened this issue 1 year ago · 3 comments

Hi, when running the multi-human policies such as sarl and lstm-rl, I noticed a drastic memory increase as training goes on: the memory used grew from about 4 GB to 20 GB after 100 training episodes. I have debugged for a long time but still have no clue about what is going wrong. @ChanganVR please have a look.
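For context, a frequent cause of this pattern in PyTorch training loops (not confirmed as the cause here) is accumulating tensors that are still attached to the autograd graph, e.g. when collecting losses or values into a buffer. A minimal sketch of the pattern and its fix; the linear model and loop below are hypothetical stand-ins for the actual policy network:

```python
import torch

# Hypothetical toy model standing in for the policy's value network.
model = torch.nn.Linear(4, 1)
losses = []

for episode in range(100):
    pred = model(torch.randn(8, 4))
    loss = (pred ** 2).mean()

    # LEAK: appending the raw tensor keeps the entire autograd graph
    # for this episode alive, so memory grows every iteration.
    # losses.append(loss)

    # Fix: store a detached Python float instead.
    losses.append(loss.item())
```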

huiwenzhang · Jun 23 '23

@huiwenzhang No such issue has been reported before. Maybe you could check whether your PyTorch and CUDA versions are compatible; sometimes that can affect memory consumption.
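For reference, a quick way to check which CUDA build the installed PyTorch was compiled against:

```python
import torch

print(torch.__version__)          # PyTorch build, e.g. "2.0.1+cu118"
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())  # whether the local driver/runtime is usable
```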

ChanganVR · Jun 24 '23


I used PyTorch 2.0.1 with CUDA 11.8. The local CUDA version is 12.1; according to the official PyTorch docs, a newer local CUDA version is also supported. Besides, I tried training without the GPU, as you suggested, but the problem still exists. Training with the cadrl and rgl policies is fine. Do you have any other guesses about the memory leak?

huiwenzhang · Jun 24 '23

@huiwenzhang I see. I don't have a clue what could be causing the issue. You could debug by removing all the code and then adding it back part by part until the issue reappears.
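One way to make that bisection faster is to log resident memory per episode while commenting parts of the code in and out. A minimal sketch, assuming psutil is installed (pip install psutil) and with train_one_episode as a hypothetical placeholder for one episode of the failing policy:

```python
import gc
import psutil  # assumption: psutil is available in the environment

def train_one_episode():
    # Placeholder for one training episode of the failing policy.
    pass

process = psutil.Process()
for episode in range(100):
    train_one_episode()
    gc.collect()  # rule out garbage that simply has not been collected yet
    rss_gb = process.memory_info().rss / 1e9
    print(f"episode {episode}: RSS = {rss_gb:.2f} GB")
```

Whichever removed component makes the per-episode RSS stop climbing is the likely culprit.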

ChanganVR · Jun 30 '23