KPConv-PyTorch

Bug leading to a continuous increase in graphics card (GPU) memory usage

yatengLG opened this issue 3 years ago • 3 comments

The dataloader's 'collate_fn' (which returns batch classes such as S3DISCustomBatch or ModelNet40CustomBatch) leads to a continuous increase in graphics card memory usage.

My English is not good; please forgive me, and let me try to express the problem more clearly.

In the dataloader, collate_fn returns the batch class instance itself as the data. During training, the class attributes are moved to GPU memory, and after the training step they are updated with the next batch of training data. Logically there is nothing wrong with this, and it also works fine when training on the CPU. However, in KPConv the number of sampled points is not fixed, so the class attributes keep changing size, and the locations allocated in GPU memory differ every iteration. Since the GPU memory is not released automatically, GPU memory usage accumulates continuously.
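For illustration, a simplified sketch (not the actual repository code) of the pattern I mean: the collate_fn returns a batch object whose tensor attributes change size every iteration and are all moved to the GPU. The class and attribute names here are placeholders standing in for S3DISCustomBatch / ModelNet40CustomBatch.

```python
import torch

class CustomBatch:
    """Simplified stand-in for S3DISCustomBatch / ModelNet40CustomBatch."""

    def __init__(self, points, neighbors, labels):
        # Tensor attributes whose first dimension varies from batch to batch,
        # because KPConv subsampling returns a different number of points each time.
        self.points = points
        self.neighbors = neighbors
        self.labels = labels

    def to(self, device):
        # Moving every attribute to the GPU allocates new blocks whose sizes
        # differ at each iteration, so the caching allocator keeps reserving
        # more memory instead of reusing previously cached blocks.
        self.points = self.points.to(device)
        self.neighbors = self.neighbors.to(device)
        self.labels = self.labels.to(device)
        return self


def collate_fn(samples):
    # The collate_fn returns the batch object itself as the data.
    points = torch.cat([s[0] for s in samples], dim=0)
    neighbors = torch.cat([s[1] for s in samples], dim=0)
    labels = torch.cat([s[2] for s in samples], dim=0)
    return CustomBatch(points, neighbors, labels)
```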

A simple solution is to add the following line after the loss backward pass:

torch.cuda.empty_cache()

But this cache emptying is asynchronous, which means it does not necessarily happen at the exact moment the call is made.
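For illustration, a minimal sketch of where this call could go in a generic training step; the names net, loader, criterion and optimizer are placeholders, not this repository's actual trainer code.

```python
import torch

def train_one_epoch(net, loader, criterion, optimizer, device):
    for batch in loader:
        batch.to(device)                      # varying-size tensors land on the GPU

        optimizer.zero_grad()
        outputs = net(batch)
        loss = criterion(outputs, batch.labels)
        loss.backward()
        optimizer.step()

        # Release cached blocks that no longer match the next batch's tensor sizes,
        # so the reserved GPU memory does not keep accumulating.
        torch.cuda.empty_cache()
```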

yatengLG • Sep 27 '21 03:09

Hi @yatengLG,

Thank you for your help. I tried to google translate your explanation and this is what I understood:

The collate_fn returns an object instance of the batch class, for example:

https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/cf1f06381ef062344b68286e3d5034b1aa94aecd/datasets/S3DIS.py#L1488-L1489

During training, to('cuda') is called, so the attributes of the batch class are loaded onto the GPU.

After the training step, these attributes are updated with the next training data, but the GPU memory is not released between two training steps, meaning the GPU memory usage will increase over time.

Is that correct?

HuguesTHOMAS • Sep 27 '21 15:09

I will add

torch.cuda.empty_cache()

to the code, as this is a simple solution.

HuguesTHOMAS • Sep 27 '21 15:09

Yes @HuguesTHOMAS, that is exactly what I mean. But I cannot confirm the exact reason either. Maybe it is because the neighbors attribute of the batch class has a different size each time. I tried returning the neighbors as a list, and the GPU memory still increased.

torch.cuda.empty_cache() is effective. Before adding it, the memory usage increased over time: my RTX 2080 Ti has 11 GB of memory and always ran out of memory when I trained KPConv with more than 2000 samples. After adding it, the memory used is about 1.4 GB, which is enough.
In the above test, the batch size was 1 and the maximum number of points was 7000.
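For reference, a small helper I would use to watch the allocated and reserved memory after each step; the function name and print format are just an illustration, but the counters are standard torch.cuda utilities.

```python
import torch

def log_gpu_memory(step):
    # memory_allocated: memory currently occupied by live tensors.
    # memory_reserved: memory held by the caching allocator; this is what keeps
    # growing without empty_cache when tensor sizes change every iteration.
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"step {step}: allocated {allocated:.1f} MB, reserved {reserved:.1f} MB")
```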

yatengLG • Sep 28 '21 01:09