KPConv-PyTorch
Bug: continuous increase in GPU memory usage
The `collate_fn` of the dataloader (e.g. `S3DISCustomBatch`, `ModelNet40CustomBatch`) leads to a continuous increase in GPU memory usage.
My English is not good; to express the problem more clearly, I explained it in Chinese (translated below).
In the dataloader, `collate_fn` returns the class instance itself as the data. During training, the class attributes are moved to GPU memory, and after a training step they are updated to hold the next batch of training data. This is logically fine, and there is no problem when training on the CPU. However, in KPConv the number of sampled points is not fixed, so the class attributes keep changing size, and the locations allocated for them in GPU memory differ each step. Since the GPU memory is not released automatically, GPU memory usage keeps accumulating.
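The pattern described above can be sketched as follows. The class and function names here are illustrative placeholders, not the exact KPConv-PyTorch classes: a batch object returned by `collate_fn` carries tensor attributes of varying size, and its `to()` method reallocates them on the device every step.

```python
import torch

# Hypothetical sketch of the batch-class pattern (names are illustrative).
class CustomBatch:
    def __init__(self, points, neighbors):
        self.points = points        # (N, 3) point cloud; N varies per batch
        self.neighbors = neighbors  # neighbor indices; size also varies

    def to(self, device):
        # Moving the attributes allocates new buffers on the device.
        # Because N changes every batch, the allocation sizes differ each
        # step, so previously cached blocks cannot simply be reused.
        self.points = self.points.to(device)
        self.neighbors = self.neighbors.to(device)
        return self

def collate_fn(samples):
    # Each call may produce a different number of points per cloud.
    points = torch.cat(list(samples), dim=0)
    neighbors = torch.randint(0, points.shape[0], (points.shape[0], 16))
    return CustomBatch(points, neighbors)

# Batch sizes differ step to step, mimicking KPConv's variable sampling.
batch_a = collate_fn([torch.randn(5000, 3)])
batch_b = collate_fn([torch.randn(7000, 3)])
```

Because the allocation sizes never stabilize, PyTorch's caching allocator keeps old blocks around while allocating new, differently sized ones, which is consistent with the accumulation described above.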
A simple solution is to add the following line after the loss backward pass:
torch.cuda.empty_cache()
But note that emptying the cache this way is asynchronous, meaning the memory is not necessarily released at the exact moment of the call.
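A minimal training-loop sketch showing where the call goes. The model, optimizer, and data are placeholders; only the position of `torch.cuda.empty_cache()` after the backward/step is the point.

```python
import torch

# Placeholder model and optimizer; falls back to CPU if no GPU is present.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Linear(3, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(3):
    # Variable-size input each step, as with KPConv's sampled point clouds.
    n_points = 5000 + 1000 * step
    points = torch.randn(n_points, 3).to(device)

    optimizer.zero_grad()
    loss = model(points).mean()
    loss.backward()
    optimizer.step()

    # Release cached blocks whose sizes no longer match the next batch.
    # (A no-op on CPU; as noted above, the release is not instantaneous.)
    torch.cuda.empty_cache()
```

Calling `empty_cache()` every iteration trades some allocator overhead for bounded memory growth, which is an acceptable trade-off when batch shapes vary every step.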
Hi @yatengLG,
Thank you for your help. I tried to google translate your explanation and this is what I understood:
The collate_fn returns an object instance of the batch class, for example:
https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/cf1f06381ef062344b68286e3d5034b1aa94aecd/datasets/S3DIS.py#L1488-L1489
During training, to('cuda') is called, therefore the attributes of the batch class are loaded onto the GPU.
After the training step, these attributes are updated to hold the next training data, but the GPU memory is not released between two training steps, meaning the GPU memory usage will increase over time.
Is that correct?
I will add torch.cuda.empty_cache() to the code, as this is a simple solution.
Yes, that's what I mean. But I can't confirm the reason either.
Maybe it's because the neighbors of the batch class have different sizes. I tried returning the neighbors as a list, and GPU memory still increased.
torch.cuda.empty_cache() is effective.
Before adding it, the memory usage increased over time; the RTX 2080 Ti's 11 GB of memory always ran out when I trained KPConv with more than 2000 samples.
After adding it, 1.4 GB of memory was enough.
The above test used a batch size of 1 with a maximum of 7000 points.