SoftGroup

process killed by computer

Open DurbinLiu opened this issue 2 years ago • 8 comments

Hello, when I run the command ./tools/dist_train.sh configs/softgroup_scannet.yaml 1, I run into the following problem: my process gets killed by my computer after running for several epochs. I searched for the issue and found that it is caused by OOM. I am using a single 3090 GPU with batchsize=4 and num_workers=4, and I think this shouldn't cause an out-of-memory error, given that it can run for some epochs. Do you know why this happens and how to deal with it? Hoping for your reply, many thanks!

DurbinLiu avatar Sep 08 '22 03:09 DurbinLiu

I also encountered the same problem. Running the test on a single RTX3090 24GB fails with "CUDA error: an illegal memory access was encountered".

wsk12345 avatar Sep 08 '22 20:09 wsk12345

I am not very sure. Could you check with the --skip-validation flag? You can also resume training with --resume.

thangvubk avatar Sep 09 '22 01:09 thangvubk

Thank you for your advice. But I have no intention of training and want to test on point cloud data.

wsk12345 avatar Sep 09 '22 03:09 wsk12345

@wsk12345 Which dataset are you using?

thangvubk avatar Sep 09 '22 03:09 thangvubk

@thangvubk Thank you for your prompt reply. I am using a custom dataset. A single scene has about 5e6 points.

wsk12345 avatar Sep 09 '22 03:09 wsk12345

I ran into an "illegal memory access" error. It was caused by the radius being too large for the dataset I was training on. It may have the same effect while testing.

KtK99 avatar Sep 09 '22 04:09 KtK99

If you have memory errors with custom datasets, I suggest checking the input spatial_shape. Spconv2 may not support inputs with too large a spatial shape (e.g., 3000x3000x1000).
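
A minimal sketch of such a check, assuming the scene is a list of (x, y, z) coordinates and is voxelized on a regular grid. The names voxel_spatial_shape and exceeds_limit, the voxel size, and the limit are hypothetical and are not taken from SoftGroup or spconv. It only shows how the grid extent grows with scene size, so substitute the values from your own config:

```python
import math

def voxel_spatial_shape(points, voxel_size):
    """Return the (x, y, z) extent, in voxels, of the grid covering `points`."""
    if not points:
        raise ValueError("points must not be empty")
    shape = []
    for axis in range(3):
        lo = min(p[axis] for p in points)
        hi = max(p[axis] for p in points)
        # Number of voxels needed to cover [lo, hi] along this axis.
        shape.append(math.floor((hi - lo) / voxel_size) + 1)
    return tuple(shape)

def exceeds_limit(shape, limit):
    """True if any axis of `shape` is larger than the matching axis of `limit`."""
    return any(s > m for s, m in zip(shape, limit))

# Hypothetical scene corners and voxel size; replace with real data.
points = [(0.0, 0.0, 0.0), (60.0, 60.0, 20.0)]
shape = voxel_spatial_shape(points, voxel_size=0.02)

# Placeholder limit (an assumption, not a documented spconv bound).
if exceeds_limit(shape, limit=(2048, 2048, 1024)):
    print("spatial shape", shape, "may be too large; consider cropping or a coarser voxel size")
```

If the computed shape is on the order of the example above, splitting the scene into smaller crops or using a larger voxel size reduces it.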

thangvubk avatar Sep 11 '22 15:09 thangvubk

I am getting the OOM error while testing on the S3DIS dataset on a single RTX6000.

Krupal09 avatar Sep 29 '22 14:09 Krupal09