
train Detected OutOfMemoryError

Open num-doc opened this issue 2 years ago • 7 comments

ERROR: Detected OutOfMemoryError! Enabling checkpointing to reduce memory usage, but this slows down training. Consider enabling AMP (--amp) for fast and memory efficient training

Environment: Windows 10, Python 3.7, torch-gpu 1.12, CUDA 11.3, GTX 1080 Ti

num-doc avatar Feb 06 '23 11:02 num-doc

You are running out of memory. Please reduce the image scale, use a larger GPU, or enable AMP as the message suggests.
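As a rough illustration of why reducing the image scale helps: the number of pixels per image falls with the square of the scale factor, and the memory taken by a convolutional network's activations grows roughly in proportion to the pixel count. The sketch below only does that arithmetic. The function names are hypothetical, and it assumes the resized width and height are truncated to whole pixels.

```python
from typing import Tuple


def scaled_size(width: int, height: int, scale: float) -> Tuple[int, int]:
    """Image size after resizing by `scale` (assumed to truncate to whole pixels)."""
    return int(width * scale), int(height * scale)


def relative_pixels(width: int, height: int, scale: float) -> float:
    """Fraction of the original pixel count that remains after scaling."""
    new_w, new_h = scaled_size(width, height, scale)
    return (new_w * new_h) / (width * height)


if __name__ == "__main__":
    # 640x640 is the image size mentioned in this thread.
    for scale in (1.0, 0.5, 0.25):
        w, h = scaled_size(640, 640, scale)
        fraction = relative_pixels(640, 640, scale)
        print(f"scale={scale}: {w}x{h}, {fraction:.4f} of the original pixels")
```

Halving the scale leaves a quarter of the pixels, so it is usually a much stronger lever than trimming the batch size by one or two.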

milesial avatar Feb 06 '23 12:02 milesial

Thanks for answering. I don't think it is a memory problem, because I am only using 30 images of 640*640 pixels, and I have run your code successfully before. Is there another possible cause?

num-doc avatar Feb 09 '23 01:02 num-doc

Well, the error is "Detected OutOfMemoryError!", so it is definitely a memory error. Check that nothing else is running on your GPU, such as a UI or other workloads.
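One way to check what is already occupying the GPU is to ask `nvidia-smi` for used and total memory before starting training. The sketch below is a minimal example and not part of the repository; the helper names are hypothetical. It assumes `nvidia-smi` is on the PATH and returns an empty list when it is not.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' rows (in MiB) from nvidia-smi CSV output."""
    usage = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        try:
            used, total = (int(field) for field in fields)
        except ValueError:
            # Skip rows that are not two integers (e.g. unsupported fields).
            continue
        usage.append((used, total))
    return usage


def query_gpu_memory() -> List[Tuple[int, int]]:
    """Return (used MiB, total MiB) per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {used} MiB used of {total} MiB")
```

If a large share of the memory is already in use before training starts, something else (a desktop session, a browser, another job) is holding it.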

milesial avatar Feb 09 '23 18:02 milesial

train got error:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/lee/common/PycahrmProjects/Pytorch-UNet-master/train.py", line 43, in train_model
    dataset = CarvanaDataset(dir_img, dir_mask, img_scale)
  File "/media/lee/common/PycahrmProjects/Pytorch-UNet-master/utils/data_loading.py", line 118, in __init__
    super().__init__(images_dir, mask_dir, scale, mask_suffix='_mask')
  File "/media/lee/common/PycahrmProjects/Pytorch-UNet-master/utils/data_loading.py", line 54, in __init__
    unique = list(tqdm(
  File "/home/lee/anaconda3/envs/persondet/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/lee/anaconda3/envs/persondet/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/lee/common/PycahrmProjects/Pytorch-UNet-master/train.py", line 212, in <module>
    train_model(
  File "/media/lee/common/PycahrmProjects/Pytorch-UNet-master/train.py", line 45, in train_model
    dataset = BasicDataset(dir_img, dir_mask, img_scale)
  File "/media/lee/common/PycahrmProjects/Pytorch-UNet-master/utils/data_loading.py", line 59, in __init__
    self.mask_values = list(sorted(np.unique(np.concatenate(unique), axis=0).tolist()))
  File "<array_function internals>", line 180, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 75 has 1 dimension(s)
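The ValueError above says the arrays handed to `np.concatenate` do not all have the same number of dimensions. One possible cause, inferred only from the message and not confirmed in this thread, is that some mask files load as 2-D arrays (single channel) while others load as 3-D arrays (for example RGB). The sketch below groups mask files by array dimensionality so the odd ones out can be found. The helper names are hypothetical, and `collect_shapes` assumes Pillow and NumPy are installed and that the directory contains only image files.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple


def group_by_ndim(shapes: Dict[str, Tuple[int, ...]]) -> Dict[int, List[str]]:
    """Group file names by the number of dimensions of their loaded array."""
    groups = defaultdict(list)
    for name, shape in sorted(shapes.items()):
        groups[len(shape)].append(name)
    return dict(groups)


def collect_shapes(mask_dir: str) -> Dict[str, Tuple[int, ...]]:
    """Load every file in `mask_dir` and record its array shape."""
    # Imported lazily so group_by_ndim works without Pillow/NumPy.
    import numpy as np
    from PIL import Image

    shapes = {}
    for path in sorted(Path(mask_dir).iterdir()):
        if path.is_file() and not path.name.startswith("."):
            with Image.open(path) as img:
                shapes[path.name] = np.asarray(img).shape
    return shapes


if __name__ == "__main__":
    # Hand-written shapes for illustration; real ones would come from
    # collect_shapes("path/to/masks").
    example = {"a_mask.png": (640, 640, 3), "b_mask.png": (640, 640)}
    for ndim, names in group_by_ndim(example).items():
        print(f"{ndim}-D masks: {names}")
```

If more than one group is reported, converting all masks to the same mode before training is a reasonable next step.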

willianLee avatar Jul 24 '23 07:07 willianLee

@willianLee I think you need to check the mapping between labels and images. This may be a problem with the initial configuration; for example, check the number of classes and whether the labels were generated correctly.
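A quick way to check the image-to-label mapping is to list the images that have no matching mask file. The sketch below is a minimal example under one assumption: that a mask is named after its image's stem plus a suffix, such as the `mask_suffix='_mask'` visible in the traceback. The function name is hypothetical. An image without a match is one plausible source of the `IndexError: list index out of range` above.

```python
from pathlib import Path
from typing import Iterable, List


def ids_without_mask(image_names: Iterable[str],
                     mask_names: Iterable[str],
                     mask_suffix: str = "") -> List[str]:
    """Return image stems for which no '<stem><mask_suffix>.*' mask exists."""
    mask_stems = {Path(name).stem for name in mask_names}
    missing = []
    for name in image_names:
        stem = Path(name).stem
        if stem + mask_suffix not in mask_stems:
            missing.append(stem)
    return sorted(missing)


if __name__ == "__main__":
    # Hand-written names for illustration; real ones could come from
    # os.listdir() on the image and mask directories.
    images = ["car_01.jpg", "car_02.jpg"]
    masks = ["car_01_mask.gif"]
    print(ids_without_mask(images, masks, mask_suffix="_mask"))
```

An empty list means every image has a mask under this naming scheme; anything else points at files to rename, regenerate, or remove.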

num-doc avatar Jul 24 '23 08:07 num-doc

Thanks for this. I have a new problem. When I set the model's channels to 3, it reports the error: AssertionError: Network has been defined with 3 input channels, but loaded images have 1 channels. Please check that the images are loaded correctly.

So I set the model's channels to 1 and got the error: AssertionError: Network has been defined with 1 input channels, but loaded images have 3 channels. Please check that the images are loaded correctly.

It's a bit strange!
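One possible explanation for getting both errors, not confirmed in this thread, is that the dataset mixes grayscale and RGB images, so whichever channel count the model is given, some image disagrees with it. The sketch below counts how many images have each channel count and shows one way to convert a file to RGB. The helper names are hypothetical, shapes are assumed to be in height, width, channels order as NumPy reports them for a Pillow image, and `convert_to_rgb` assumes Pillow is installed.

```python
from collections import Counter
from typing import Dict, Tuple


def channel_count(shape: Tuple[int, ...]) -> int:
    """Channels of an image array shaped (H, W) or (H, W, C)."""
    return 1 if len(shape) == 2 else shape[2]


def count_channels(shapes: Dict[str, Tuple[int, ...]]) -> Counter:
    """Count how many images have each number of channels."""
    return Counter(channel_count(shape) for shape in shapes.values())


def convert_to_rgb(src_path: str, dst_path: str) -> None:
    """Re-save an image as 3-channel RGB (grayscale is replicated across channels)."""
    # Imported lazily so the counting helpers work without Pillow.
    from PIL import Image

    with Image.open(src_path) as img:
        img.convert("RGB").save(dst_path)


if __name__ == "__main__":
    # Hand-written shapes for illustration.
    example = {"a.png": (640, 640), "b.png": (640, 640, 3), "c.png": (640, 640, 3)}
    print(count_channels(example))
```

If the count shows more than one channel value, converting every image to a single mode and setting the model's input channels to match should make the two assertions agree.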

willianLee avatar Jul 24 '23 08:07 willianLee

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

HeartbeatD avatar Jul 04 '24 02:07 HeartbeatD