
RuntimeError: CUDA out of memory.

Open Saharkakavand opened this issue 3 years ago • 8 comments

I have 4 images and the batch size is only 1, but when I start begin_training(train_dataset_dict, val_dataset_dict, model_dict, loss_dict, configs), I get: RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 31.75 GiB total capacity; 30.71 GiB already allocated; 62.50 MiB free; 12.93 MiB cached). Please let me know how I can solve it. Thanks

Saharkakavand avatar Jul 02 '21 14:07 Saharkakavand

Hello @Saharkakavand. Thank you for opening an issue and trying out EmbedSeg! Could you share the size of the original images and the crop_size which you used in 01-data.ipynb? A trivial reason for this out-of-memory error could simply be that additional notebooks are open - if so, shutting them down should release the GPU memory. This is how the Running tab should ideally look: [screenshot of the Jupyter Running tab]
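As a quick sanity check from inside the training notebook, something along these lines (plain PyTorch, nothing EmbedSeg-specific; a sketch, not part of the notebooks) shows how much of the GPU is already taken before begin_training is called:

```python
import torch

# Before running begin_training, report how much memory this kernel has already
# claimed on GPU 0 and how much the card offers in total. A large "allocated"
# value at this point usually means another notebook or process holds the GPU.
total = torch.cuda.get_device_properties(0).total_memory
allocated = torch.cuda.memory_allocated(0)
print(f"allocated: {allocated / 1024**3:.2f} GiB / total: {total / 1024**3:.2f} GiB")
```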

lmanan avatar Jul 02 '21 16:07 lmanan

@MLbyML, thank you for your reply. The original 3D image size is 713 x 806 x 714 and the crop size is 256 x 256 x 256. There is no notebook running, and before running the last cell to train the network there is no process on the GPU according to nvidia-smi -l.

Saharkakavand avatar Jul 02 '21 22:07 Saharkakavand

Sorry, I have 5 images with these shapes in the data directory:
img.shape[0] = [713, 748, 797, 791, 972]
img.shape[1] = [806, 787, 677, 798, 364]
img.shape[2] = [714, 772, 772, 783, 816]
The crops have this size: crops.shape[0] = 256, crops.shape[1] = 256, crops.shape[2] = 256

Saharkakavand avatar Jul 03 '21 08:07 Saharkakavand

Okay, these look like confocal volume images, since the size of the z dimension appears to be almost the same as the x and y dimensions - is that correct? For reference, this set of notebooks runs the pipeline on in-situ specimens imaged under confocal microscopy. I would have to dig a bit more into how GPU memory scales with crop_size and can get back to you. For now, I would recommend halving the crop_size - so something like 128 x 128 x 128 (X x Y x Z) if the z voxel size is roughly the same as the x and y voxel size. In case the z dimension is downsampled, you could also try 256 x 256 x 64 (X x Y x Z) and set the anisotropy_factor appropriately.
Since you may have to generate the crops again, you can increase the speed_up factor to 3 or higher to generate these crops more quickly. Let me know if you have questions. Thank you!
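For a rough sense of why the smaller crop helps: activation memory grows with the number of voxels per crop, so halving every side of a cubic crop cuts it by a factor of eight. A back-of-the-envelope sketch (the 16 channels and 32-bit floats are illustrative assumptions, not EmbedSeg's actual values):

```python
# Rough voxel/memory arithmetic for different 3-D crop sizes. The point is that
# the memory taken by intermediate feature maps scales with voxels per crop.
for crop in [(256, 256, 256), (128, 128, 128), (256, 256, 64)]:
    voxels = crop[0] * crop[1] * crop[2]
    mib = voxels * 16 * 4 / 1024**2  # 16 channels x 4 bytes per float32
    print(f"{crop}: {voxels / 1e6:.1f}M voxels, ~{mib:.0f} MiB per 16-channel feature map")
```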

lmanan avatar Jul 03 '21 08:07 lmanan

Hello @MLbyML, thank you for your reply. I changed the crop size and it works now, but after 60 epochs the train loss and val loss are still at 1.03. I also tried to run the prediction code on the two images I have as a test set, but I get this error: RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR.

3-D `test` dataloader created! Accessing data from ../../../data/fiber/test/
Number of images in `test` directory is 2
Number of instances in `test` directory is 2
Number of center images in `test` directory is 0


Creating branched erfnet 3d with [6, 1] classes

0%| | 0/2 [00:25<?, ?it/s]


RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 begin_evaluating(test_configs, verbose = True, avg_bg = avg_background_intensity/normalization_factor)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/test.py in begin_evaluating(test_configs, verbose, mask_region, mask_intensity, avg_bg)
     67                 grid_x=test_configs['grid_x'], grid_y=test_configs['grid_y'], grid_z=test_configs['grid_z'],
     68                 pixel_x=test_configs['pixel_x'], pixel_y=test_configs['pixel_y'],pixel_z=test_configs['pixel_z'],
---> 69                 one_hot=test_configs['dataset']['kwargs']['one_hot'], mask_region= mask_region, mask_intensity=mask_intensity, avg_bg = avg_bg)
     70
     71

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/test.py in test_3d(verbose, grid_x, grid_y, grid_z, pixel_x, pixel_y, pixel_z, one_hot, mask_region, mask_intensity, avg_bg)
    255                     output = torch.from_numpy(output_average).float().cuda()
    256                 else:
--> 257                     output = model(im)
    258
    259                 instance_map, predictions = cluster.cluster(output[0],

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    150             return self.module(*inputs[0], **kwargs[0])
    151         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 152         outputs = self.parallel_apply(replicas, inputs, kwargs)
    153         return self.gather(outputs, self.output_device)
    154

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    160
    161     def parallel_apply(self, replicas, inputs, kwargs):
--> 162         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    163
    164     def gather(self, outputs, output_device):

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     81         output = results[i]
     82         if isinstance(output, Exception):
---> 83             raise output
     84         outputs.append(output)
     85     return outputs

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py in _worker(i, module, input, kwargs, device)
     57                 if not isinstance(input, (list, tuple)):
     58                     input = (input,)
---> 59                 output = module(*input, **kwargs)
     60             with lock:
     61                 results[i] = output

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/models/BranchedERFNet_3d.py in forward(self, input, only_encode)
     36             output = self.encoder(input)
     37
---> 38         return torch.cat([decoder.forward(output) for decoder in self.decoders], 1)
     39
     40

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/models/BranchedERFNet_3d.py in <listcomp>(.0)
     36             output = self.encoder(input)
     37
---> 38         return torch.cat([decoder.forward(output) for decoder in self.decoders], 1)
     39
     40

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/models/erfnet_3d.py in forward(self, input)
    141
    142         for layer in self.layers:
--> 143             output = layer(output)
    144
    145         output = self.output_conv(output)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/EmbedSeg/models/erfnet_3d.py in forward(self, input)
     49
     50     def forward(self, input):
---> 51         output = self.conv3x1x1_1(input)
     52         output = F.relu(output)
     53         output = self.conv1x3x1_1(output)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

/beegfs/desy/user/kakavs/miniconda3/envs/fiber/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
    474                         self.dilation, self.groups)
    475         return F.conv3d(input, self.weight, self.bias, self.stride,
--> 476                         self.padding, self.dilation, self.groups)
    477
    478

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

[Screenshot from 2021-07-06 11-28-47]

Saharkakavand avatar Jul 06 '21 09:07 Saharkakavand

Hello @Saharkakavand, thanks for giving this a go. To me, the loss after 60 epochs looks reasonable - a better indicator of whether the training is stagnating is the loss.png saved at experiment/$data$-demo/. The error message you pointed out seems to come from running the inference on multiple GPUs in parallel and might be a bug in the code that we need to fix - I am not sure at the moment. Is GPU 0 being used for training while you try to run the prediction notebook simultaneously? (Maybe stopping the training notebook before running the prediction notebook could help? You can always resume training later from the last checkpoint by using the resume_path variable.) If this doesn't help, I will dig deeper and let you know.
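A minimal sketch of that check, assuming it is run in the prediction notebook after the training notebook has been stopped (the memory calls are generic PyTorch; test_configs, avg_background_intensity and normalization_factor are the variables defined in the earlier notebook cells, and the begin_evaluating call is the one shown in the traceback above):

```python
import torch
from EmbedSeg.test import begin_evaluating

# Release any cached allocations held by this kernel and confirm GPU 0 is
# mostly free before starting inference.
torch.cuda.empty_cache()
print(f"allocated on GPU 0: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")

begin_evaluating(test_configs, verbose=True,
                 avg_bg=avg_background_intensity / normalization_factor)
```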

lmanan avatar Jul 06 '21 11:07 lmanan

I stopped training and ran the test on the same GPU. I have attached loss.png; it didn't change much, as you can see: [Screenshot from 2021-07-06 13-16-27]

Saharkakavand avatar Jul 06 '21 11:07 Saharkakavand

The IoU profile appears strange to me - my understanding is that the IoU shouldn't really go down unless the validation and train images are quite different in their appearance. Would it be possible for me to look at the images and instance masks which you are training the network on? Maybe I can try running them on my setup here, if that helps? (If sharing the original images is not possible, would sharing downsampled versions of the images work?)
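For sharing, a downsampling sketch along these lines would work (it assumes the volumes are TIFFs, that scikit-image is installed, and that each volume fits in RAM; the file names are placeholders):

```python
import tifffile
from skimage.transform import downscale_local_mean

# Downsample a raw volume by averaging 2x2x2 blocks, and the matching instance
# mask by plain subsampling so that label ids are preserved.
img = tifffile.imread('img.tif')    # raw intensity volume, shape (z, y, x)
mask = tifffile.imread('mask.tif')  # instance label volume, same shape

img_small = downscale_local_mean(img, (2, 2, 2)).astype(img.dtype)
mask_small = mask[::2, ::2, ::2]

tifffile.imwrite('img_small.tif', img_small)
tifffile.imwrite('mask_small.tif', mask_small)
```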

lmanan avatar Jul 06 '21 11:07 lmanan