JVCR-3Dlandmark
A random bug
Hi everyone,
When I train the net, I get a random bug: an error occurs at a random batch.
Processing |########################## | (50860/61225) Data: 2.597300s | Batch: 3.278s | Total: 0:56:45 |
Processing |########################## | (50880/61225) Data: 0.000299s | Batch: 0.681s | Total: 0:56:46 |
Processing |########################## | (50900/61225) Data: 0.000489s | Batch: 0.691s | Total: 0:56:47 |
Processing |########################## | (50920/61225) Data: 0.000502s | Batch: 0.683s | Total: 0:56:47 |
Processing |########################## | (50940/61225) Data: 2.483688s | Batch: 3.165s | Total: 0:56:50 | ETA: 0:10:09 | LOSS vox: 0.0337; coord: 0.0034 | NME: 0.3116
Traceback (most recent call last):
File "train.py", line 281, in
So, what's the problem?
Replacing int() with np.int() in utils/imutils.py#L94-L95 may solve this problem.
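For reference, a minimal sketch of what that suggestion amounts to, assuming those lines compute the Gaussian patch bounds in the usual stacked-hourglass style (the variable names and expressions here are assumptions, not the repository's actual code):

import torch

# Hypothetical reconstruction of utils/imutils.py#L94-L95: the
# top-left and bottom-right corners of the Gaussian patch around a
# landmark pt, using the values from the log above.
pt = torch.tensor([48.4674, 5.6901, -0.0979])
sigma = 1
ul = [int(pt[0] - 3 * sigma), int(pt[1] - 3 * sigma)]          # top-left
br = [int(pt[0] + 3 * sigma + 1), int(pt[1] + 3 * sigma + 1)]  # bottom-right
# The suggested change is int(...) -> np.int(...) in both lines.

Note that np.int was simply an alias for Python's built-in int (it has since been removed in NumPy 1.24+), so by itself this cast change should not alter the rounding behaviour, which is consistent with the follow-up report below.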
I met this problem too. After modifying int to np.int, the error still happens. I use PyTorch 0.4.0. Hope you can help! @HongwenZhang
Did you solve this problem? @liumarcus70s
Hi @JackLongKing, could you print the values of ul, br, and pt when the bug occurs?
The printed information is as follows:
//============================================================================
('pt: \n', tensor([ 48.4674, 5.6901, -0.0979]))
('ul: \n', [45, 0])
('br: \n', [52, 7])
Traceback (most recent call last):
File "train.py", line 278, in
These values seem inconsistent with utils/imutils.py#L94-L95. sigma is 1, so int(5.6901 - 3 * 1) should be 2 for ul[1], shouldn't it?
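A quick check of that arithmetic in isolation, with pt hard-coded from the printout above (assuming pt is indexed directly, which may differ from what the code actually does):

>>> import torch
>>> pt = torch.tensor([48.4674, 5.6901, -0.0979])
>>> int(pt[1] - 3 * 1)
2

So a plain int() cast of these values cannot by itself explain ul[1] being 0.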
Could you carefully check and provide the values at utils/imutils.py#L94, and img_x, img_y, g_x, g_y at utils/imutils.py#L119?
The print code is as follows:
//==================================================================
print("pt: {}\n".format(pt))
print("ul: {}\n".format(ul))
print("br: {}\n".format(br))
print('g_x[0]: {},g_x[1]: {}\n'.format(g_x[0],g_x[1]))
print('g_y[0]: {},g_y[1]: {}\n'.format(g_y[0],g_y[1]))
print('img_x[0]: {},img_x[1]: {}\n'.format(img_x[0],img_x[1]))
print('img_y[0]: {},img_y[1]: {}\n'.format(img_y[0],img_y[1]))
//==================================================================
And the output information is as follows:
//==================================================================
pt: tensor([ 50.2262, 18.8357, -0.0273])
ul: [47, 15]
br: [54, 22]
g_x[0]: 0,g_x[1]: 7
g_y[0]: 0,g_y[1]: 7
img_x[0]: 47,img_x[1]: 54
img_y[0]: 15,img_y[1]: 22
pt: tensor([ 49.
ValueError: Traceback (most recent call last):
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in __getitem__
    target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/utils/imutils.py", line 130, in draw_labelvolume
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (7,8)
//==================================================================
From the output information, maybe this is caused by pt? @HongwenZhang
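For context, the broadcast failure itself is easy to reproduce in isolation. A minimal sketch (with made-up slice bounds, not the values from the log) that raises the same ValueError:

import numpy as np

img = np.zeros((64, 64))
g = np.ones((7, 7))
# The img slice is 7 rows x 8 columns, but g supplies only 7 columns,
# so NumPy cannot broadcast (7,7) into (7,8).
img[15:22, 47:55] = g[0:7, 0:7]
# ValueError: could not broadcast input array from shape (7,7) into shape (7,8)

In other words, the error means the destination slice of img and the source slice of g disagree in width by one pixel for that landmark.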
These values are so weird: given them, both img[15:22, 47:54] and g[0:7, 0:7] should have the same shape of (7, 7).
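For reference, this is how such patch bounds are typically clipped in stacked-hourglass-style code (clip_bounds is an illustrative name, and the repository's actual implementation may differ); with integer ul and br, the clipping makes the two slices agree in size by construction:

import numpy as np

def clip_bounds(ul, br, img):
    # Clip the Gaussian patch [ul, br) to the image, returning slice
    # bounds into the patch (g_*) and into the image (img_*).
    g_x = (max(0, -ul[0]), min(br[0], img.shape[1]) - ul[0])
    g_y = (max(0, -ul[1]), min(br[1], img.shape[0]) - ul[1])
    img_x = (max(0, ul[0]), min(br[0], img.shape[1]))
    img_y = (max(0, ul[1]), min(br[1], img.shape[0]))
    # With integer ul/br, each img_* range is exactly as wide as the
    # matching g_* range, so the assignment at L119 cannot mismatch.
    return g_x, g_y, img_x, img_y

g_x, g_y, img_x, img_y = clip_bounds([47, 15], [54, 22], np.zeros((64, 64)))
# -> (0, 7), (0, 7), (47, 54), (15, 22): consistent with the log above

This is why a shape mismatch points at the int() casts or at pt itself rather than at the clipping.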
So, I think it's better to replace utils/imutils.py#L119 with the following code for debugging.
try:
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
except:
    # Dump everything relevant before re-raising, so the failing
    # landmark can be inspected.
    print('something wrong happened.\n')
    print('pt: {}\n'.format(pt))
    print('ul: {}\n'.format(ul))
    print('br: {}\n'.format(br))
    print('sigma: {}\n'.format(sigma))
    print('g_x[0]: {}, g_x[1]: {}\n'.format(g_x[0], g_x[1]))
    print('g_y[0]: {}, g_y[1]: {}\n'.format(g_y[0], g_y[1]))
    print('img_x[0]: {}, img_x[1]: {}\n'.format(img_x[0], img_x[1]))
    print('img_y[0]: {}, img_y[1]: {}\n'.format(img_y[0], img_y[1]))
    print('img shape: {}\n'.format(img.shape))
    print('g shape: {}\n'.format(g.shape))
    raise
Yes, the try...except was added in utils/imutils.py, but then I met another problem, out of memory, which needs another try. My device is a Titan X (12GB). My log is as follows, and thank you for your help! @HongwenZhang
//=================================================================
==> creating model: stacks=4, blocks=1, z-res=[1, 2, 4, 64]
coarse to fine mode: True
p2v params: 13.01M
v2c params: 19.46M
using ADAM optimizer.
Epoch: 1 | LR: 0.00025000
pre_training...
train.py:201: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses_vox.update(loss_vox.data[0], inputs.size(0))
train.py:202: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses_coord.update(loss_coord.data[0], inputs.size(0))
train.py:217: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
loss='vox: {:.4f}; coord: {:.4f}'.format(loss_vox.data[0], loss_coord.data[0]),
train.py:122: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
input_var = torch.autograd.Variable(inputs.cuda(), volatile=True)
train.py:124: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
range(len(target))]
train.py:125: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
coord_var = torch.autograd.Variable(meta['tpts_inp'].cuda(async=True), volatile=True)
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.del of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f57d4ad7fd0>> ignored
Traceback (most recent call last):
File "train.py", line 278, in
The 'out of memory' error is out of the scope of this issue. To reproduce the bug that occurred in the dataloader, we can bypass the forward pass of the network by adding continue at train.py#L145, as sketched below.
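A minimal sketch of that bypass (the loop header is assumed from a typical training loop, not copied from train.py):

for i, (inputs, target, meta) in enumerate(train_loader):
    # Skip the forward/backward pass entirely: merely iterating the
    # loader exercises fa68pt3D.__getitem__ and draw_labelvolume,
    # which is where the broadcast error originates.
    continue

With continue in place, every batch still goes through the dataset code while using almost no GPU memory, so the random ValueError can be reproduced quickly.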