JVCR-3Dlandmark
A random bug
Hi everyone,
When I train the net, I get a random bug: an error occurs at a random batch.
Processing |########################## | (50860/61225) Data: 2.597300s | Batch: 3.278s | Total: 0:56:45 |
Processing |########################## | (50880/61225) Data: 0.000299s | Batch: 0.681s | Total: 0:56:46 |
Processing |########################## | (50900/61225) Data: 0.000489s | Batch: 0.691s | Total: 0:56:47 |
Processing |########################## | (50920/61225) Data: 0.000502s | Batch: 0.683s | Total: 0:56:47 |
Processing |########################## | (50940/61225) Data: 2.483688s | Batch: 3.165s | Total: 0:56:50 | ETA: 0:10:09 | LOSS vox: 0.0337; coord: 0.0034 | NME: 0.3116
Traceback (most recent call last):
File "train.py", line 281, in
So, what's the problem?
Replacing int() with np.int() in utils/imutils.py#L94-L95 may solve this problem.
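For reference, a minimal sketch of what that suggestion amounts to, assuming those lines compute the Gaussian patch bounds in the usual stacked-hourglass style (the variable names and expressions here are assumptions, not the repository's actual code):

import torch

# Hypothetical reconstruction of utils/imutils.py#L94-L95: the
# top-left and bottom-right corners of the Gaussian patch around a
# landmark pt, using the values from the log above.
pt = torch.tensor([48.4674, 5.6901, -0.0979])
sigma = 1
ul = [int(pt[0] - 3 * sigma), int(pt[1] - 3 * sigma)]          # top-left
br = [int(pt[0] + 3 * sigma + 1), int(pt[1] + 3 * sigma + 1)]  # bottom-right
# The suggested change is int(...) -> np.int(...) in both lines.

Note that np.int was simply an alias for Python's built-in int (it has since been removed in NumPy 1.24+), so by itself this cast change should not alter the rounding behaviour, which is consistent with the follow-up report below.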
I met this problem too. After modifying int to np.int, the error still happens. I use PyTorch 0.4.0. Hope you can help! @HongwenZhang
Did you solve this problem? @liumarcus70s
Hi @JackLongKing, could you print the values of ul, br, and pt when the bug occurs?
The printed information is as follows:
//============================================================================
('pt: \n', tensor([ 48.4674, 5.6901, -0.0979]))
('ul: \n', [45, 0])
('br: \n', [52, 7])
Traceback (most recent call last):
File "train.py", line 278, in
These values seem inconsistent with utils/imutils.py#L94-L95. sigma is 1, so int(5.6901 - 3 * 1) should be 2 for ul[1], shouldn't it?
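A quick check of that arithmetic in isolation, with pt hard-coded from the printout above (assuming pt is indexed directly, which may differ from what the code actually does):

>>> import torch
>>> pt = torch.tensor([48.4674, 5.6901, -0.0979])
>>> int(pt[1] - 3 * 1)
2

So a plain int() cast of these values cannot by itself explain ul[1] being 0.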
Could you carefully check and provide the values at utils/imutils.py#L94, and img_x, img_y, g_x, g_y at utils/imutils.py#L119?
The print code is as follows:
//==================================================================
print("pt: {}\n".format(pt))
print("ul: {}\n".format(ul))
print("br: {}\n".format(br))
print('g_x[0]: {},g_x[1]: {}\n'.format(g_x[0],g_x[1]))
print('g_y[0]: {},g_y[1]: {}\n'.format(g_y[0],g_y[1]))
print('img_x[0]: {},img_x[1]: {}\n'.format(img_x[0],img_x[1]))
print('img_y[0]: {},img_y[1]: {}\n'.format(img_y[0],img_y[1]))
//==================================================================
And the output information is as follows:
//==================================================================
pt: tensor([ 50.2262, 18.8357, -0.0273])
ul: [47, 15]
br: [54, 22]
g_x[0]: 0,g_x[1]: 7
g_y[0]: 0,g_y[1]: 7
img_x[0]: 47,img_x[1]: 54
img_y[0]: 15,img_y[1]: 22
pt: tensor([ 49.
ValueError: Traceback (most recent call last):
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in __getitem__
    target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/utils/imutils.py", line 130, in draw_labelvolume
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (7,8)
//==================================================================
From the output information, maybe this is caused by pt? @HongwenZhang
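For context, the broadcast failure itself is easy to reproduce in isolation. A minimal sketch (with made-up slice bounds, not the values from the log) that raises the same ValueError:

import numpy as np

img = np.zeros((64, 64))
g = np.ones((7, 7))
# The img slice is 7 rows x 8 columns, but g supplies only 7 columns,
# so NumPy cannot broadcast (7,7) into (7,8).
img[15:22, 47:55] = g[0:7, 0:7]
# ValueError: could not broadcast input array from shape (7,7) into shape (7,8)

In other words, the error means the destination slice of img and the source slice of g disagree in width by one pixel for that landmark.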
These values are so weird: given them, both img[15:22, 47:54] and g[0:7, 0:7] should have the same shape of (7, 7).
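For reference, this is how such patch bounds are typically clipped in stacked-hourglass-style code (clip_bounds is an illustrative name, and the repository's actual implementation may differ); with integer ul and br, the clipping makes the two slices agree in size by construction:

import numpy as np

def clip_bounds(ul, br, img):
    # Clip the Gaussian patch [ul, br) to the image, returning slice
    # bounds into the patch (g_*) and into the image (img_*).
    g_x = (max(0, -ul[0]), min(br[0], img.shape[1]) - ul[0])
    g_y = (max(0, -ul[1]), min(br[1], img.shape[0]) - ul[1])
    img_x = (max(0, ul[0]), min(br[0], img.shape[1]))
    img_y = (max(0, ul[1]), min(br[1], img.shape[0]))
    # With integer ul/br, each img_* range is exactly as wide as the
    # matching g_* range, so the assignment at L119 cannot mismatch.
    return g_x, g_y, img_x, img_y

g_x, g_y, img_x, img_y = clip_bounds([47, 15], [54, 22], np.zeros((64, 64)))
# -> (0, 7), (0, 7), (47, 54), (15, 22): consistent with the log above

This is why a shape mismatch points at the int() casts or at pt itself rather than at the clipping.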
So, I think it's better to replace utils/imutils.py#L119 with the following code for debugging.
try:
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
except:
    # Dump everything relevant before re-raising, so the failing
    # landmark can be inspected.
    print('something wrong happened.\n')
    print('pt: {}\n'.format(pt))
    print('ul: {}\n'.format(ul))
    print('br: {}\n'.format(br))
    print('sigma: {}\n'.format(sigma))
    print('g_x[0]: {}, g_x[1]: {}\n'.format(g_x[0], g_x[1]))
    print('g_y[0]: {}, g_y[1]: {}\n'.format(g_y[0], g_y[1]))
    print('img_x[0]: {}, img_x[1]: {}\n'.format(img_x[0], img_x[1]))
    print('img_y[0]: {}, img_y[1]: {}\n'.format(img_y[0], img_y[1]))
    print('img shape: {}\n'.format(img.shape))
    print('g shape: {}\n'.format(g.shape))
    raise
Yes, the try...except was added in utils/imutils.py, but then I met another problem, out of memory, which needs another try. My device is a Titan X (12GB). My log is as follows, and thank you for your help! @HongwenZhang
//=================================================================
==> creating model: stacks=4, blocks=1, z-res=[1, 2, 4, 64]
coarse to fine mode: True
p2v params: 13.01M
v2c params: 19.46M
using ADAM optimizer.
Epoch: 1 | LR: 0.00025000
pre_training...
train.py:201: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses_vox.update(loss_vox.data[0], inputs.size(0))
train.py:202: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
losses_coord.update(loss_coord.data[0], inputs.size(0))
train.py:217: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
loss='vox: {:.4f}; coord: {:.4f}'.format(loss_vox.data[0], loss_coord.data[0]),
train.py:122: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
input_var = torch.autograd.Variable(inputs.cuda(), volatile=True)
train.py:124: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
range(len(target))]
train.py:125: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
coord_var = torch.autograd.Variable(meta['tpts_inp'].cuda(async=True), volatile=True)
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.del of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f57d4ad7fd0>> ignored
Traceback (most recent call last):
File "train.py", line 278, in
The 'out of memory' error is out of the scope of this issue. To reproduce the bug that occurred in the dataloader, we can bypass the forward pass of the network by adding continue at train.py#L145, as sketched below.
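A minimal sketch of that bypass (the loop header is assumed from a typical training loop, not copied from train.py):

for i, (inputs, target, meta) in enumerate(train_loader):
    # Skip the forward/backward pass entirely: merely iterating the
    # loader exercises fa68pt3D.__getitem__ and draw_labelvolume,
    # which is where the broadcast error originates.
    continue

With continue in place, every batch still goes through the dataset code while using almost no GPU memory, so the random ValueError can be reproduced quickly.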