pytorch-pose
validation acc drops down drastically after epoch 10
I just ran a very simple hg8 architecture; the log is as follows.
==> creating model 'hg', stacks=8, blocks=1
Total params: 25.59M
Mean: 0.4404, 0.4440, 0.4327
Std: 0.2458, 0.2410, 0.2468
Epoch: 1 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000182s | Batch: 2.038s | Total: 0:20:48 | ETA: 0:00:01 | Loss: 0.0063 | Acc: 0.1950
Processing |################################| (493/493) Data: 0.000143s | Batch: 0.154s | Total: 0:01:15 | ETA: 0:00:01 | Loss: 0.0077 | Acc: 0.3638
Epoch: 2 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000244s | Batch: 0.281s | Total: 0:20:44 | ETA: 0:00:01 | Loss: 0.0052 | Acc: 0.3876
Processing |################################| (493/493) Data: 0.000139s | Batch: 0.140s | Total: 0:01:09 | ETA: 0:00:01 | Loss: 0.0072 | Acc: 0.5017
Epoch: 3 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000260s | Batch: 0.279s | Total: 0:20:39 | ETA: 0:00:01 | Loss: 0.0048 | Acc: 0.5024
Processing |################################| (493/493) Data: 0.000133s | Batch: 0.141s | Total: 0:01:09 | ETA: 0:00:01 | Loss: 0.0064 | Acc: 0.5538
Epoch: 4 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000276s | Batch: 0.277s | Total: 0:20:33 | ETA: 0:00:01 | Loss: 0.0046 | Acc: 0.5604
Processing |################################| (493/493) Data: 0.000131s | Batch: 0.133s | Total: 0:01:05 | ETA: 0:00:01 | Loss: 0.0055 | Acc: 0.6337
Epoch: 5 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000401s | Batch: 0.286s | Total: 0:20:35 | ETA: 0:00:01 | Loss: 0.0044 | Acc: 0.6009
Processing |################################| (493/493) Data: 0.000134s | Batch: 0.130s | Total: 0:01:04 | ETA: 0:00:01 | Loss: 0.0049 | Acc: 0.6572
Epoch: 6 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000247s | Batch: 0.283s | Total: 0:20:32 | ETA: 0:00:01 | Loss: 0.0043 | Acc: 0.6289
Processing |################################| (493/493) Data: 0.000095s | Batch: 0.131s | Total: 0:01:04 | ETA: 0:00:01 | Loss: 0.0046 | Acc: 0.6670
Epoch: 7 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000177s | Batch: 0.252s | Total: 0:20:42 | ETA: 0:00:01 | Loss: 0.0042 | Acc: 0.6469
Processing |################################| (493/493) Data: 0.000153s | Batch: 0.128s | Total: 0:01:03 | ETA: 0:00:01 | Loss: 0.0046 | Acc: 0.6934
Epoch: 8 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000222s | Batch: 0.282s | Total: 0:20:23 | ETA: 0:00:01 | Loss: 0.0041 | Acc: 0.6661
Processing |################################| (493/493) Data: 0.000157s | Batch: 0.130s | Total: 0:01:04 | ETA: 0:00:01 | Loss: 0.0056 | Acc: 0.6942
Epoch: 9 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000262s | Batch: 0.276s | Total: 0:20:26 | ETA: 0:00:01 | Loss: 0.0040 | Acc: 0.6812
Processing |################################| (493/493) Data: 0.000144s | Batch: 0.149s | Total: 0:01:13 | ETA: 0:00:01 | Loss: 0.0068 | Acc: 0.6930
Epoch: 10 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000206s | Batch: 0.254s | Total: 0:20:36 | ETA: 0:00:01 | Loss: 0.0039 | Acc: 0.6923
Processing |################################| (493/493) Data: 0.000211s | Batch: 0.149s | Total: 0:01:13 | ETA: 0:00:01 | Loss: 0.0076 | Acc: 0.7049
Epoch: 11 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000268s | Batch: 0.290s | Total: 0:20:27 | ETA: 0:00:01 | Loss: 0.0039 | Acc: 0.7046
Processing |################################| (493/493) Data: 0.000173s | Batch: 0.142s | Total: 0:01:09 | ETA: 0:00:01 | Loss: 0.0095 | Acc: 0.7010
Epoch: 12 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000269s | Batch: 0.288s | Total: 0:20:57 | ETA: 0:00:01 | Loss: 0.0038 | Acc: 0.7108
Processing |################################| (493/493) Data: 0.000177s | Batch: 0.136s | Total: 0:01:07 | ETA: 0:00:01 | Loss: 0.0138 | Acc: 0.6223
Epoch: 13 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000226s | Batch: 0.253s | Total: 0:20:23 | ETA: 0:00:01 | Loss: 0.0037 | Acc: 0.7201
Processing |################################| (493/493) Data: 0.000237s | Batch: 0.140s | Total: 0:01:08 | ETA: 0:00:01 | Loss: 0.0221 | Acc: 0.5394
Epoch: 14 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000273s | Batch: 0.293s | Total: 0:20:31 | ETA: 0:00:01 | Loss: 0.0037 | Acc: 0.7289
Processing |################################| (493/493) Data: 0.000174s | Batch: 0.130s | Total: 0:01:04 | ETA: 0:00:01 | Loss: 0.0315 | Acc: 0.3212
Epoch: 15 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000453s | Batch: 0.321s | Total: 0:20:47 | ETA: 0:00:01 | Loss: 0.0036 | Acc: 0.7355
Processing |################################| (493/493) Data: 0.000147s | Batch: 0.149s | Total: 0:01:13 | ETA: 0:00:01 | Loss: 0.0528 | Acc: 0.0971
Epoch: 16 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000270s | Batch: 0.280s | Total: 0:20:42 | ETA: 0:00:01 | Loss: 0.0036 | Acc: 0.7417
Processing |################################| (493/493) Data: 0.000178s | Batch: 0.129s | Total: 0:01:03 | ETA: 0:00:01 | Loss: 0.0900 | Acc: 0.0151
Epoch: 17 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000268s | Batch: 0.289s | Total: 0:20:28 | ETA: 0:00:01 | Loss: 0.0035 | Acc: 0.7481
Processing |################################| (493/493) Data: 0.000145s | Batch: 0.148s | Total: 0:01:13 | ETA: 0:00:01 | Loss: 0.1890 | Acc: 0.0089
Epoch: 18 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000220s | Batch: 0.276s | Total: 0:20:24 | ETA: 0:00:01 | Loss: 0.0035 | Acc: 0.7525
Processing |################################| (493/493) Data: 0.000082s | Batch: 0.136s | Total: 0:01:06 | ETA: 0:00:01 | Loss: 0.3065 | Acc: 0.0000
Epoch: 19 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000186s | Batch: 0.275s | Total: 0:20:02 | ETA: 0:00:01 | Loss: 0.0035 | Acc: 0.7589
Processing |################################| (493/493) Data: 0.000080s | Batch: 0.135s | Total: 0:01:06 | ETA: 0:00:01 | Loss: 1.0547 | Acc: 0.0015
Epoch: 20 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000127s | Batch: 0.240s | Total: 0:20:19 | ETA: 0:00:01 | Loss: 0.0034 | Acc: 0.7641
Processing |################################| (493/493) Data: 0.000147s | Batch: 0.130s | Total: 0:01:03 | ETA: 0:00:01 | Loss: 1.7841 | Acc: 0.0019
Epoch: 21 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000245s | Batch: 0.271s | Total: 0:20:13 | ETA: 0:00:01 | Loss: 0.0034 | Acc: 0.7690
Processing |################################| (493/493) Data: 0.000080s | Batch: 0.128s | Total: 0:01:02 | ETA: 0:00:01 | Loss: 4.3475 | Acc: 0.0000
Epoch: 22 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000315s | Batch: 0.295s | Total: 0:20:08 | ETA: 0:00:01 | Loss: 0.0034 | Acc: 0.7716
Processing |################################| (493/493) Data: 0.000086s | Batch: 0.136s | Total: 0:01:07 | ETA: 0:00:01 | Loss: 11.9544 | Acc: 0.0029
Epoch: 23 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000165s | Batch: 0.261s | Total: 0:20:04 | ETA: 0:00:01 | Loss: 0.0034 | Acc: 0.7757
Processing |################################| (493/493) Data: 0.000140s | Batch: 0.141s | Total: 0:01:09 | ETA: 0:00:01 | Loss: 22.9730 | Acc: 0.0000
Epoch: 24 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000216s | Batch: 0.267s | Total: 0:20:06 | ETA: 0:00:01 | Loss: 0.0034 | Acc: 0.7793
Processing |################################| (493/493) Data: 0.000123s | Batch: 0.137s | Total: 0:01:07 | ETA: 0:00:01 | Loss: 141.7624 | Acc: 0.0000
Hmm, it seems this problem cannot be reproduced. Would you mind training again to see whether everything goes well?
Hey, I've met exactly the same problem. I tried to train a 2-stack HG network with the original code and params. I guess this is caused by strong overfitting, but I am not sure why the overfitting occurs. Do you have any idea? Here's the log: log.txt
Hi @gdwei @bearpaw, I have found the same problem on my side. I used the model with 8 hourglass modules.
Epoch LR Train Loss Val Loss Train Acc Val Acc
1.000000 0.000250 0.006231 0.008109 0.194155 0.332968
2.000000 0.000250 0.005188 0.006057 0.387743 0.477342
3.000000 0.000250 0.004838 0.005032 0.502596 0.584106
4.000000 0.000250 0.004606 0.004787 0.562090 0.629260
5.000000 0.000250 0.004426 0.004789 0.600115 0.638421
6.000000 0.000250 0.004286 0.004692 0.627019 0.674266
7.000000 0.000250 0.004173 0.004733 0.649596 0.681682
8.000000 0.000250 0.004089 0.005544 0.662832 0.644043
9.000000 0.000250 0.004001 0.005081 0.680730 0.703755
10.000000 0.000250 0.003925 0.005816 0.692677 0.705782
11.000000 0.000250 0.003865 0.005736 0.702876 0.713184
12.000000 0.000250 0.003804 0.007214 0.713316 0.689739
13.000000 0.000250 0.003744 0.009516 0.722215 0.716273
14.000000 0.000250 0.003682 0.016769 0.731847 0.655829
15.000000 0.000250 0.003640 0.026813 0.735956 0.637782
16.000000 0.000250 0.003587 0.033836 0.743873 0.287533
17.000000 0.000250 0.003552 0.055812 0.747483 0.110421
18.000000 0.000250 0.003506 0.090679 0.754163 0.026939
19.000000 0.000250 0.003469 0.246852 0.760248 0.052983
20.000000 0.000250 0.003439 0.478084 0.763902 0.020653
This is my log.
@djangogo For me, setting a smaller learning rate could help; for example, you may set it to around 1e-5. Other tricks for adjusting the learning rate could also be helpful.
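For illustration, a minimal sketch of what that change looks like (the tiny stand-in model and the RMSprop choice here are placeholders, not this repo's exact setup; use whatever optimizer your training script actually builds):

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the stacked-hourglass network
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-5)  # e.g. 1e-5 instead of 2.5e-4

# or shrink the learning rate of an already-constructed optimizer in place:
for group in optimizer.param_groups:
    group['lr'] = 1e-5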
Hello, it looks like I also have the same problem.
Processing |################################| (944/944) Data: 0.000173s | Batch: 0.141s | Total: 0:02:12 | ETA: 0:00:01 | Loss: 4.7316 | Acc: 0.0011
Epoch: 3 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.001675s | Batch: 0.322s | Total: 2:39:51 | ETA: 0:00:01 | Loss: 0.0036 | Acc: 0.5417
Processing |################################| (944/944) Data: 0.000312s | Batch: 0.147s | Total: 0:02:18 | ETA: 0:00:01 | Loss: 5096.5776 | Acc: 0.0000
Epoch: 4 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.001971s | Batch: 0.333s | Total: 2:39:24 | ETA: 0:00:01 | Loss: 0.0036 | Acc: 0.5267
Processing |################################| (944/944) Data: 0.000199s | Batch: 0.148s | Total: 0:02:19 | ETA: 0:00:01 | Loss: 53171.3798 | Acc: 0.0021
Epoch: 5 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.000199s | Batch: 0.324s | Total: 2:40:41 | ETA: 0:00:01 | Loss: 0.0035 | Acc: 0.5406
Processing |################################| (944/944) Data: 0.000270s | Batch: 0.147s | Total: 0:02:18 | ETA: 0:00:01 | Loss: 166093.0824 | Acc: 0.0000
Epoch: 6 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.001795s | Batch: 0.326s | Total: 2:39:21 | ETA: 0:00:01 | Loss: 0.0035 | Acc: 0.5556
Processing |################################| (944/944) Data: 0.000197s | Batch: 0.144s | Total: 0:02:16 | ETA: 0:00:01 | Loss: 808754.2019 | Acc: 0.0001
Epoch: 7 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.002017s | Batch: 0.342s | Total: 2:40:26 | ETA: 0:00:01 | Loss: 0.0035 | Acc: 0.5202
Processing |################################| (944/944) Data: 0.000228s | Batch: 0.147s | Total: 0:02:18 | ETA: 0:00:01 | Loss: 377698.3226 | Acc: 0.0000
Epoch: 8 | LR: 0.00025000
Processing |########## | (7420/22333) Data: 0.002047s | Batch: 0.424s | Total: 0:53:10 | ETA: 1:47:54 | Loss: 0.0038 | Acc: 0.3522^CProcess Process-15:
@gdwei could you report your result when you set lr=1e-5?
In our lab, two researchers tried to run the code, but only one of them had the problem. We have kept retraining over and over, but the result is similar. We used the same code (without modifying anything) and the same data (copied from one environment to the other). Does anybody have an idea?
Environment with the problem: Ubuntu 16.04, Python 2.7.13, PyTorch 0.3.0, OpenCV 3.3.0, 1080 Ti
Environment without the problem: Ubuntu 16.04, Python 2.7.12, PyTorch 0.3.0, OpenCV 3.3.0, GTX TITAN
(Added) On the environment that causes the problem, the hourglass trained successfully with learning rate 1e-4. Please check.
Hey guys, I have the same problem too: if I finetune the 8-stack hg network on a new dataset (about 30K images), the validation accuracy drops dramatically after only 4 epochs.
I use lr 2.5e-4.
I am trying 1e-4; let's see what will happen.
I use 2.5e-5 now, and I only restore the bottom 4 stacks to start the 8-stack training. It seems that overfitting still exists, but I think the problem is actually not as serious as the printed accuracy suggests. I suspect the validation accuracy drops like a disaster because this code uses a threshold when calculating accuracy:
def dist_acc(dists, thr=0.5):
    ''' Return percentage below threshold while ignoring values with a -1 '''
    if dists.ne(-1).sum() > 0:
        return dists.le(thr).eq(dists.ne(-1)).sum()*1.0 / dists.ne(-1).sum()
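As a toy illustration (the same function with an explicit fallback added so it runs standalone, applied to made-up normalized distances): distances just inside versus just outside the 0.5 threshold correspond to only a small change in loss, but a huge change in reported accuracy:

import torch

def dist_acc(dists, thr=0.5):
    '''Percentage of valid (non -1) distances below the threshold, as above.'''
    if dists.ne(-1).sum() > 0:
        return dists.le(thr).eq(dists.ne(-1)).sum() * 1.0 / dists.ne(-1).sum()
    return -1  # no valid distances in this batch (fallback added for this toy example)

# distances barely inside the threshold -> reported accuracy 1.0
print(dist_acc(torch.tensor([0.45, 0.48, 0.49, -1.0])))  # tensor(1.)
# almost the same distances, barely outside the threshold -> reported accuracy 0.0
print(dist_acc(torch.tensor([0.55, 0.58, 0.59, -1.0])))  # tensor(0.)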
The solution "xingyizhou" mentioned works for me. On the environment where PyTorch 0.4.0 causes the problem, I reinstalled PyTorch 0.2.0, and training finally finished without any problem.
However, I still cannot understand why PyTorch 0.4.0 doesn't work on certain computing environments with a 1080 Ti or TITAN X.
I use the same pretrained model and test it with different testing batch sizes, and surprisingly I get very different precision rates. The smaller the testing batch size, the higher the precision: with a batch size of 2 I get 80+%, but only 50+% when the batch size is 6 or 8. This seems very abnormal to me, as I thought the batch size shouldn't matter.
I'm experiencing the same issue running the MPII example with PyTorch 0.3.1, a 1080, and the default parameters except stacks=1.
Hi all @dongzhuoyao @gdwei @djangogo @salihkaragoz @Ben-Park @gdjmck @wisp5 @rockeyben, thanks for the report! In short, downgrading the PyTorch version to 0.1.12 will resolve this bug, though that is not an elegant way to do so. I have investigated this bug for some time (see https://github.com/xingyizhou/pytorch-pose-hg-3d/issues/16) but it is still unresolved. Recently I found that this bug also occurs in other architectures besides HourglassNet (but still with dense output) on PyTorch versions > 0.1.12, while v0.1.12 always works fine. I also found a similar bug report at https://discuss.pytorch.org/t/model-eval-gives-incorrect-loss-for-model-with-batchnorm-layers/7561/2, which also uses an MSE/L1 loss.
A natural conjecture is that the bug comes from the PyTorch BN implementation after 0.1.12 and might occur when applying a network with dense output and BN layers (but it is not reliably reproducible). The bug is less likely to come from the data processing or the hourglass implementation, since this repo and my implementation (https://github.com/xingyizhou/pytorch-pose-hg-3d/tree/2D) are independent. Please correct me if you have any counterexample or if you make more progress on this bug. You are welcome to discuss this bug further with me by dropping me an email at [email protected] . Thanks!
Hi all, as pointed out by @leoxiaobin, turning off cuDNN for the BN layers resolves the issue. This can be done by setting torch.backends.cudnn.enabled = False in main.py, which disables cuDNN for all layers and slows down training by about 1.5x, or by re-building PyTorch from source with cuDNN hacked out of the BN layers: https://github.com/pytorch/pytorch/blob/e8536c08a16b533fe0a9d645dd4255513f9f4fdd/aten/src/ATen/native/Normalization.cpp#L46
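For reference, a minimal sketch of the global workaround (assuming the training entry point is main.py, as mentioned above); note that this disables cuDNN for every layer, not just BN, hence the roughly 1.5x slowdown:

import torch

# place this near the top of main.py, before the model is built and training starts
torch.backends.cudnn.enabled = False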
Thanks @xingyizhou, let me make it concrete.
For newer PyTorch versions (such as 0.4.0 or 0.4.1), go to Python's package directory on your system and change batch_norm in torch/nn/functional.py.
On Windows:
PYTHONDIR/Lib/site-packages/torch/nn/functional.py
On Linux:
/usr/lib/python2.7/dist-packages/torch/nn/functional.py
or /usr/lib/python3.5/dist-packages/torch/nn/functional.py
The function currently looks like this:
def batch_norm(input, running_mean, running_var, weight=None, bias=None,
               training=False, momentum=0.1, eps=1e-5):
    r"""Applies Batch Normalization for each channel across a batch of data.

    See :class:`~torch.nn.BatchNorm1d`, :class:`~torch.nn.BatchNorm2d`,
    :class:`~torch.nn.BatchNorm3d` for details.
    """
    if training:
        size = list(input.size())
        if reduce(mul, size[2:], size[0]) == 1:
            raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
    return torch.batch_norm(
        input, weight, bias, running_mean, running_var,
        training, momentum, eps, torch.backends.cudnn.enabled
    )
Change it to:
def batch_norm(input, running_mean, running_var, weight=None, bias=None,
               training=False, momentum=0.1, eps=1e-5):
    r"""Applies Batch Normalization for each channel across a batch of data.

    See :class:`~torch.nn.BatchNorm1d`, :class:`~torch.nn.BatchNorm2d`,
    :class:`~torch.nn.BatchNorm3d` for details.
    """
    if training:
        size = list(input.size())
        if reduce(mul, size[2:], size[0]) == 1:
            raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
    return torch.batch_norm(
        input, weight, bias, running_mean, running_var,
        training, momentum, eps, False
    )
Hope it will help others. I also provide a bash script to patch PyTorch automatically (if you use Python 3.x, just change the paths accordingly):
#!/usr/bin/env bash
set -e
# patch for BN for pytorch after v0.1.12
PYPATH=/usr/local/lib/python2.7/dist-packages
PYTORCH_VERSION=`python -c "import torch as t; print(t.__version__)"`
if [ ! -e "${PYPATH}/torch/nn/functional.py.bak" ] && [ -e "${PYPATH}/torch/nn/functional.py" ]; then
    # backup
    sudo cp ${PYPATH}/torch/nn/functional.py ${PYPATH}/torch/nn/functional.py.bak
    # patch pytorch
    if [ "${PYTORCH_VERSION}" == "0.4.0" ]; then
        # for pytorch v0.4.0
        sudo sed -i "1194s/torch\.backends\.cudnn\.enabled/False/g" ${PYPATH}/torch/nn/functional.py
    elif [ "${PYTORCH_VERSION}" == "0.4.1" ]; then
        # for pytorch v0.4.1
        sudo sed -i "1254s/torch\.backends\.cudnn\.enabled/False/g" ${PYPATH}/torch/nn/functional.py
    fi
    echo "patch pytorch ${PYTORCH_VERSION} successfully"
else
    echo "You have patched the pytorch!"
fi
Huge thanks for your fix. I was running into this issue as well, and your fix seems to help.
Hello everyone, I have a problem when training the model from scratch. The model is a 2-stack HG network with the original code and params, and the lr is 2.5e-4. Can somebody help me solve the problem? Thanks!
Epoch LR Train Loss Val Loss Train Acc Val Acc
1.000000 0.000250 0.000820 0.001804 0.012718 0.016598
2.000000 0.000250 0.000605 0.001752 0.023349 0.018282
3.000000 0.000250 0.000601 0.003543 0.026636 0.016567
4.000000 0.000250 0.000601 0.002009 0.030169 0.023730
5.000000 0.000250 0.000605 0.007554 0.021984 0.024761
6.000000 0.000250 0.000593 0.001323 0.021573 0.023211
6.000000 0.000250 0.000581 0.000750 0.030056 0.037204
7.000000 0.000250 0.000579 0.001724 0.062914 0.042289
8.000000 0.000250 0.000574 0.006119 0.078612 0.029493
9.000000 0.000250 0.000568 0.002073 0.092811 0.032917
10.000000 0.000250 0.000565 0.002764 0.103415 0.082355
11.000000 0.000250 0.000559 0.004456 0.118935 0.069083
12.000000 0.000250 0.000554 0.001235 0.136579 0.111532
13.000000 0.000250 0.000551 0.001291 0.157845 0.139160
14.000000 0.000250 0.000546 0.000833 0.172080 0.187071
15.000000 0.000250 0.000540 0.000677 0.188202 0.137926
16.000000 0.000250 0.000536 0.000822 0.204400 0.126236
17.000000 0.000250 0.000529 0.007549 0.223867 0.023203
18.000000 0.000250 0.000514 0.001865 0.248268 0.100743
19.000000 0.000250 0.000500 0.001187 0.281679 0.162482
20.000000 0.000250 0.000491 0.002932 0.311082 0.045916
21.000000 0.000250 0.000484 0.001115 0.335782 0.107354
22.000000 0.000250 0.000476 0.009399 0.356727 0.008605
23.000000 0.000250 0.000470 0.000646 0.369132 0.005161
24.000000 0.000250 0.000463 0.003118 0.386070 0.022706
25.000000 0.000250 0.000457 0.000577 0.399309 0.018974
26.000000 0.000250 0.000451 0.000582 0.417991 0.019046
27.000000 0.000250 0.000446 0.001388 0.432328 0.010517
28.000000 0.000250 0.000441 0.000769 0.444523 0.019780
29.000000 0.000250 0.000436 0.000739 0.456251 0.014726
30.000000 0.000250 0.000432 0.001276 0.469130 0.056828
31.000000 0.000250 0.000428 0.001579 0.478094 0.023356
32.000000 0.000250 0.000423 0.000569 0.491334 0.006764
33.000000 0.000250 0.000420 0.000907 0.499913 0.045504
34.000000 0.000250 0.000416 0.000600 0.508063 0.101544
35.000000 0.000250 0.000412 0.000581 0.516998 0.077281
36.000000 0.000250 0.000408 0.000618 0.525647 0.047941
37.000000 0.000250 0.000404 0.000635 0.534216 0.036322
38.000000 0.000250 0.000402 0.000749 0.539956 0.002505
39.000000 0.000250 0.000401 0.000553 0.542420 0.070047
40.000000 0.000250 0.000396 0.000551 0.552550 0.038140
41.000000 0.000250 0.000393 0.000577 0.558081 0.025742
42.000000 0.000250 0.000390 0.000560 0.564162 0.031878
43.000000 0.000250 0.000387 0.000560 0.569910 0.008823
44.000000 0.000250 0.000384 0.000576 0.575056 0.007596
45.000000 0.000250 0.000381 0.000549 0.581056 0.003241
46.000000 0.000250 0.000379 0.000550 0.584639 0.000000
47.000000 0.000250 0.000377 0.000561 0.589435 0.026145
48.000000 0.000250 0.000374 0.000548 0.593368 0.020744
Hi, did you resolve the batch-size issue? I am still suffering from it when finetuning and only get around 60+% val accuracy.
Oh, it was something about the accuracy method: it only returned 1 or 0 for the whole batch sent in, so with a small batch size you have a higher probability of getting every sample right and scoring 1; otherwise you get 0.
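As a toy illustration of that effect (hypothetical scoring code, not this repo's evaluation): if a batch is scored as all-or-nothing, the averaged score depends strongly on the batch size, while per-sample averaging does not:

import random

random.seed(0)
# pretend 80% of individual predictions are "correct"
correct = [random.random() < 0.8 for _ in range(2400)]

def all_or_nothing(flags, batch_size):
    # a batch scores 1 only if every sample in it is correct, else 0
    batches = [flags[i:i + batch_size] for i in range(0, len(flags), batch_size)]
    return sum(all(b) for b in batches) / len(batches)

print(all_or_nothing(correct, 2))    # roughly 0.8 ** 2 ~= 0.64
print(all_or_nothing(correct, 8))    # roughly 0.8 ** 8 ~= 0.17
print(sum(correct) / len(correct))   # ~0.80, independent of batch size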
So it was caused by a bug in your evaluation code, not by other reasons?
Yeah, I used the evaluation code from the original source back then; maybe it conflicted with the PyTorch version, and it worked all right once I fixed it. Does your performance on the test set differ with different batch sizes?
Yes, but I am finetuning the hourglass model on another dataset. My model works well on MPII but much worse when finetuned on that dataset, at least with a big gap from the results reported by other researchers. Their code is written in Torch, so I am now searching for anything that can help.