
validation acc drops down drastically after epoch 10

Open dongzhuoyao opened this issue 7 years ago • 21 comments

I just ran a very simple hg8 architecture; the log is as follows.

==> creating model 'hg', stacks=8, blocks=1
    Total params: 25.59M
    Mean: 0.4404, 0.4440, 0.4327
    Std:  0.2458, 0.2410, 0.2468

Epoch: 1 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000182s | Batch: 2.038s | Total: 0:20:48 | ETA: 0:00:01 | Loss: 0.0063 | Acc:  0.1950
Processing |################################| (493/493) Data: 0.000143s | Batch: 0.154s | Total: 0:01:15 | ETA: 0:00:01 | Loss: 0.0077 | Acc:  0.3638

Epoch: 2 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000244s | Batch: 0.281s | Total: 0:20:44 | ETA: 0:00:01 | Loss: 0.0052 | Acc:  0.3876
Processing |################################| (493/493) Data: 0.000139s | Batch: 0.140s | Total: 0:01:09 | ETA: 0:00:01 | Loss: 0.0072 | Acc:  0.5017

Epoch: 3 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000260s | Batch: 0.279s | Total: 0:20:39 | ETA: 0:00:01 | Loss: 0.0048 | Acc:  0.5024
Processing |################################| (493/493) Data: 0.000133s | Batch: 0.141s | Total: 0:01:09 | ETA: 0:00:01 | Loss: 0.0064 | Acc:  0.5538

Epoch: 4 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000276s | Batch: 0.277s | Total: 0:20:33 | ETA: 0:00:01 | Loss: 0.0046 | Acc:  0.5604
Processing |################################| (493/493) Data: 0.000131s | Batch: 0.133s | Total: 0:01:05 | ETA: 0:00:01 | Loss: 0.0055 | Acc:  0.6337

Epoch: 5 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000401s | Batch: 0.286s | Total: 0:20:35 | ETA: 0:00:01 | Loss: 0.0044 | Acc:  0.6009
Processing |################################| (493/493) Data: 0.000134s | Batch: 0.130s | Total: 0:01:04 | ETA: 0:00:01 | Loss: 0.0049 | Acc:  0.6572

Epoch: 6 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000247s | Batch: 0.283s | Total: 0:20:32 | ETA: 0:00:01 | Loss: 0.0043 | Acc:  0.6289
Processing |################################| (493/493) Data: 0.000095s | Batch: 0.131s | Total: 0:01:04 | ETA: 0:00:01 | Loss: 0.0046 | Acc:  0.6670

Epoch: 7 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000177s | Batch: 0.252s | Total: 0:20:42 | ETA: 0:00:01 | Loss: 0.0042 | Acc:  0.6469
Processing |################################| (493/493) Data: 0.000153s | Batch: 0.128s | Total: 0:01:03 | ETA: 0:00:01 | Loss: 0.0046 | Acc:  0.6934

Epoch: 8 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000222s | Batch: 0.282s | Total: 0:20:23 | ETA: 0:00:01 | Loss: 0.0041 | Acc:  0.6661
Processing |################################| (493/493) Data: 0.000157s | Batch: 0.130s | Total: 0:01:04 | ETA: 0:00:01 | Loss: 0.0056 | Acc:  0.6942

Epoch: 9 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000262s | Batch: 0.276s | Total: 0:20:26 | ETA: 0:00:01 | Loss: 0.0040 | Acc:  0.6812
Processing |################################| (493/493) Data: 0.000144s | Batch: 0.149s | Total: 0:01:13 | ETA: 0:00:01 | Loss: 0.0068 | Acc:  0.6930

Epoch: 10 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000206s | Batch: 0.254s | Total: 0:20:36 | ETA: 0:00:01 | Loss: 0.0039 | Acc:  0.6923
Processing |################################| (493/493) Data: 0.000211s | Batch: 0.149s | Total: 0:01:13 | ETA: 0:00:01 | Loss: 0.0076 | Acc:  0.7049

Epoch: 11 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000268s | Batch: 0.290s | Total: 0:20:27 | ETA: 0:00:01 | Loss: 0.0039 | Acc:  0.7046
Processing |################################| (493/493) Data: 0.000173s | Batch: 0.142s | Total: 0:01:09 | ETA: 0:00:01 | Loss: 0.0095 | Acc:  0.7010

Epoch: 12 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000269s | Batch: 0.288s | Total: 0:20:57 | ETA: 0:00:01 | Loss: 0.0038 | Acc:  0.7108
Processing |################################| (493/493) Data: 0.000177s | Batch: 0.136s | Total: 0:01:07 | ETA: 0:00:01 | Loss: 0.0138 | Acc:  0.6223

Epoch: 13 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000226s | Batch: 0.253s | Total: 0:20:23 | ETA: 0:00:01 | Loss: 0.0037 | Acc:  0.7201
Processing |################################| (493/493) Data: 0.000237s | Batch: 0.140s | Total: 0:01:08 | ETA: 0:00:01 | Loss: 0.0221 | Acc:  0.5394

Epoch: 14 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000273s | Batch: 0.293s | Total: 0:20:31 | ETA: 0:00:01 | Loss: 0.0037 | Acc:  0.7289
Processing |################################| (493/493) Data: 0.000174s | Batch: 0.130s | Total: 0:01:04 | ETA: 0:00:01 | Loss: 0.0315 | Acc:  0.3212

Epoch: 15 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000453s | Batch: 0.321s | Total: 0:20:47 | ETA: 0:00:01 | Loss: 0.0036 | Acc:  0.7355
Processing |################################| (493/493) Data: 0.000147s | Batch: 0.149s | Total: 0:01:13 | ETA: 0:00:01 | Loss: 0.0528 | Acc:  0.0971

Epoch: 16 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000270s | Batch: 0.280s | Total: 0:20:42 | ETA: 0:00:01 | Loss: 0.0036 | Acc:  0.7417
Processing |################################| (493/493) Data: 0.000178s | Batch: 0.129s | Total: 0:01:03 | ETA: 0:00:01 | Loss: 0.0900 | Acc:  0.0151

Epoch: 17 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000268s | Batch: 0.289s | Total: 0:20:28 | ETA: 0:00:01 | Loss: 0.0035 | Acc:  0.7481
Processing |################################| (493/493) Data: 0.000145s | Batch: 0.148s | Total: 0:01:13 | ETA: 0:00:01 | Loss: 0.1890 | Acc:  0.0089

Epoch: 18 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000220s | Batch: 0.276s | Total: 0:20:24 | ETA: 0:00:01 | Loss: 0.0035 | Acc:  0.7525
Processing |################################| (493/493) Data: 0.000082s | Batch: 0.136s | Total: 0:01:06 | ETA: 0:00:01 | Loss: 0.3065 | Acc:  0.0000

Epoch: 19 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000186s | Batch: 0.275s | Total: 0:20:02 | ETA: 0:00:01 | Loss: 0.0035 | Acc:  0.7589
Processing |################################| (493/493) Data: 0.000080s | Batch: 0.135s | Total: 0:01:06 | ETA: 0:00:01 | Loss: 1.0547 | Acc:  0.0015

Epoch: 20 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000127s | Batch: 0.240s | Total: 0:20:19 | ETA: 0:00:01 | Loss: 0.0034 | Acc:  0.7641
Processing |################################| (493/493) Data: 0.000147s | Batch: 0.130s | Total: 0:01:03 | ETA: 0:00:01 | Loss: 1.7841 | Acc:  0.0019

Epoch: 21 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000245s | Batch: 0.271s | Total: 0:20:13 | ETA: 0:00:01 | Loss: 0.0034 | Acc:  0.7690
Processing |################################| (493/493) Data: 0.000080s | Batch: 0.128s | Total: 0:01:02 | ETA: 0:00:01 | Loss: 4.3475 | Acc:  0.0000

Epoch: 22 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000315s | Batch: 0.295s | Total: 0:20:08 | ETA: 0:00:01 | Loss: 0.0034 | Acc:  0.7716
Processing |################################| (493/493) Data: 0.000086s | Batch: 0.136s | Total: 0:01:07 | ETA: 0:00:01 | Loss: 11.9544 | Acc:  0.0029

Epoch: 23 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000165s | Batch: 0.261s | Total: 0:20:04 | ETA: 0:00:01 | Loss: 0.0034 | Acc:  0.7757
Processing |################################| (493/493) Data: 0.000140s | Batch: 0.141s | Total: 0:01:09 | ETA: 0:00:01 | Loss: 22.9730 | Acc:  0.0000

Epoch: 24 | LR: 0.00025000
Processing |################################| (3708/3708) Data: 0.000216s | Batch: 0.267s | Total: 0:20:06 | ETA: 0:00:01 | Loss: 0.0034 | Acc:  0.7793
Processing |################################| (493/493) Data: 0.000123s | Batch: 0.137s | Total: 0:01:07 | ETA: 0:00:01 | Loss: 141.7624 | Acc:  0.0000

dongzhuoyao avatar Jan 02 '18 08:01 dongzhuoyao

Hmm, it seems this problem cannot be reproduced. Would you mind training again and checking whether everything goes well?

bearpaw avatar Jan 03 '18 16:01 bearpaw

Hey, I've run into exactly the same problem. I tried to train a 2-stack HG network with the original code and params. I guess this is caused by strong over-fitting, but I am not sure why the overfitting occurs. Do you have any idea? Here's the log: log.txt

gdwei avatar Jan 15 '18 02:01 gdwei

Hi @gdwei @bearpaw, I have found the same problem on my side. I used the model with 8 hourglass modules.

Epoch   LR      Train Loss      Val Loss        Train Acc       Val Acc
1.000000        0.000250        0.006231        0.008109        0.194155        0.332968
2.000000        0.000250        0.005188        0.006057        0.387743        0.477342
3.000000        0.000250        0.004838        0.005032        0.502596        0.584106
4.000000        0.000250        0.004606        0.004787        0.562090        0.629260
5.000000        0.000250        0.004426        0.004789        0.600115        0.638421
6.000000        0.000250        0.004286        0.004692        0.627019        0.674266
7.000000        0.000250        0.004173        0.004733        0.649596        0.681682
8.000000        0.000250        0.004089        0.005544        0.662832        0.644043
9.000000        0.000250        0.004001        0.005081        0.680730        0.703755
10.000000       0.000250        0.003925        0.005816        0.692677        0.705782
11.000000       0.000250        0.003865        0.005736        0.702876        0.713184
12.000000       0.000250        0.003804        0.007214        0.713316        0.689739
13.000000       0.000250        0.003744        0.009516        0.722215        0.716273
14.000000       0.000250        0.003682        0.016769        0.731847        0.655829
15.000000       0.000250        0.003640        0.026813        0.735956        0.637782
16.000000       0.000250        0.003587        0.033836        0.743873        0.287533
17.000000       0.000250        0.003552        0.055812        0.747483        0.110421
18.000000       0.000250        0.003506        0.090679        0.754163        0.026939
19.000000       0.000250        0.003469        0.246852        0.760248        0.052983
20.000000       0.000250        0.003439        0.478084        0.763902        0.020653

This is my log.

xizero00 avatar Jan 22 '18 02:01 xizero00

@djangogo For me, setting a smaller learning rate helps; for example, you may set it to around 1e-5. Other learning-rate scheduling tricks could also be helpful.
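
For reference, a minimal sketch of lowering the learning rate (not the repo's exact code; it assumes an RMSprop optimizer as in the original hourglass training, with a stand-in model, so adapt it to whatever main.py actually builds):

import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the stacked hourglass model
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-5)  # instead of the default 2.5e-4

# Or shrink the LR of an already-constructed optimizer in place:
for group in optimizer.param_groups:
    group['lr'] = 1e-5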

gdwei avatar Jan 22 '18 05:01 gdwei

Hello, it looks like I also have the same problem:

Processing |################################| (944/944) Data: 0.000173s | Batch: 0.141s | Total: 0:02:12 | ETA: 0:00:01 | Loss: 4.7316 | Acc:  0.0011

Epoch: 3 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.001675s | Batch: 0.322s | Total: 2:39:51 | ETA: 0:00:01 | Loss: 0.0036 | Acc:  0.5417
Processing |################################| (944/944) Data: 0.000312s | Batch: 0.147s | Total: 0:02:18 | ETA: 0:00:01 | Loss: 5096.5776 | Acc:  0.0000

Epoch: 4 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.001971s | Batch: 0.333s | Total: 2:39:24 | ETA: 0:00:01 | Loss: 0.0036 | Acc:  0.5267
Processing |################################| (944/944) Data: 0.000199s | Batch: 0.148s | Total: 0:02:19 | ETA: 0:00:01 | Loss: 53171.3798 | Acc:  0.0021

Epoch: 5 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.000199s | Batch: 0.324s | Total: 2:40:41 | ETA: 0:00:01 | Loss: 0.0035 | Acc:  0.5406
Processing |################################| (944/944) Data: 0.000270s | Batch: 0.147s | Total: 0:02:18 | ETA: 0:00:01 | Loss: 166093.0824 | Acc:  0.0000

Epoch: 6 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.001795s | Batch: 0.326s | Total: 2:39:21 | ETA: 0:00:01 | Loss: 0.0035 | Acc:  0.5556
Processing |################################| (944/944) Data: 0.000197s | Batch: 0.144s | Total: 0:02:16 | ETA: 0:00:01 | Loss: 808754.2019 | Acc:  0.0001

Epoch: 7 | LR: 0.00025000
Processing |################################| (22333/22333) Data: 0.002017s | Batch: 0.342s | Total: 2:40:26 | ETA: 0:00:01 | Loss: 0.0035 | Acc:  0.5202
Processing |################################| (944/944) Data: 0.000228s | Batch: 0.147s | Total: 0:02:18 | ETA: 0:00:01 | Loss: 377698.3226 | Acc:  0.0000

Epoch: 8 | LR: 0.00025000
Processing |##########                      | (7420/22333) Data: 0.002047s | Batch: 0.424s | Total: 0:53:10 | ETA: 1:47:54 | Loss: 0.0038 | Acc:  0.3522^CProcess Process-15:

salihkaragoz avatar Jan 22 '18 08:01 salihkaragoz

@gdwei could you report your result when you set lr=1e-5?

dongzhuoyao avatar Jan 23 '18 06:01 dongzhuoyao

In our lab, two researchers tried to run the code, but only one of them had this problem. We kept re-training over and over, but the result was similar. We used the same code (without modifying anything) and the same data (copied from one environment to the other). Does anybody have an idea?

Environment with the problem: Ubuntu 16.04, Python 2.7.13, PyTorch 0.3.0, OpenCV 3.3.0, 1080Ti

Environment without the problem: Ubuntu 16.04, Python 2.7.12, PyTorch 0.3.0, OpenCV 3.3.0, GTX TITAN

(Added) In the environment that has the problem, the hourglass trained successfully with learning rate 1e-4. Please check.

ben-park avatar Feb 01 '18 07:02 ben-park

Hey guys, I have the same problem too. If I finetune the 8-stack hg network on a new dataset (about 30K images), the validation accuracy drops dramatically after only 4 epochs.

I use lr 2.5e-4

I am trying 1e-4; let's see what happens.


I use 2.5e-5 now, and I only restore the bottom 4 stacks to start the 8-stack training. It seems the overfitting still exists, but I think the problem is actually not as serious as the printed accuracy suggests. I suspect the validation accuracy drops so disastrously because this code uses a threshold when calculating accuracy:

def dist_acc(dists, thr=0.5):
    ''' Return percentage below threshold while ignoring values with a -1 '''
    if dists.ne(-1).sum() > 0:
        return dists.le(thr).eq(dists.ne(-1)).sum()*1.0 / dists.ne(-1).sum()
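
A quick numeric check of this thresholded metric (a sketch with made-up distances; the dist_acc definition above is repeated so the snippet runs on its own):

import torch

def dist_acc(dists, thr=0.5):
    '''Return percentage below threshold while ignoring values with a -1 (as quoted above).'''
    if dists.ne(-1).sum() > 0:
        return dists.le(thr).eq(dists.ne(-1)).sum() * 1.0 / dists.ne(-1).sum()

dists = torch.tensor([0.2, 0.7, -1.0, 0.4])   # -1 marks joints excluded from evaluation
print(dist_acc(dists))             # 2 of the 3 valid joints are within thr=0.5 -> ~0.667
print(dist_acc(dists, thr=0.1))    # a stricter threshold -> 0.0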

rockeyben avatar Apr 19 '18 09:04 rockeyben

The solution @xingyizhou mentioned works for me. In the environment where PyTorch 0.4.0 causes the problem, I reinstalled PyTorch 0.2.0, and the training finally runs to the end without any problem.

However, I still cannot understand why PyTorch 0.4.0 doesn't work in certain computing environments with a 1080Ti or TITAN X.

ben-park avatar May 14 '18 00:05 ben-park

I use the same pretrained model and test it with different testing batch sizes, and surprisingly I get very different precision rates. The smaller the testing batch size, the higher the precision rate: with a batch size of 2 I got 80+%, but only 50+% with a batch size of 6 or 8. This seems very abnormal to me, as I thought the result shouldn't depend on the batch size.

gdjmck avatar May 30 '18 06:05 gdjmck

I'm experiencing the same issue running the MPII example with PyTorch 0.3.1, a 1080, and default parameters except stacks=1.

wpeebles avatar Jun 07 '18 15:06 wpeebles

Hi all @dongzhuoyao @gdwei @djangogo @salihkaragoz @Ben-Park @gdjmck @wisp5 @rockeyben, thanks for the reports! In short, downgrading PyTorch to 0.1.12 will resolve this bug, although that is not an elegant way to do so. I have investigated this bug for some time (see https://github.com/xingyizhou/pytorch-pose-hg-3d/issues/16) but it is still unresolved. Recently I found this bug also occurs in other architectures besides HourglassNet (but still with dense outputs) on PyTorch versions > 0.1.12, while v0.1.12 always works fine. I also found a similar bug report at https://discuss.pytorch.org/t/model-eval-gives-incorrect-loss-for-model-with-batchnorm-layers/7561/2, which also uses an MSE/L1 loss.

A natural conjecture is that the bug comes from PyTorch's BN implementation after 0.1.12 and occurs when applying a network with dense output and BN layers (though it is not reliably reproducible). The bug is less likely to come from the data processing or the hourglass implementation, since this repo and my implementation (https://github.com/xingyizhou/pytorch-pose-hg-3d/tree/2D) are independent. Please correct me if you have a counterexample or make more progress on this bug. You are welcome to discuss it further with me by dropping me an email at [email protected]. Thanks!
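
As a side note, here is a tiny sketch of my own (not from this repo or thread) of the kind of check behind this conjecture: compare a BatchNorm network's output in train() vs eval() mode. If the running statistics are off, or the cuDNN BN path misbehaves, the eval-mode output, and hence the validation loss, can diverge badly from the train-mode one.

import torch
import torch.nn as nn

# Tiny dense-output network with BN, standing in for one hourglass stage.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
x = torch.randn(4, 3, 64, 64)

net.train()
with torch.no_grad():
    y_train = net(x)

net.eval()
with torch.no_grad():
    y_eval = net(x)

# A large gap here means the running mean/var used in eval() no longer
# matches the batch statistics used in train().
print((y_train - y_eval).abs().mean())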

xingyizhou avatar Jun 15 '18 05:06 xingyizhou

Hi all, as pointed out by @leoxiaobin, turning off cuDNN for the BN layers resolves the issue. This can be done by setting torch.backends.cudnn.enabled = False in main.py, which disables cuDNN for all layers and slows training down by about 1.5x, or by re-building PyTorch from source after hacking the cuDNN path in the BN layers: https://github.com/pytorch/pytorch/blob/e8536c08a16b533fe0a9d645dd4255513f9f4fdd/aten/src/ATen/native/Normalization.cpp#L46
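
Concretely, the global form of this workaround is a one-liner near the top of main.py (a minimal sketch; it just needs to run before any forward pass):

import torch

# Disable cuDNN globally so BatchNorm falls back to the native implementation.
# Per the comment above, training becomes roughly 1.5x slower but the
# validation collapse goes away.
torch.backends.cudnn.enabled = False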

xingyizhou avatar Jul 14 '18 06:07 xingyizhou

Thanks @xingyizhou, let me make this concrete. For recent PyTorch versions (such as 0.4.0 or 0.4.1), go to Python's package directory on your system and edit batch_norm in torch/nn/functional.py. On Windows that is PYTHONDIR/Lib/site-packages/torch/nn/functional.py; on Linux it is /usr/lib/python2.7/dist-packages/torch/nn/functional.py or /usr/lib/python3.5/dist-packages/torch/nn/functional.py. The original function is:

def batch_norm(input, running_mean, running_var, weight=None, bias=None,
               training=False, momentum=0.1, eps=1e-5):
    r"""Applies Batch Normalization for each channel across a batch of data.

    See :class:`~torch.nn.BatchNorm1d`, :class:`~torch.nn.BatchNorm2d`,
    :class:`~torch.nn.BatchNorm3d` for details.
    """
    if training:
        size = list(input.size())
        if reduce(mul, size[2:], size[0]) == 1:
            raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
    return torch.batch_norm(
        input, weight, bias, running_mean, running_var,
        training, momentum, eps, torch.backends.cudnn.enabled
    )

and change it to:

def batch_norm(input, running_mean, running_var, weight=None, bias=None,
               training=False, momentum=0.1, eps=1e-5):
    r"""Applies Batch Normalization for each channel across a batch of data.

    See :class:`~torch.nn.BatchNorm1d`, :class:`~torch.nn.BatchNorm2d`,
    :class:`~torch.nn.BatchNorm3d` for details.
    """
    if training:
        size = list(input.size())
        if reduce(mul, size[2:], size[0]) == 1:
            raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
    return torch.batch_norm(
        input, weight, bias, running_mean, running_var,
        training, momentum, eps, False
    )

Hope this helps others. I also provide a bash script to patch PyTorch automatically (if you use Python 3.x, just adjust the paths accordingly):

#!/usr/bin/env bash
set -e
# patch for BN for pytorch after v0.1.12

PYPATH=/usr/local/lib/python2.7/dist-packages
PYTORCH_VERSION=`python -c "import torch as t; print(t.__version__)"`

if [ ! -e "${PYPATH}/torch/nn/functional.py.bak" ] && [ -e "${PYPATH}/torch/nn/functional.py" ]; then
    # backup
    sudo cp ${PYPATH}/torch/nn/functional.py ${PYPATH}/torch/nn/functional.py.bak
    # patch pytorch
    if [ "${PYTORCH_VERSION}" == "0.4.0" ]; then
        # for pytorch v0.4.0
        sudo sed -i "1194s/torch\.backends\.cudnn\.enabled/False/g" ${PYPATH}/torch/nn/functional.py
    elif [ "${PYTORCH_VERSION}" == "0.4.1" ]; then
        # for pytorch v0.4.1
        sudo sed -i "1254s/torch\.backends\.cudnn\.enabled/False/g" ${PYPATH}/torch/nn/functional.py
    fi
    echo "patch pytorch ${PYTORCH_VERSION} successfully"
else
    echo "You have patched the pytorch!"
fi
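
If you would rather not edit or sed the installed functional.py, here is an alternative sketch of my own (an assumption, not from this thread): since the quoted F.batch_norm forwards the cuDNN flag as the last argument of torch.batch_norm, you can monkey-patch torch.batch_norm from your own main.py so only your project is affected.

import torch

_orig_batch_norm = torch.batch_norm

def _batch_norm_no_cudnn(input, weight, bias, running_mean, running_var,
                         training, momentum, eps, cudnn_enabled):
    # Ignore the cudnn_enabled flag passed in by F.batch_norm and force the
    # non-cuDNN kernel, mirroring the file patch above.
    return _orig_batch_norm(input, weight, bias, running_mean, running_var,
                            training, momentum, eps, False)

torch.batch_norm = _batch_norm_no_cudnn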

xizero00 avatar Sep 11 '18 08:09 xizero00

Huge thanks for your fix, @xizero00. I was running into this issue as well, and your patch seems to help.

moizsaifee avatar Nov 13 '18 04:11 moizsaifee

Hello everyone, I have a problem when training the model from scratch. The model is a 2-stack HG network with the original code and params, and the lr is 2.5e-4. Can somebody help me solve the problem? Thanks!

Epoch   LR      Train Loss      Val Loss        Train Acc       Val Acc
1.000000        0.000250        0.000820        0.001804        0.012718        0.016598
2.000000        0.000250        0.000605        0.001752        0.023349        0.018282
3.000000        0.000250        0.000601        0.003543        0.026636        0.016567
4.000000        0.000250        0.000601        0.002009        0.030169        0.023730
5.000000        0.000250        0.000605        0.007554        0.021984        0.024761
6.000000        0.000250        0.000593        0.001323        0.021573        0.023211
6.000000        0.000250        0.000581        0.000750        0.030056        0.037204
7.000000        0.000250        0.000579        0.001724        0.062914        0.042289
8.000000        0.000250        0.000574        0.006119        0.078612        0.029493
9.000000        0.000250        0.000568        0.002073        0.092811        0.032917
10.000000       0.000250        0.000565        0.002764        0.103415        0.082355
11.000000       0.000250        0.000559        0.004456        0.118935        0.069083
12.000000       0.000250        0.000554        0.001235        0.136579        0.111532
13.000000       0.000250        0.000551        0.001291        0.157845        0.139160
14.000000       0.000250        0.000546        0.000833        0.172080        0.187071
15.000000       0.000250        0.000540        0.000677        0.188202        0.137926
16.000000       0.000250        0.000536        0.000822        0.204400        0.126236
17.000000       0.000250        0.000529        0.007549        0.223867        0.023203
18.000000       0.000250        0.000514        0.001865        0.248268        0.100743
19.000000       0.000250        0.000500        0.001187        0.281679        0.162482
20.000000       0.000250        0.000491        0.002932        0.311082        0.045916
21.000000       0.000250        0.000484        0.001115        0.335782        0.107354
22.000000       0.000250        0.000476        0.009399        0.356727        0.008605
23.000000       0.000250        0.000470        0.000646        0.369132        0.005161
24.000000       0.000250        0.000463        0.003118        0.386070        0.022706
25.000000       0.000250        0.000457        0.000577        0.399309        0.018974
26.000000       0.000250        0.000451        0.000582        0.417991        0.019046
27.000000       0.000250        0.000446        0.001388        0.432328        0.010517
28.000000       0.000250        0.000441        0.000769        0.444523        0.019780
29.000000       0.000250        0.000436        0.000739        0.456251        0.014726
30.000000       0.000250        0.000432        0.001276        0.469130        0.056828
31.000000       0.000250        0.000428        0.001579        0.478094        0.023356
32.000000       0.000250        0.000423        0.000569        0.491334        0.006764
33.000000       0.000250        0.000420        0.000907        0.499913        0.045504
34.000000       0.000250        0.000416        0.000600        0.508063        0.101544
35.000000       0.000250        0.000412        0.000581        0.516998        0.077281
36.000000       0.000250        0.000408        0.000618        0.525647        0.047941
37.000000       0.000250        0.000404        0.000635        0.534216        0.036322
38.000000       0.000250        0.000402        0.000749        0.539956        0.002505
39.000000       0.000250        0.000401        0.000553        0.542420        0.070047
40.000000       0.000250        0.000396        0.000551        0.552550        0.038140
41.000000       0.000250        0.000393        0.000577        0.558081        0.025742
42.000000       0.000250        0.000390        0.000560        0.564162        0.031878
43.000000       0.000250        0.000387        0.000560        0.569910        0.008823
44.000000       0.000250        0.000384        0.000576        0.575056        0.007596
45.000000       0.000250        0.000381        0.000549        0.581056        0.003241
46.000000       0.000250        0.000379        0.000550        0.584639        0.000000
47.000000       0.000250        0.000377        0.000561        0.589435        0.026145
48.000000       0.000250        0.000374        0.000548        0.593368        0.020744

stickOverCarrot avatar Mar 12 '19 15:03 stickOverCarrot

Hi @gdjmck, did you ever resolve the batch-size issue you described above? I am still suffering from it when finetuning and only get around 60+% val accuracy.

DNALuo avatar Jun 26 '19 16:06 DNALuo

Oh, it was something about the accuracy method: it only returned 1 or 0 for each batch sent in, so with a small batch size you have a higher probability of getting every sample right and scoring 1; otherwise you get 0.
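
To illustrate with made-up numbers: if a metric scores each batch as all-or-nothing and then averages over batches, the result depends directly on the batch size (a sketch, not the actual evaluation code):

# Hypothetical per-sample correctness flags for 8 validation samples.
correct = [1, 1, 1, 0, 1, 1, 0, 1]

def batch_metric(flags, batch_size):
    # Score each batch 1 only if every sample in it is correct, then average.
    batches = [flags[i:i + batch_size] for i in range(0, len(flags), batch_size)]
    return sum(all(b) for b in batches) / float(len(batches))

print(batch_metric(correct, 2))  # 0.5 -- 2 of the 4 small batches are fully correct
print(batch_metric(correct, 4))  # 0.0 -- neither batch of 4 is fully correct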

gdjmck avatar Jun 27 '19 03:06 gdjmck

So it was caused by your own evaluation code being wrong, not by other reasons?

DNALuo avatar Jun 27 '19 18:06 DNALuo

Yeah, I was using the evaluation code from the source code back then; maybe it conflicted with the PyTorch version, and it worked all right after I fixed it. Does your performance on the test set differ with different batch sizes?

gdjmck avatar Jun 28 '19 03:06 gdjmck

Yes, but I am finetuning the hourglass model on a different dataset. My model works well on MPII but does much worse when finetuned on that dataset, at least a big gap compared with the results reported by other researchers. Their code is written in Torch, though, so now I am searching for anything that can help.

DNALuo avatar Jun 28 '19 04:06 DNALuo