
Training logs for Swin-B/S/L

Open lcmeng opened this issue 3 years ago • 12 comments

Hi authors,

Following the official training command below, I observed unstable training loss and accuracy around epoch 20.

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py \
  --cfg configs/swin_base_patch4_window7_224.yaml --data-path <imagenet-path> --batch-size 64 \
  --accumulation-steps 2 [--use-checkpoint]
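
(For reference, this corresponds to an effective batch size of 64 images/GPU × 8 GPUs × 2 accumulation steps = 1024 images per optimizer step.)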

Can you please share the training logs for Swin-B? And if more logs are available, please consider sharing them as well.

TIA!

lcmeng avatar Dec 06 '21 19:12 lcmeng

Has anyone been able to reproduce the results of larger architectures? I contacted the authors about two weeks ago, but it seems like a dead end.

lcmeng avatar Dec 19 '21 00:12 lcmeng

Hi @lcmeng, training logs are available here: https://github.com/microsoft/Swin-Transformer/blob/b05e6214a37d33846903585c9e83b694ef411587/README.md?plain=1#L79-L81

zeliu98 avatar Dec 20 '21 16:12 zeliu98

Thank you. But why are all the configurations intentionally removed from the log files? For example, in the previously shared log file here, one can find the configurations of the run.

As the original post of this thread mentioned, the recommended setup for Swin-B @ ImageNet 1K does not converge. Can you please share the full log to help reproduce your results? If it requires a different setup, can you please also share exactly how? Thanks.

lcmeng avatar Dec 21 '21 23:12 lcmeng

The logs for Swin-S/B were generated by an earlier version of the code, which didn't write the configs. However, the configs are the same as the ones provided here. 16 V100 GPUs were used to train Swin-S and Swin-B.

For your case (8 GPUs with accumulation-steps=2), I have tried the same command as you. It has finished 95 epochs, and training is very stable. You can refer to this log: log_rank0.txt

If you still face the problem of training instability, I suggest:

  • Check the versions of your Python packages. You can try the nvcr-21.05 docker image, which I have verified.
  • If you are using a custom dataset, try reducing the learning rate or increasing the warmup epochs.
  • Change the AMP opt level to O0, since most of the instability issues are caused by float16; see the sketch after this list.
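
A minimal sketch of the O0 suggestion, assuming apex is installed (the model and optimizer here are toy placeholders, not Swin's own code):

import torch
from apex import amp

# toy model/optimizer standing in for Swin's
model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# opt_level="O0" disables mixed precision entirely (pure FP32);
# the default "O1" casts selected ops to float16
model, optimizer = amp.initialize(model, optimizer, opt_level="O0")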

If none of the above suggestions work, please share your log with me so I can take a closer look.

zeliu98 avatar Dec 24 '21 08:12 zeliu98

Hi @zeliu98, thanks for your great work! I met a similar problem to @lcmeng's.

On four V100 GPUs, I used the default command below. The network converges, but the final accuracy is 82.1%, about 1.4% lower than the reported 83.5%.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --cfg configs/swin_base_patch4_window7_224.yaml --data-path /dataset/imagenet --batch-size 128

The logfile of this training is here.

Could you please take a look and advise on how to reproduce the reported accuracy? Thank you very much!

he-y avatar Dec 26 '21 10:12 he-y

Hi @he-y, the total batch size should be 1024, but yours is 512. Try this command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --cfg configs/swin_base_patch4_window7_224.yaml --data-path /dataset/imagenet --batch-size 128  --accumulation-steps 2
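
For context, a sketch of the linear LR scaling rule (an assumption based on main.py's defaults: BASE_LR = 5e-4 scaled against a reference total batch of 512 and multiplied by the accumulation steps):

def scaled_lr(batch_per_gpu, n_gpus, accum_steps, base_lr=5e-4, ref_batch=512):
    # effective batch size = per-GPU batch x GPUs x accumulation steps
    total_batch = batch_per_gpu * n_gpus * accum_steps
    return base_lr * total_batch / ref_batch

print(scaled_lr(128, 4, 1))  # 512 images/step  -> 5.0e-4 (half the intended LR)
print(scaled_lr(128, 4, 2))  # 1024 images/step -> 1.0e-3 (the recommended setting)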

zeliu98 avatar Dec 27 '21 06:12 zeliu98

Hi @zeliu98 , thank you for the detailed reply. I've listed the installed dependencies for Swin experiments. They seem to fully agree with the requirements. Can you spot any inconsistencies? TIA.

(swin) ubuntu@ip-10-0-0-94:~$ conda list
# packages in environment at /home/ubuntu/anaconda3/envs/swin:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
apex                      0.1                      pypi_0    pypi
blas                      1.0                         mkl  
ca-certificates           2021.5.30            ha878542_0    conda-forge
certifi                   2021.5.30        py37h89c1867_0    conda-forge
cudatoolkit               10.1.243             h036e899_8    conda-forge
freetype                  2.10.4               h0708190_1    conda-forge
intel-openmp              2021.3.0          h06a4308_3350  
jpeg                      9b                   h024ee3a_2  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 11.1.0               hc902ee8_8    conda-forge
libgomp                   11.1.0               hc902ee8_8    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libstdcxx-ng              11.1.0               h56837e0_8    conda-forge
libtiff                   4.2.0                h85742a9_0  
libuv                     1.42.0               h7f98852_0    conda-forge
libwebp-base              1.2.0                h7f98852_2    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
mkl                       2021.3.0           h06a4308_520  
mkl-service               2.4.0            py37h5e8e339_0    conda-forge
mkl_fft                   1.3.0            py37h42c9631_2  
mkl_random                1.2.2            py37h219a48f_0    conda-forge
ncurses                   6.2                  h58526e2_4    conda-forge
ninja                     1.10.2               h4bd325d_0    conda-forge
numpy                     1.20.3           py37hf144106_0  
numpy-base                1.20.3           py37h74d4b33_0  
olefile                   0.46               pyh9f0ad1d_1    conda-forge
opencv-python             4.4.0.46                 pypi_0    pypi
openjpeg                  2.4.0                hb52868f_1    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
pillow                    8.3.1            py37h2c7a002_0  
pip                       21.2.3             pyhd8ed1ab_0    conda-forge
python                    3.7.10          hffdb5ce_100_cpython    conda-forge
python_abi                3.7                     2_cp37m    conda-forge
pytorch                   1.7.1           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
pyyaml                    5.4.1                    pypi_0    pypi
readline                  8.1                  h46c0cb4_0    conda-forge
setuptools                49.6.0           py37h89c1867_3    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.36.0               h9cd32fc_0    conda-forge
termcolor                 1.1.0                    pypi_0    pypi
timm                      0.3.2                    pypi_0    pypi
tk                        8.6.10               h21135ba_1    conda-forge
torchvision               0.8.2                py37_cu101    pytorch
typing_extensions         3.10.0.0           pyha770c72_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
yacs                      0.1.8                    pypi_0    pypi
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge
(swin) ubuntu@ip-10-0-0-94:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
(swin) ubuntu@ip-10-0-0-94:~$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
57:#define CUDNN_MAJOR 7
58-#define CUDNN_MINOR 6
59-#define CUDNN_PATCHLEVEL 5
--
61:#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
62-
63-#include "driver_types.h"
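
For a quicker check of the versions PyTorch itself sees at runtime (a sketch; it reports the bundled libraries rather than the system headers):

import torch, torchvision, timm
print(torch.__version__, torch.version.cuda)       # e.g. 1.7.1, 10.1
print(torch.backends.cudnn.version())              # cuDNN bundled with PyTorch
print(torchvision.__version__, timm.__version__)   # e.g. 0.8.2, 0.3.2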

lcmeng avatar Dec 29 '21 20:12 lcmeng

And about the recommended Nvidia docker image nvcr-21.05, does it not conflict with the recommended dependencies? For example, it contains CUDA 11.3 (vs. the recommended 10.1) and PyTorch 1.9.0 (vs. the recommended 1.7.1). Did I miss something here?

lcmeng avatar Dec 29 '21 20:12 lcmeng

@zeliu98 In the newly released logs for the larger Swin architectures, the typical amp loss-scaling message, i.e. "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to xyz", is not seen anywhere. Does that mean these models were trained solely in FP32, without any amp? In your experience, does turning off amp (setting it to O0) help sustain a higher learning rate? Thank you and happy new year.

lcmeng avatar Dec 31 '21 17:12 lcmeng

Hi @lcmeng, the model is trained using the default mixed precision (O1). We don't handle amp's logging, so the loss-scaling info is not written to the log file.
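
For context, a minimal sketch of an apex O1 training step (toy placeholders, not Swin's exact loop). The "Gradient overflow. Skipping step ..." messages are printed to stdout by apex's dynamic loss scaler, separately from Swin's logger:

import torch
from apex import amp

model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = model(torch.randn(4, 8).cuda()).sum()
optimizer.zero_grad()
# scale_loss multiplies the loss by the current scale; apex unscales the
# gradients before the step and skips the step on float16 overflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()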

Your environment seems correct, and I am not sure what is causing your problem. According to other users' feedback, installing apex from source can be error-prone, so I suspect there might be something wrong with your apex installation. The nvcr-21.05 docker image comes with apex preinstalled, so you can try it first. I have checked the CUDA and PyTorch versions, and they are fine.

Also, you can share your log with me so I can look into it further.

zeliu98 avatar Jan 01 '22 17:01 zeliu98

@zeliu98, thank you for the explanation. I've added some TensorBoard code to Swin to visualize the training. It seems the drop in accuracy near the peak LR is correlated with an explosion of the gradient norm.

Please see the attached screenshots. The LR appears doubled in the plots because accumulation-steps = 2; the effective setup is in fact the same as the recommended one.

(1) The trace of the gradient norm over global steps. It increases very aggressively after the initial "flat" phase. [screenshot]

(2) Max top-1 accuracy occurred at epoch 15. [screenshot]

(3) Using the recommended LR schedule. [screenshot]
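
For reference, a minimal sketch of the kind of TensorBoard logging added for these plots (the writer setup and tag names are hypothetical, not the exact code used):

import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/swin_base")

def log_step(model, lr, step):
    # total L2 norm of all parameter gradients, as plotted in (1)
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    grad_norm = torch.cat(grads).norm(2).item() if grads else 0.0
    writer.add_scalar("train/grad_norm", grad_norm, step)
    writer.add_scalar("train/lr", lr, step)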

lcmeng avatar Jan 28 '22 07:01 lcmeng

@lcmeng Hey, I'm wondering what grad_norm indicates here. I have seen people cite this metric in issues about the convergence of Swin. Could you please give a hint? Thanks.

BitCalSaul avatar Feb 10 '24 05:02 BitCalSaul