Swin-Transformer
Training logs for Swin-B/S/L
Hi authors,
Following the official training command below, I observed unstable training loss and accuracy around epoch 20.
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py \
  --cfg configs/swin_base_patch4_window7_224.yaml --data-path <imagenet-path> --batch-size 64 \
  --accumulation-steps 2 [--use-checkpoint]
Can you please share the training logs for Swin-B? And if more logs are available, please consider sharing them as well.
TIA!
Has anyone been able to reproduce the results of larger architectures? I contacted the authors about two weeks ago, but it seems like a dead end.
Hi @lcmeng, training logs are available here: https://github.com/microsoft/Swin-Transformer/blob/b05e6214a37d33846903585c9e83b694ef411587/README.md?plain=1#L79-L81
Thank you. But why were the configurations removed from these log files? For example, the previously shared log file here includes the configuration of the run.
As the original post of this thread mentioned, the recommended setup for Swin-B on ImageNet-1K does not converge. Can you please share the full log to help reproduce your results? If a different setup is required, can you please also share exactly what it is? Thanks.
The logs of Swin-S/B were generated by an earlier version of the code, which didn't write the configs. However, the configs are the same as the ones provided here. 16 V100 GPUs were used to train Swin-S and Swin-B.
For your case (8 GPUs with accumulation-steps=2), I have tried the same command as you. It has finished 95 epochs so far and is very stable. You can refer to this log: log_rank0.txt
If you still face the problem of training instability, I suggest:
- Check the versions of your Python packages. You can try the nvcr-21.05 Docker image, which I have verified.
- If you are using a custom dataset, try reducing the learning rate or increasing the warmup epochs.
- Switch AMP to O0, since most of the instability issues are caused by float16; see the sketch after this list.
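As a minimal sketch of what O0 means here, assuming the apex-based AMP path used by this version of the repo (the model and optimizer below are placeholders, not the Swin training code):

# apex AMP opt levels: O1 (mixed precision with dynamic loss scaling) is the default
# in this version of the repo; O0 keeps everything in FP32 and avoids float16 overflow.
import torch
from apex import amp

model = torch.nn.Linear(128, 10).cuda()                     # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # placeholder optimizer
model, optimizer = amp.initialize(model, optimizer, opt_level="O0")  # pure FP32, no loss scaling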
If none of the above suggestions work, please share your log with me so I can take a closer look.
Hi @zeliu98, thanks for your great work! I ran into a similar problem to @lcmeng's.
On four V100 GPUs, I used the default command below. The network converges, but the accuracy is about 1% lower than reported: the final accuracy is 82.1% (the reported accuracy is 83.5%).
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --cfg configs/swin_base_patch4_window7_224.yaml --data-path /dataset/imagenet --batch-size 128
The logfile of this training is here.
Could you please take a look and advise on how to reproduce the reported accuracy? Thank you very much!
Hi @he-y, the total batch size should be 1024, but yours is 512. Try this command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --cfg configs/swin_base_patch4_window7_224.yaml --data-path /dataset/imagenet --batch-size 128 --accumulation-steps 2
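For reference, a quick arithmetic sketch of the effective batch size (illustrative, not part of the repo):

# Effective batch size = number of GPUs x per-GPU batch size x gradient-accumulation steps.
# The Swin ImageNet-1K recipes assume a total of 1024.
def effective_batch_size(num_gpus: int, batch_per_gpu: int, accumulation_steps: int = 1) -> int:
    return num_gpus * batch_per_gpu * accumulation_steps

print(effective_batch_size(4, 128))     # 512  -> the run above with ~1% lower accuracy
print(effective_batch_size(4, 128, 2))  # 1024 -> matches the recommended recipe
print(effective_batch_size(8, 64, 2))   # 1024 -> the 8-GPU command from the original post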
Hi @zeliu98 , thank you for the detailed reply. I've listed the installed dependencies for Swin experiments. They seem to fully agree with the requirements. Can you spot any inconsistencies? TIA.
(swin) ubuntu@ip-10-0-0-94:~$ conda list
# packages in environment at /home/ubuntu/anaconda3/envs/swin:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
apex 0.1 pypi_0 pypi
blas 1.0 mkl
ca-certificates 2021.5.30 ha878542_0 conda-forge
certifi 2021.5.30 py37h89c1867_0 conda-forge
cudatoolkit 10.1.243 h036e899_8 conda-forge
freetype 2.10.4 h0708190_1 conda-forge
intel-openmp 2021.3.0 h06a4308_3350
jpeg 9b h024ee3a_2
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 11.1.0 hc902ee8_8 conda-forge
libgomp 11.1.0 hc902ee8_8 conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libstdcxx-ng 11.1.0 h56837e0_8 conda-forge
libtiff 4.2.0 h85742a9_0
libuv 1.42.0 h7f98852_0 conda-forge
libwebp-base 1.2.0 h7f98852_2 conda-forge
lz4-c 1.9.3 h9c3ff4c_1 conda-forge
mkl 2021.3.0 h06a4308_520
mkl-service 2.4.0 py37h5e8e339_0 conda-forge
mkl_fft 1.3.0 py37h42c9631_2
mkl_random 1.2.2 py37h219a48f_0 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
ninja 1.10.2 h4bd325d_0 conda-forge
numpy 1.20.3 py37hf144106_0
numpy-base 1.20.3 py37h74d4b33_0
olefile 0.46 pyh9f0ad1d_1 conda-forge
opencv-python 4.4.0.46 pypi_0 pypi
openjpeg 2.4.0 hb52868f_1 conda-forge
openssl 1.1.1k h7f98852_0 conda-forge
pillow 8.3.1 py37h2c7a002_0
pip 21.2.3 pyhd8ed1ab_0 conda-forge
python 3.7.10 hffdb5ce_100_cpython conda-forge
python_abi 3.7 2_cp37m conda-forge
pytorch 1.7.1 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
pyyaml 5.4.1 pypi_0 pypi
readline 8.1 h46c0cb4_0 conda-forge
setuptools 49.6.0 py37h89c1867_3 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sqlite 3.36.0 h9cd32fc_0 conda-forge
termcolor 1.1.0 pypi_0 pypi
timm 0.3.2 pypi_0 pypi
tk 8.6.10 h21135ba_1 conda-forge
torchvision 0.8.2 py37_cu101 pytorch
typing_extensions 3.10.0.0 pyha770c72_0 conda-forge
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
yacs 0.1.8 pypi_0 pypi
zlib 1.2.11 h516909a_1010 conda-forge
zstd 1.4.9 ha95c52a_0 conda-forge
(swin) ubuntu@ip-10-0-0-94:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
(swin) ubuntu@ip-10-0-0-94:~$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
57:#define CUDNN_MAJOR 7
58-#define CUDNN_MINOR 6
59-#define CUDNN_PATCHLEVEL 5
--
61:#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
62-
63-#include "driver_types.h"
Regarding the recommended NVIDIA Docker image nvcr-21.05: doesn't it conflict with the recommended dependencies? For example, it ships CUDA 11.3 (vs. the recommended 10.1) and PyTorch 1.9.0 (vs. the recommended 1.7.1). Did I miss something here?
@zeliu98 In the newly released logs for the larger Swin archs, the typical amp loss-scaling message, i.e. "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to xyz", is not seen anywhere. Does this mean these models were trained solely in FP32 without any amp? In your experience, does turning off amp (setting it to O0) help sustain a higher learning rate? Thank you and happy new year.
Hi @lcmeng, the model is trained with the default mixed precision (O1). We don't handle amp's logging, so the loss-scaling info is not written to the log file.
Your environment seems correct, and I am not sure what is causing your problem. According to other users' feedback, installing apex from source can be error-prone, so I suspect there might be something wrong with your apex build. The nvcr-21.05 Docker image comes with apex pre-installed, so you can try it first. I have checked that the CUDA and PyTorch versions are OK.
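If it helps, here is a quick sanity check that an apex source build includes its compiled extensions (illustrative; the extension module names assume apex was built with --cpp_ext --cuda_ext):

# Verify that apex's AMP API and its compiled extensions import cleanly.
# A source build without the C++/CUDA extensions may still import `amp`
# but fall back to Python-only code paths.
from apex import amp   # core AMP API
import apex_C          # C++ extension; ImportError suggests a Python-only build
import amp_C           # CUDA extension used by AMP's multi-tensor kernels
print("apex with compiled extensions is available")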
Also, please share your log with me so I can look into it further.
@zeliu98, thank you for the explanation. I've added some TensorBoard logging to Swin to visualize training (roughly the sketch shown after the captions below). It seems the drop in accuracy near the peak LR is correlated with an explosion of the gradient norm.
Please see the attached screenshots. The LR appears doubled due to accumulation-steps = 2; it is in fact the same as the recommended setup.
(1) The trace of the gradient norm over global steps. It increases sharply after the initial "flat" phase.
(2) The maximum top-1 accuracy occurred at epoch 15.
(3) The recommended LR schedule was used.
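For reference, the logging I added looks roughly like the sketch below (illustrative, not the exact patch; the tag names and log directory are made up):

# Sketch of the TensorBoard logging added to the training loop (illustrative).
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/swin_base")  # hypothetical log directory

def log_step(model, optimizer, loss, global_step):
    # Total L2 norm over all parameter gradients, the same quantity clip_grad_norm_ reports.
    grads = [p.grad.detach().norm() for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack(grads)).item() if grads else 0.0
    writer.add_scalar("train/loss", loss.item(), global_step)
    writer.add_scalar("train/grad_norm", grad_norm, global_step)
    writer.add_scalar("train/lr", optimizer.param_groups[0]["lr"], global_step)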
@lcmeng Hey, I'm wondering what the grad_norm metric actually indicates here. I have seen several people refer to it in issues about the convergence of Swin. Could you please give a hint? Thanks.