Swin-Transformer
zipped ImageNet processing scripts
Hi, can you provide processing scripts for zipped ImageNet ?
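For reference, a minimal sketch of what such a packing script might look like, assuming the zipped layout described in the repo's get_started.md (a train.zip that contains the class sub-folders, consumed together with a train_map.txt). The paths below are placeholders, and ZIP_STORED is used because the JPEGs are already compressed:

```python
import os
import zipfile

# Placeholder paths -- adjust to your own ImageNet layout.
SRC_DIR = "imagenet/train"      # standard layout: train/<wnid>/<image>.JPEG
DST_ZIP = "imagenet/train.zip"

# ZIP_STORED: skip re-compression so zipping is fast and random reads stay cheap.
with zipfile.ZipFile(DST_ZIP, "w", compression=zipfile.ZIP_STORED) as zf:
    for root, _, files in os.walk(SRC_DIR):
        for name in sorted(files):
            if not name.lower().endswith((".jpeg", ".jpg", ".png")):
                continue
            abs_path = os.path.join(root, name)
            # store paths relative to SRC_DIR, e.g. "n01440764/n01440764_10026.JPEG"
            arcname = os.path.relpath(abs_path, SRC_DIR)
            zf.write(abs_path, arcname)
```

The same approach works for val.zip; whether the archive paths should include the class-folder prefix is best confirmed against the loader in data/cached_image_folder.py.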
+1, I found that it needs more than 100 GB of memory during data preparation when training Swin-Transformer, which is surprising. Here is some information about this issue.
Conditions:
TAG: default
TEST:
  CROP: true
  SEQUENTIAL: false
THROUGHPUT_MODE: false
TRAIN:
  ACCUMULATION_STEPS: 0
  AUTO_RESUME: true
  BASE_LR: 0.0004375
  CLIP_GRAD: 5.0
  EPOCHS: 300
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    NAME: cosine
  MIN_LR: 4.3750000000000005e-06
  OPTIMIZER:
    BETAS:
    - 0.9
    - 0.999
    EPS: 1.0e-08
    MOMENTUM: 0.9
    NAME: adamw
  START_EPOCH: 0
  USE_CHECKPOINT: false
  WARMUP_EPOCHS: 20
  WARMUP_LR: 4.375e-07
  WEIGHT_DECAY: 0.05
global_rank 6 cached 0/1281167 takes 0.00s per block
global_rank 3 cached 0/1281167 takes 0.00s per block
global_rank 5 cached 0/1281167 takes 0.00s per block
global_rank 2 cached 0/1281167 takes 0.00s per block
global_rank 0 cached 0/1281167 takes 0.00s per block
global_rank 7 cached 0/1281167 takes 0.00s per block
global_rank 1 cached 0/1281167 takes 0.00s per block
global_rank 4 cached 0/1281167 takes 0.00s per block
global_rank 6 cached 128116/1281167 takes 52.54s per block
global_rank 5 cached 128116/1281167 takes 52.40s per block
global_rank 4 cached 128116/1281167 takes 51.70s per block
global_rank 7 cached 128116/1281167 takes 52.25s per block
global_rank 0 cached 128116/1281167 takes 52.32s per block
global_rank 2 cached 128116/1281167 takes 52.33s per block
global_rank 3 cached 128116/1281167 takes 52.48s per block
global_rank 1 cached 128116/1281167 takes 52.20s per block
global_rank 0 cached 256232/1281167 takes 25.78s per block
global_rank 7 cached 256232/1281167 takes 25.78s per block
global_rank 6 cached 256232/1281167 takes 25.78s per block
global_rank 3 cached 256232/1281167 takes 25.78s per block
global_rank 4 cached 256232/1281167 takes 25.78s per block
global_rank 5 cached 256232/1281167 takes 25.78s per block
global_rank 2 cached 256232/1281167 takes 25.78s per block
global_rank 1 cached 256232/1281167 takes 25.78s per block
The counter goes up to the total number of train/val images. Does it cache the image data or just the file list? My memory runs out, so I think it is caching the image data.
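For what it's worth, a rough estimate of how much host RAM that caching would need (assuming the loader keeps the raw JPEG bytes of each rank's shard in memory, and that all 8 ranks share one node) can be computed from the zip itself:

```python
import zipfile

TRAIN_ZIP = "imagenet/train.zip"   # placeholder path
RANKS_PER_NODE = 8                 # GPUs sharing the same host memory

with zipfile.ZipFile(TRAIN_ZIP) as zf:
    # compress_size = bytes actually stored in the archive; this is roughly
    # what ends up in RAM if the raw (still-encoded) images are cached.
    total = sum(info.compress_size for info in zf.infolist())

print(f"zip payload          : {total / 2**30:.1f} GiB")
print(f"per-rank shard       : {total / RANKS_PER_NODE / 2**30:.1f} GiB")
print(f"whole node (8 ranks) : {total / 2**30:.1f} GiB plus Python/PIL overhead")
```

The ImageNet-1k training JPEGs total roughly 140 GB, which is consistent with the ~100 GB of cgroup RSS seen in the dmesg output below at about 70% of the caching progress.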
The failure looks like this:
global_rank 3 cached 640580/1281167 takes 28.99s per block
global_rank 0 cached 768696/1281167 takes 27.38s per block
global_rank 1 cached 768696/1281167 takes 27.37s per block
global_rank 2 cached 768696/1281167 takes 27.38s per block
global_rank 4 cached 768696/1281167 takes 27.38s per block
global_rank 7 cached 768696/1281167 takes 27.38s per block
global_rank 5 cached 768696/1281167 takes 27.38s per block
global_rank 6 cached 768696/1281167 takes 27.38s per block
global_rank 2 cached 896812/1281167 takes 27.98s per block
global_rank 6 cached 896812/1281167 takes 27.98s per block
global_rank 7 cached 896812/1281167 takes 27.98s per block
global_rank 4 cached 896812/1281167 takes 27.98s per block
global_rank 0 cached 896812/1281167 takes 27.99s per block
global_rank 1 cached 896812/1281167 takes 27.99s per block
Traceback (most recent call last):
File "/root/anaconda3/envs/swin/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/anaconda3/envs/swin/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/swin/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in
Output of the following command:
dmesg -T | grep -E -i -B100 'killed process'
[Fri May 13 22:25:33 2022] Memory cgroup stats for /docker/a41cbab8ac5e0d3680e664e26bcc8890070bf4607a0744f00051fcc046a9ca1b: cache:345232KB rss:104203124KB rss_huge:0KB shmem:342340KB mapped_file:344784KB dirty:264KB writeback:0KB inactive_anon:62144KB active_anon:104487672KB inactive_file:1508KB active_file:40KB unevictable:0KB
[Fri May 13 22:25:33 2022] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Fri May 13 22:25:33 2022] [56913] 0 56913 271 1 32768 0 1000 docker-init
[Fri May 13 22:25:33 2022] [56985] 0 56985 4540 781 81920 0 1000 bash
[Fri May 13 22:25:33 2022] [ 538] 0 538 4540 514 73728 0 1000 bash
[Fri May 13 22:25:33 2022] [ 540] 0 540 1094 162 57344 0 1000 sleep
[Fri May 13 22:25:33 2022] [50918] 0 50918 16378 984 163840 0 1000 sshd
[Fri May 13 22:25:33 2022] [29230] 0 29230 23231 1686 212992 0 1000 sshd
[Fri May 13 22:25:33 2022] [29337] 0 29337 3220 485 69632 0 1000 sftp-server
[Fri May 13 22:25:33 2022] [52248] 0 52248 23235 1731 217088 0 1000 sshd
[Fri May 13 22:25:33 2022] [52268] 0 52268 4621 887 77824 0 1000 bash
[Fri May 13 22:25:33 2022] [21352] 0 21352 12791 2062 131072 0 1000 vim
[Fri May 13 22:25:33 2022] [25338] 0 25338 12791 2081 135168 0 1000 vim
[Fri May 13 22:25:33 2022] [17019] 0 17019 2404 654 65536 0 1000 bash
[Fri May 13 22:25:33 2022] [17274] 0 17274 2404 631 65536 0 1000 bash
[Fri May 13 22:25:33 2022] [ 2912] 0 2912 23199 1720 225280 0 1000 sshd
[Fri May 13 22:25:33 2022] [ 2926] 0 2926 4611 875 77824 0 1000 bash
[Fri May 13 22:25:33 2022] [30026] 0 30026 14506230 5336525 44224512 0 1000 python
[Fri May 13 22:25:33 2022] [30027] 0 30027 14499677 5329772 44138496 0 1000 python
[Fri May 13 22:25:33 2022] [30029] 0 30029 14553669 5350753 44294144 0 1000 python
[Fri May 13 22:25:33 2022] [30031] 0 30031 14511896 5342122 44257280 0 1000 python
[Fri May 13 22:25:33 2022] [30032] 0 30032 14511890 5342005 44220416 0 1000 python
[Fri May 13 22:25:33 2022] [44777] 0 44777 1094 167 53248 0 1000 sleep
[Fri May 13 22:25:33 2022] Memory cgroup out of memory: Kill process 30029 (python) score 1198 or sacrifice child
[Fri May 13 22:25:33 2022] Killed process 30029 (python) total-vm:58214676kB, anon-rss:20876164kB, file-rss:439200kB, shmem-rss:87648kB
My memory information is as follows (swap was supposed to equal the memory size, but it is limited in the container):
root@a41cbab8ac5e:~# free -g
              total        used        free      shared  buff/cache   available
Mem:            251          91         123           0          36         157
Swap:             0           0           0
Please help me, thanks a lot!
I succeeded! I increased my memory allocation to 200 GB, but that still could not support training on 8 GPUs with batch size 56; it ran out of memory. So I used 4 GPUs with batch size 48, and it works! Maybe I should allocate more memory; I will try that later.
[2022-05-13 23:36:32 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][860/6672] eta 0:47:01 lr 0.000421 time 0.5162 (0.4855) loss 5.1306 (4.9953) grad_norm 2.3225 (2.8221) mem 7785MB
[2022-05-13 23:36:37 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][870/6672] eta 0:46:57 lr 0.000421 time 0.4454 (0.4856) loss 5.0256 (4.9957) grad_norm 3.3165 (2.8220) mem 7785MB
[2022-05-13 23:36:42 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][880/6672] eta 0:46:51 lr 0.000421 time 0.4642 (0.4855) loss 5.5207 (4.9914) grad_norm 2.4980 (2.8192) mem 7785MB
[2022-05-13 23:36:46 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][890/6672] eta 0:46:46 lr 0.000421 time 0.4609 (0.4853) loss 3.8626 (4.9899) grad_norm 2.3808 (2.8147) mem 7785MB
[2022-05-13 23:36:51 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][900/6672] eta 0:46:40 lr 0.000421 time 0.4697 (0.4851) loss 5.3830 (4.9896) grad_norm 2.3411 (2.8122) mem 7785MB
[2022-05-13 23:36:56 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][910/6672] eta 0:46:35 lr 0.000421 time 0.4822 (0.4852) loss 3.9433 (4.9898) grad_norm 2.6711 (2.8111) mem 7785MB
[2022-05-13 23:37:01 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][920/6672] eta 0:46:31 lr 0.000421 time 0.4837 (0.4852) loss 5.8288 (4.9927) grad_norm 2.7145 (2.8098) mem 7785MB
[2022-05-13 23:37:06 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][930/6672] eta 0:46:26 lr 0.000421 time 0.4757 (0.4852) loss 5.0991 (4.9949) grad_norm 2.9705 (2.8072) mem 7785MB
[2022-05-13 23:37:10 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][940/6672] eta 0:46:20 lr 0.000421 time 0.4732 (0.4850) loss 5.0595 (4.9934) grad_norm 3.3814 (2.8068) mem 7785MB
[2022-05-13 23:37:15 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][950/6672] eta 0:46:14 lr 0.000421 time 0.4738 (0.4849) loss 3.7327 (4.9873) grad_norm 2.5955 (2.8050) mem 7785MB
[2022-05-13 23:37:20 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][960/6672] eta 0:46:08 lr 0.000421 time 0.4729 (0.4847) loss 5.1887 (4.9857) grad_norm 2.2804 (2.8031) mem 7785MB
[2022-05-13 23:37:25 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][970/6672] eta 0:46:03 lr 0.000421 time 0.4771 (0.4846) loss 4.8851 (4.9864) grad_norm 3.8150 (2.8028) mem 7785MB
[2022-05-13 23:37:29 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][980/6672] eta 0:45:57 lr 0.000421 time 0.4817 (0.4845) loss 5.2875 (4.9864) grad_norm 2.5579 (2.8007) mem 7785MB
[2022-05-13 23:37:34 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][990/6672] eta 0:45:52 lr 0.000421 time 0.4870 (0.4844) loss 5.1356 (4.9867) grad_norm 2.7010 (2.8004) mem 7785MB
[2022-05-13 23:37:39 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1000/6672] eta 0:45:46 lr 0.000421 time 0.4722 (0.4843) loss 4.7810 (4.9856) grad_norm 3.4885 (2.8016) mem 7785MB
[2022-05-13 23:37:43 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1010/6672] eta 0:45:41 lr 0.000421 time 0.4764 (0.4841) loss 5.5549 (4.9877) grad_norm 2.4293 (2.7994) mem 7785MB
[2022-05-13 23:37:48 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1020/6672] eta 0:45:35 lr 0.000421 time 0.4646 (0.4840) loss 5.3800 (4.9887) grad_norm 2.8842 (2.7980) mem 7785MB
[2022-05-13 23:37:53 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1030/6672] eta 0:45:30 lr 0.000421 time 0.4699 (0.4839) loss 5.5460 (4.9855) grad_norm 2.5153 (2.7954) mem 7785MB
Memory usage (4 GPUs with batch size 48; I think the host memory usage has nothing to do with batch size):
root:~/workspace/Swin-Transformer# free -g
              total        used        free      shared  buff/cache   available
Mem:            251         181          24           2          45          65
Swap:             0           0           0
I wonder how to generate the zipped ImageNet labels, e.g. train_map.txt?
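In case it helps, here is a minimal sketch of how one might generate it from a standard ImageNet folder layout, assuming the map format expected by the repo's CachedImageFolder (one `<relative-path-in-zip>\t<class-index>` line per image, with class indices assigned by sorting the WordNet IDs); please check data/cached_image_folder.py and get_started.md for the exact delimiter and path layout:

```python
import os

SRC_DIR = "imagenet/train"          # placeholder: train/<wnid>/<image>.JPEG
MAP_TXT = "imagenet/train_map.txt"  # placeholder output path

# Class indices: sorted WordNet IDs -> 0..999, same convention as torchvision's ImageFolder.
wnids = sorted(d for d in os.listdir(SRC_DIR) if os.path.isdir(os.path.join(SRC_DIR, d)))
wnid_to_idx = {wnid: idx for idx, wnid in enumerate(wnids)}

with open(MAP_TXT, "w") as f:
    for wnid in wnids:
        class_dir = os.path.join(SRC_DIR, wnid)
        for name in sorted(os.listdir(class_dir)):
            if not name.lower().endswith((".jpeg", ".jpg", ".png")):
                continue
            # relative path as stored inside train.zip, e.g. "n01440764/xxx.JPEG"
            f.write(f"{wnid}/{name}\t{wnid_to_idx[wnid]}\n")
```

It is worth validating a few generated lines against a known-good val_map.txt before launching a full training run.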