Swin-Transformer
zipped ImageNet processing scripts
Hi, can you provide processing scripts for zipped ImageNet ?
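For reference, a minimal sketch of what such a packing script might look like, assuming the zipped layout described in the repo's get_started.md (a train.zip that contains the class sub-folders, consumed together with a train_map.txt). The paths below are placeholders, and ZIP_STORED is used because the JPEGs are already compressed:

```python
import os
import zipfile

# Placeholder paths -- adjust to your own ImageNet layout.
SRC_DIR = "imagenet/train"      # standard layout: train/<wnid>/<image>.JPEG
DST_ZIP = "imagenet/train.zip"

# ZIP_STORED: skip re-compression so zipping is fast and random reads stay cheap.
with zipfile.ZipFile(DST_ZIP, "w", compression=zipfile.ZIP_STORED) as zf:
    for root, _, files in os.walk(SRC_DIR):
        for name in sorted(files):
            if not name.lower().endswith((".jpeg", ".jpg", ".png")):
                continue
            abs_path = os.path.join(root, name)
            # store paths relative to SRC_DIR, e.g. "n01440764/n01440764_10026.JPEG"
            arcname = os.path.relpath(abs_path, SRC_DIR)
            zf.write(abs_path, arcname)
```

The same approach works for val.zip; whether the archive paths should include the class-folder prefix is best confirmed against the loader in data/cached_image_folder.py.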
+1, I found that it needs more than 100 GB of memory during data preparation when training Swin-Transformer, which is surprising. Here is some information about this issue.
Conditions:
TAG: default
TEST:
  CROP: true
  SEQUENTIAL: false
THROUGHPUT_MODE: false
TRAIN:
  ACCUMULATION_STEPS: 0
  AUTO_RESUME: true
  BASE_LR: 0.0004375
  CLIP_GRAD: 5.0
  EPOCHS: 300
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    NAME: cosine
  MIN_LR: 4.3750000000000005e-06
  OPTIMIZER:
    BETAS:
    - 0.9
    - 0.999
    EPS: 1.0e-08
    MOMENTUM: 0.9
    NAME: adamw
  START_EPOCH: 0
  USE_CHECKPOINT: false
  WARMUP_EPOCHS: 20
  WARMUP_LR: 4.375e-07
  WEIGHT_DECAY: 0.05
global_rank 6 cached 0/1281167 takes 0.00s per block
global_rank 3 cached 0/1281167 takes 0.00s per block
global_rank 5 cached 0/1281167 takes 0.00s per block
global_rank 2 cached 0/1281167 takes 0.00s per block
global_rank 0 cached 0/1281167 takes 0.00s per block
global_rank 7 cached 0/1281167 takes 0.00s per block
global_rank 1 cached 0/1281167 takes 0.00s per block
global_rank 4 cached 0/1281167 takes 0.00s per block
global_rank 6 cached 128116/1281167 takes 52.54s per block
global_rank 5 cached 128116/1281167 takes 52.40s per block
global_rank 4 cached 128116/1281167 takes 51.70s per block
global_rank 7 cached 128116/1281167 takes 52.25s per block
global_rank 0 cached 128116/1281167 takes 52.32s per block
global_rank 2 cached 128116/1281167 takes 52.33s per block
global_rank 3 cached 128116/1281167 takes 52.48s per block
global_rank 1 cached 128116/1281167 takes 52.20s per block
global_rank 0 cached 256232/1281167 takes 25.78s per block
global_rank 7 cached 256232/1281167 takes 25.78s per block
global_rank 6 cached 256232/1281167 takes 25.78s per block
global_rank 3 cached 256232/1281167 takes 25.78s per block
global_rank 4 cached 256232/1281167 takes 25.78s per block
global_rank 5 cached 256232/1281167 takes 25.78s per block
global_rank 2 cached 256232/1281167 takes 25.78s per block
global_rank 1 cached 256232/1281167 takes 25.78s per block
The counter goes up to the total number of train/val images. Does it cache the image data or just the file list? My memory runs out, so I think it is caching the image data.
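For what it's worth, a rough estimate of how much host RAM that caching would need (assuming the loader keeps the raw JPEG bytes of each rank's shard in memory, and that all 8 ranks share one node) can be computed from the zip itself:

```python
import zipfile

TRAIN_ZIP = "imagenet/train.zip"   # placeholder path
RANKS_PER_NODE = 8                 # GPUs sharing the same host memory

with zipfile.ZipFile(TRAIN_ZIP) as zf:
    # compress_size = bytes actually stored in the archive; this is roughly
    # what ends up in RAM if the raw (still-encoded) images are cached.
    total = sum(info.compress_size for info in zf.infolist())

print(f"zip payload          : {total / 2**30:.1f} GiB")
print(f"per-rank shard       : {total / RANKS_PER_NODE / 2**30:.1f} GiB")
print(f"whole node (8 ranks) : {total / 2**30:.1f} GiB plus Python/PIL overhead")
```

The ImageNet-1k training JPEGs total roughly 140 GB, which is consistent with the ~100 GB of cgroup RSS seen in the dmesg output below at about 70% of the caching progress.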
The failure looks like this:
global_rank 3 cached 640580/1281167 takes 28.99s per block
global_rank 0 cached 768696/1281167 takes 27.38s per block
global_rank 1 cached 768696/1281167 takes 27.37s per block
global_rank 2 cached 768696/1281167 takes 27.38s per block
global_rank 4 cached 768696/1281167 takes 27.38s per block
global_rank 7 cached 768696/1281167 takes 27.38s per block
global_rank 5 cached 768696/1281167 takes 27.38s per block
global_rank 6 cached 768696/1281167 takes 27.38s per block
global_rank 2 cached 896812/1281167 takes 27.98s per block
global_rank 6 cached 896812/1281167 takes 27.98s per block
global_rank 7 cached 896812/1281167 takes 27.98s per block
global_rank 4 cached 896812/1281167 takes 27.98s per block
global_rank 0 cached 896812/1281167 takes 27.99s per block
global_rank 1 cached 896812/1281167 takes 27.99s per block
Traceback (most recent call last):
File "/root/anaconda3/envs/swin/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/anaconda3/envs/swin/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/swin/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in
Output of the following command:
dmesg -T | grep -E -i -B100 'killed process'
[Fri May 13 22:25:33 2022] Memory cgroup stats for /docker/a41cbab8ac5e0d3680e664e26bcc8890070bf4607a0744f00051fcc046a9ca1b: cache:345232KB rss:104203124KB rss_huge:0KB shmem:342340KB mapped_file:344784KB dirty:264KB writeback:0KB inactive_anon:62144KB active_anon:104487672KB inactive_file:1508KB active_file:40KB unevictable:0KB
[Fri May 13 22:25:33 2022] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Fri May 13 22:25:33 2022] [56913] 0 56913 271 1 32768 0 1000 docker-init
[Fri May 13 22:25:33 2022] [56985] 0 56985 4540 781 81920 0 1000 bash
[Fri May 13 22:25:33 2022] [ 538] 0 538 4540 514 73728 0 1000 bash
[Fri May 13 22:25:33 2022] [ 540] 0 540 1094 162 57344 0 1000 sleep
[Fri May 13 22:25:33 2022] [50918] 0 50918 16378 984 163840 0 1000 sshd
[Fri May 13 22:25:33 2022] [29230] 0 29230 23231 1686 212992 0 1000 sshd
[Fri May 13 22:25:33 2022] [29337] 0 29337 3220 485 69632 0 1000 sftp-server
[Fri May 13 22:25:33 2022] [52248] 0 52248 23235 1731 217088 0 1000 sshd
[Fri May 13 22:25:33 2022] [52268] 0 52268 4621 887 77824 0 1000 bash
[Fri May 13 22:25:33 2022] [21352] 0 21352 12791 2062 131072 0 1000 vim
[Fri May 13 22:25:33 2022] [25338] 0 25338 12791 2081 135168 0 1000 vim
[Fri May 13 22:25:33 2022] [17019] 0 17019 2404 654 65536 0 1000 bash
[Fri May 13 22:25:33 2022] [17274] 0 17274 2404 631 65536 0 1000 bash
[Fri May 13 22:25:33 2022] [ 2912] 0 2912 23199 1720 225280 0 1000 sshd
[Fri May 13 22:25:33 2022] [ 2926] 0 2926 4611 875 77824 0 1000 bash
[Fri May 13 22:25:33 2022] [30026] 0 30026 14506230 5336525 44224512 0 1000 python
[Fri May 13 22:25:33 2022] [30027] 0 30027 14499677 5329772 44138496 0 1000 python
[Fri May 13 22:25:33 2022] [30029] 0 30029 14553669 5350753 44294144 0 1000 python
[Fri May 13 22:25:33 2022] [30031] 0 30031 14511896 5342122 44257280 0 1000 python
[Fri May 13 22:25:33 2022] [30032] 0 30032 14511890 5342005 44220416 0 1000 python
[Fri May 13 22:25:33 2022] [44777] 0 44777 1094 167 53248 0 1000 sleep
[Fri May 13 22:25:33 2022] Memory cgroup out of memory: Kill process 30029 (python) score 1198 or sacrifice child
[Fri May 13 22:25:33 2022] Killed process 30029 (python) total-vm:58214676kB, anon-rss:20876164kB, file-rss:439200kB, shmem-rss:87648kB
My memory information is as follows (swap was supposed to equal the memory size, but it is limited in the container):
root@a41cbab8ac5e:~# free -g
              total        used        free      shared  buff/cache   available
Mem:            251          91         123           0          36         157
Swap:             0           0           0
Please help me, thanks a lot!
I succeeded! I increased my memory allocation to 200 GB, but that still could not support training on 8 GPUs with batch size 56; it ran out of memory. So I used 4 GPUs with batch size 48, and it works! Maybe I should allocate more memory; I will try that later.
[2022-05-13 23:36:32 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][860/6672] eta 0:47:01 lr 0.000421 time 0.5162 (0.4855) loss 5.1306 (4.9953) grad_norm 2.3225 (2.8221) mem 7785MB
[2022-05-13 23:36:37 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][870/6672] eta 0:46:57 lr 0.000421 time 0.4454 (0.4856) loss 5.0256 (4.9957) grad_norm 3.3165 (2.8220) mem 7785MB
[2022-05-13 23:36:42 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][880/6672] eta 0:46:51 lr 0.000421 time 0.4642 (0.4855) loss 5.5207 (4.9914) grad_norm 2.4980 (2.8192) mem 7785MB
[2022-05-13 23:36:46 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][890/6672] eta 0:46:46 lr 0.000421 time 0.4609 (0.4853) loss 3.8626 (4.9899) grad_norm 2.3808 (2.8147) mem 7785MB
[2022-05-13 23:36:51 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][900/6672] eta 0:46:40 lr 0.000421 time 0.4697 (0.4851) loss 5.3830 (4.9896) grad_norm 2.3411 (2.8122) mem 7785MB
[2022-05-13 23:36:56 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][910/6672] eta 0:46:35 lr 0.000421 time 0.4822 (0.4852) loss 3.9433 (4.9898) grad_norm 2.6711 (2.8111) mem 7785MB
[2022-05-13 23:37:01 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][920/6672] eta 0:46:31 lr 0.000421 time 0.4837 (0.4852) loss 5.8288 (4.9927) grad_norm 2.7145 (2.8098) mem 7785MB
[2022-05-13 23:37:06 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][930/6672] eta 0:46:26 lr 0.000421 time 0.4757 (0.4852) loss 5.0991 (4.9949) grad_norm 2.9705 (2.8072) mem 7785MB
[2022-05-13 23:37:10 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][940/6672] eta 0:46:20 lr 0.000421 time 0.4732 (0.4850) loss 5.0595 (4.9934) grad_norm 3.3814 (2.8068) mem 7785MB
[2022-05-13 23:37:15 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][950/6672] eta 0:46:14 lr 0.000421 time 0.4738 (0.4849) loss 3.7327 (4.9873) grad_norm 2.5955 (2.8050) mem 7785MB
[2022-05-13 23:37:20 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][960/6672] eta 0:46:08 lr 0.000421 time 0.4729 (0.4847) loss 5.1887 (4.9857) grad_norm 2.2804 (2.8031) mem 7785MB
[2022-05-13 23:37:25 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][970/6672] eta 0:46:03 lr 0.000421 time 0.4771 (0.4846) loss 4.8851 (4.9864) grad_norm 3.8150 (2.8028) mem 7785MB
[2022-05-13 23:37:29 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][980/6672] eta 0:45:57 lr 0.000421 time 0.4817 (0.4845) loss 5.2875 (4.9864) grad_norm 2.5579 (2.8007) mem 7785MB
[2022-05-13 23:37:34 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][990/6672] eta 0:45:52 lr 0.000421 time 0.4870 (0.4844) loss 5.1356 (4.9867) grad_norm 2.7010 (2.8004) mem 7785MB
[2022-05-13 23:37:39 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1000/6672] eta 0:45:46 lr 0.000421 time 0.4722 (0.4843) loss 4.7810 (4.9856) grad_norm 3.4885 (2.8016) mem 7785MB
[2022-05-13 23:37:43 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1010/6672] eta 0:45:41 lr 0.000421 time 0.4764 (0.4841) loss 5.5549 (4.9877) grad_norm 2.4293 (2.7994) mem 7785MB
[2022-05-13 23:37:48 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1020/6672] eta 0:45:35 lr 0.000421 time 0.4646 (0.4840) loss 5.3800 (4.9887) grad_norm 2.8842 (2.7980) mem 7785MB
[2022-05-13 23:37:53 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1030/6672] eta 0:45:30 lr 0.000421 time 0.4699 (0.4839) loss 5.5460 (4.9855) grad_norm 2.5153 (2.7954) mem 7785MB
Memory usage (4 GPUs with batch size 48; I think the host memory usage has nothing to do with batch size):
root:~/workspace/Swin-Transformer# free -g
              total        used        free      shared  buff/cache   available
Mem:            251         181          24           2          45          65
Swap:             0           0           0
I wonder how to generate the zipped ImageNet labels, e.g. train_map.txt?
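In case it helps, here is a minimal sketch of how one might generate it from a standard ImageNet folder layout, assuming the map format expected by the repo's CachedImageFolder (one `<relative-path-in-zip>\t<class-index>` line per image, with class indices assigned by sorting the WordNet IDs); please check data/cached_image_folder.py and get_started.md for the exact delimiter and path layout:

```python
import os

SRC_DIR = "imagenet/train"          # placeholder: train/<wnid>/<image>.JPEG
MAP_TXT = "imagenet/train_map.txt"  # placeholder output path

# Class indices: sorted WordNet IDs -> 0..999, same convention as torchvision's ImageFolder.
wnids = sorted(d for d in os.listdir(SRC_DIR) if os.path.isdir(os.path.join(SRC_DIR, d)))
wnid_to_idx = {wnid: idx for idx, wnid in enumerate(wnids)}

with open(MAP_TXT, "w") as f:
    for wnid in wnids:
        class_dir = os.path.join(SRC_DIR, wnid)
        for name in sorted(os.listdir(class_dir)):
            if not name.lower().endswith((".jpeg", ".jpg", ".png")):
                continue
            # relative path as stored inside train.zip, e.g. "n01440764/xxx.JPEG"
            f.write(f"{wnid}/{name}\t{wnid_to_idx[wnid]}\n")
```

It is worth validating a few generated lines against a known-good val_map.txt before launching a full training run.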