
[BUG]: The accuracy of vit is very low

Open fearless1007 opened this issue 2 years ago • 2 comments

🐛 Describe the bug

Question: When I trained ViT on the ImageNet-1k and CIFAR-10 datasets, I repeatedly adjusted the parameters according to the official ViT configuration, but the accuracy stayed very low and the loss kept fluctuating, regardless of which hook and engine I used.

[06/15/23 18:36:59] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:97 after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 2 / Train]: Loss = 2.2553 | LR = 0.003 | Throughput = 88.011
[Epoch 2 / Test]: 100%|██████████| 625/625 [00:45<00:00, 13.70it/s, accuracy=0.25, loss=2.07, throughput=220.86 sample_per_sec]
[06/15/23 18:37:44] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 2 / Test]: Accuracy = 0.1674 | Loss = 2.2194 | Throughput = 221.39
[Epoch 3 / Train]: 100%|██████████| 3125/3125 [09:34<00:00, 5.44it/s, loss=2.33, lr=0.003, throughput=89.562 sample_per_sec]
[06/15/23 18:47:19] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:97 after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 3 / Train]: Loss = 2.2532 | LR = 0.0029967 | Throughput = 87.392
[Epoch 3 / Test]: 100%|██████████| 625/625 [00:45<00:00, 13.64it/s, accuracy=0.3125, loss=2.04, throughput=220.79 sample_per_sec]
[06/15/23 18:48:05] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 3 / Test]: Accuracy = 0.1535 | Loss = 2.1995 | Throughput = 220.33
[Epoch 4 / Train]: 100%|██████████| 3125/3125 [09:35<00:00, 5.43it/s, loss=2.29, lr=0.003, throughput=90.803 sample_per_sec]
[06/15/23 18:57:40] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:97 after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 4 / Train]: Loss = 2.2508 | LR = 0.0029866 | Throughput = 87.229
[Epoch 4 / Test]: 100%|██████████| 625/625 [00:45<00:00, 13.66it/s, accuracy=0.4375, loss=2.07, throughput=221.52 sample_per_sec]
[06/15/23 18:58:26] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 4 / Test]: Accuracy = 0.2011 | Loss = 2.1661 | Throughput = 220.66
[Epoch 5 / Train]: 100%|█████████▉| 3124/3125 [09:30<00:00, 5.60it/s, loss=2.08, lr=0.00299, throughput=89.878 sample_per_sec]
[Epoch 5 / Train]: 100%|██████████| 3125/3125 [09:30<00:00, 5.47it/s, loss=2.08, lr=0.00299, throughput=89.878 sample_per_sec]
[06/15/23 19:07:57] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:97 after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 5 / Train]: Loss = 2.25 | LR = 0.0029699 | Throughput = 87.947
[Epoch 5 / Test]: 100%|██████████| 625/625 [00:45<00:00, 13.68it/s, accuracy=0.125, loss=2.07, throughput=220.96 sample_per_sec]
[06/15/23 19:08:42] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 5 / Test]: Accuracy = 0.1559 | Loss = 2.1761 | Throughput = 221.07
[Epoch 6 / Train]: 100%|██████████| 3125/3125 [09:34<00:00, 5.44it/s, loss=2.4, lr=0.00297, throughput=83.413 sample_per_sec]
[06/15/23 19:18:17] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:97 after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 6 / Train]: Loss = 2.2504 | LR = 0.0029467 | Throughput = 87.337
[Epoch 6 / Test]: 100%|█████████▉| 624/625 [00:45<00:00, 13.65it/s, accuracy=0.125, loss=2.19, throughput=220.65 sample_per_sec]
[Epoch 6 / Test]: 100%|██████████| 625/625 [00:45<00:00, 13.65it/s, accuracy=0.125, loss=2.19, throughput=220.65 sample_per_sec]
[06/15/23 19:19:03] INFO colossalai - colossalai - INFO:
/home/wangzhigangcs/anaconda3/envs/zxcolo39/lib/python3.9/site-packages/colossalai/trainer/hooks/_log_hook.py:104 after_test_epoch
INFO colossalai - colossalai - INFO: [Epoch 6 / Test]: Accuracy = 0.1473 | Loss = 2.2289 | Throughput = 220.59
[Epoch 7 / Train]: 0%| | 0/3125 [00:00<?, ?it/s]
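To put the numbers in the log into perspective, here is a small sketch that pulls the per-epoch test summaries out of the log and compares them with chance level for a 10-class problem. The `LOG` string holds the five test summary lines from the log with the wrapped lines joined; `parse_test_metrics` and `PATTERN` are names made up for this illustration and are not part of ColossalAI. For 10 classes, random guessing gives an accuracy of 0.10 and a cross-entropy loss of ln(10) ≈ 2.3026.

```python
import math
import re

# Per-epoch test summaries copied from the log (wrapped lines joined).
LOG = """\
[Epoch 2 / Test]: Accuracy = 0.1674 | Loss = 2.2194 | Throughput = 221.39
[Epoch 3 / Test]: Accuracy = 0.1535 | Loss = 2.1995 | Throughput = 220.33
[Epoch 4 / Test]: Accuracy = 0.2011 | Loss = 2.1661 | Throughput = 220.66
[Epoch 5 / Test]: Accuracy = 0.1559 | Loss = 2.1761 | Throughput = 221.07
[Epoch 6 / Test]: Accuracy = 0.1473 | Loss = 2.2289 | Throughput = 220.59
"""

NUM_CLASSES = 10  # from the model config in this issue

# Hypothetical helper pattern: matches one test summary line.
PATTERN = re.compile(
    r"\[Epoch (\d+) / Test\]: Accuracy = ([\d.]+) \| Loss = ([\d.]+)"
)


def parse_test_metrics(text):
    """Return a list of (epoch, accuracy, loss) tuples found in the log text."""
    return [(int(e), float(a), float(l)) for e, a, l in PATTERN.findall(text)]


# Accuracy and cross-entropy loss of a uniform random guess.
chance_accuracy = 1.0 / NUM_CLASSES
chance_loss = math.log(NUM_CLASSES)

metrics = parse_test_metrics(LOG)
for epoch, acc, loss in metrics:
    print(
        f"epoch {epoch}: accuracy {acc:.4f} (chance {chance_accuracy:.2f}), "
        f"loss {loss:.4f} (chance {chance_loss:.4f})"
    )
```

Read this way, the test loss stays within roughly 0.15 of the chance-level loss over epochs 2 to 6 and the accuracy stays between about 0.15 and 0.20, so the model appears to be barely better than random guessing rather than converging slowly.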

Environment

train_file: train_with_cifar10 and train_with_imagenet_1k

configuration:

from colossalai.amp import AMP_TYPE

# hyperparameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 16
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3
NUM_EPOCHS = 50
WARMUP_EPOCHS = 3

# model config
IMG_SIZE = 224
PATCH_SIZE = 16
HIDDEN_SIZE = 1024
DEPTH = 24
NUM_HEADS = 16
MLP_RATIO = 4
NUM_CLASSES = 10
CHECKPOINT = False
SEQ_LENGTH = (IMG_SIZE // PATCH_SIZE)**2 + 1  # add 1 for cls token

# parallel setting
TENSOR_PARALLEL_SIZE = 8
TENSOR_PARALLEL_MODE = '1d'
parallel = dict(
    pipeline=1,
    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
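The config comment says the global batch size is BATCH_SIZE times the data parallel size, so it may be worth working out what that comes to here. The sketch below does the arithmetic under stated assumptions: `WORLD_SIZE = 8` (one node with 8 GPUs) is a guess, since the issue does not say how many GPUs were used, and `REFERENCE_BATCH_SIZE = 4096` is assumed to be the batch size that the learning rate of 3e-3 was tuned for in the reference ViT setup. The helper names are hypothetical. The linear scaling rule used at the end is a common heuristic, not something the issue or ColossalAI prescribes.

```python
# Values copied from the config in this issue.
BATCH_SIZE = 16
LEARNING_RATE = 3e-3
TENSOR_PARALLEL_SIZE = 8
PIPELINE_PARALLEL_SIZE = 1

# Assumptions that are NOT stated in the issue.
WORLD_SIZE = 8               # hypothetical: a single node with 8 GPUs
REFERENCE_BATCH_SIZE = 4096  # assumed batch size the 3e-3 learning rate targets
CIFAR10_TRAIN_SIZE = 50_000  # size of the CIFAR-10 training split


def data_parallel_size(world_size, tensor_size, pipeline_size):
    """GPUs left over for data parallelism after tensor/pipeline sharding."""
    model_parallel = tensor_size * pipeline_size
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor * pipeline size")
    return world_size // model_parallel


def linear_scaled_lr(base_lr, base_batch, batch):
    """Linear scaling heuristic: learning rate proportional to batch size."""
    return base_lr * batch / base_batch


dp_size = data_parallel_size(WORLD_SIZE, TENSOR_PARALLEL_SIZE, PIPELINE_PARALLEL_SIZE)
global_batch = BATCH_SIZE * dp_size
steps_per_epoch = CIFAR10_TRAIN_SIZE // global_batch
scaled_lr = linear_scaled_lr(LEARNING_RATE, REFERENCE_BATCH_SIZE, global_batch)

print(f"data parallel size: {dp_size}")
print(f"global batch size:  {global_batch}")
print(f"steps per epoch:    {steps_per_epoch}")
print(f"linearly scaled lr: {scaled_lr:.3e}")
```

Under these assumptions the data parallel size is 1 and the global batch size is 16, which gives 3125 steps per epoch and matches the `3125/3125` progress bars in the log. A learning rate of 3e-3 at a global batch size of 16 is far above what the linear scaling heuristic would suggest, so lowering the learning rate or raising the effective batch size might be worth trying. This is only one candidate explanation and has not been verified against this setup.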

fearless1007, Jun 15 '23 11:06

Colossal AI is not the primary cause of this result.

flybird11111, Jun 20 '23 03:06

Hi, did you solve this problem? I'm having the same issue :c

jinlovespho, Apr 14 '24 12:04