Program hangs with no output.
I am running instruction tuning of llama3_llava on my own dataset with the following script:
NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero3_offload --seed 1024
After the output below, the program stops printing anything but keeps running:
- mmengine - INFO - Iter(train) [ 10/23076] lr: 1.3034e-07 eta: 3 days, 3:30:35 time: 11.7851 data_time: 0.0298 memory: 15547 loss: nan
- mmengine - INFO - Iter(train) [ 20/23076] lr: 2.7506e-07 eta: 3 days, 5:46:56 time: 12.5050 data_time: 0.0199 memory: 9964 loss: nan
It has been stuck in this state for 2 hours. What could be the cause?
@Luo-Z13 The total number of iterations is a bit strange. Did you modify the settings in config?
My script:
NPROC_PER_NODE=${GPU_NUM} xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune \
--deepspeed deepspeed_zero3_offload --seed 1024
The training schedule:
# Scheduler & Optimizer
batch_size = 4 # per_device
accumulative_counts = 4
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
Then I modified save_steps and changed the path-related settings to point to my own data and local paths. Besides that, there were no other changes.
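(For context, these variables are consumed by the optimizer wrapper further down in the config. The sketch below shows roughly what that section looks like in the stock xtuner LLaVA config; it is written from memory, so details may differ slightly from the actual file.)
from mmengine.optim import AmpOptimWrapper
from torch.optim import AdamW

# lr, betas, weight_decay, max_norm, accumulative_counts and optim_type (= AdamW)
# refer to the variables defined in the snippet above.
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')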
@Luo-Z13 How many GPUs are you using for training?
I am using 4 A100 (40G) GPUs.
Also, the pre-training of LLaVA-llama3 ran normally.
@Luo-Z13
Under your configuration, the total dataset size is 4 * 4 * 23076 = 369216 samples. However, the LLaVA fine-tuning dataset contains ~650000 samples. This size mismatch seems a bit unusual. Have you modified the training data?
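A quick restatement of that arithmetic as a sketch (it assumes each logged iteration consumes one per-device batch on every GPU; gradient accumulation does not change how iterations are counted in the log):
num_gpus = 4
batch_size_per_device = 4
iters_per_epoch = 23076

samples_per_epoch = num_gpus * batch_size_per_device * iters_per_epoch
print(samples_per_epoch)  # 369216 -- well below the ~650000 samples of the LLaVA mix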
Hello, I'm using my own instruction-tuning data, so the total number of iterations is different. Do I need to check the format of my dataset?
@Luo-Z13
Yes, that's possible. I suggest comparing your data format and content with llava's to see if the issue lies within the data.
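For reference, a single entry in the LLaVA fine-tuning JSON looks roughly like the sketch below, written as a Python literal; the field names follow the public llava_v1_5_mix665k data, and the concrete image path and texts here are only illustrative:
sample_entry = {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat are the colors of the bus in the image?"},
        {"from": "gpt", "value": "The bus in the image is white and red."},
    ],
}
# Text-only samples omit the "image" key and the "<image>" placeholder.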
Additionally, here are some other suggestions:
- Keep the global batch size at 128. In your case, consider setting accumulative_counts to 8 (see the sketch after this list).
- Adjust the learning rate to 2e-5. (Of course, 1e-5 should also work fine.)
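Applied to the scheduler/optimizer snippet shown earlier, the two suggestions would look roughly like this (a sketch; 4 GPUs and a per-device batch size of 4 assumed, as in this thread):
batch_size = 4               # per_device, unchanged
accumulative_counts = 8      # was 4 -> global batch = 4 GPUs * 4 * 8 = 128
lr = 2e-5                    # was 1e-5 (1e-5 should also work)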
Thank you very much, I will try them.
Thank you for your suggestions. The loss is now normal, but there is a new problem: after the first ~140 iterations, the iteration time becomes very slow, as shown below:
...
04/30 01:03:42 - mmengine - INFO - Iter(train) [ 100/23072] lr: 2.8656e-06 eta: 1 day, 13:54:11 time: 5.0159 data_time: 0.0117 memory: 9195 loss: 1.8088
04/30 01:04:31 - mmengine - INFO - Iter(train) [ 110/23072] lr: 3.1550e-06 eta: 1 day, 13:16:56 time: 4.8975 data_time: 0.0158 memory: 9167 loss: 1.3998
04/30 01:05:52 - mmengine - INFO - Iter(train) [ 120/23072] lr: 3.4444e-06 eta: 1 day, 14:26:13 time: 8.0493 data_time: 0.0101 memory: 9146 loss: 1.3203
04/30 01:06:42 - mmengine - INFO - Iter(train) [ 130/23072] lr: 3.7339e-06 eta: 1 day, 13:56:50 time: 5.0641 data_time: 0.0210 memory: 9125 loss: 1.2123
04/30 01:07:35 - mmengine - INFO - Iter(train) [ 140/23072] lr: 4.0233e-06 eta: 1 day, 13:37:29 time: 5.2818 data_time: 0.0184 memory: 9104 loss: 1.0494
04/30 03:07:10 - mmengine - INFO - Iter(train) [ 150/23072] lr: 4.3127e-06 eta: 14 days, 3:40:15 time: 717.5106 data_time: 0.0726 memory: 9090 loss: 0.8822
04/30 03:42:26 - mmengine - INFO - Iter(train) [ 160/23072] lr: 4.6022e-06 eta: 16 days, 18:28:43 time: 211.6158 data_time: 0.1037 memory: 9069 loss: 0.8258
04/30 06:08:50 - mmengine - INFO - Iter(train) [ 170/23072] lr: 4.8916e-06 eta: 29 days, 11:20:41 time: 878.3878 data_time: 0.1637 memory: 9055 loss: 0.7201
04/30 09:23:10 - mmengine - INFO - Iter(train) [ 180/23072] lr: 5.1810e-06 eta: 44 days, 23:40:10 time: 1165.9963 data_time: 0.1712 memory: 9041 loss: 0.7931
What could be the cause of this? @LZHgrla
@Luo-Z13 It seems to be caused by fluctuations in machine performance. Can this issue be reliably reproduced, and which commands did you use?