Hi, while fine-tuning glm-10-chinese on my own dataset, training gets stuck at iteration 1000 and never moves on.
The command being run: `bash scripts/ds_finetune_seq2seq.sh config_tasks/model_blocklm_10B_chinese.sh config_tasks/seq_customization.sh`
```
[2023-04-04 15:54:41,170] [INFO] [logging.py:75:log_dist] [Rank 0] step=950, skipped=4, lr=[4.965931757736944e-06, 4.965931757736944e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-04-04 15:54:41,186] [INFO] [timer.py:198:stop] epoch=0/micro_step=950/global_step=950, RunningAvgSamplesPerSec=0.8879320556862361, CurrSamplesPerSec=0.8937787087653021, MemAllocated=4.71GB, MaxMemAllocated=10.76GB
time (ms) | forward: 3415.58 | backward: 9101.60 | allreduce: 0.00 | optimizer: 5408.99 | batch generator: 2.30
iteration 950/ 14080 | elapsed time per iteration (ms): 17928.4 | learning rate 4.966E-06 | lm loss 1.259398E+00 | loss scale 8192.0 |
[2023-04-04 16:09:36,820] [INFO] [logging.py:75:log_dist] [Rank 0] step=1000, skipped=4, lr=[4.948931636847196e-06, 4.948931636847196e-06], mom=[[0.9, 0.95], [0.9, 0.95]]
[2023-04-04 16:09:36,835] [INFO] [timer.py:198:stop] epoch=0/micro_step=1000/global_step=1000, RunningAvgSamplesPerSec=0.8882056657324592, CurrSamplesPerSec=0.8863697534402784, MemAllocated=4.71GB, MaxMemAllocated=10.76GB
time (ms) | forward: 3406.61 | backward: 9100.21 | allreduce: 0.00 | optimizer: 5403.09 | batch generator: 1.75
iteration 1000/ 14080 | elapsed time per iteration (ms): 17913.0 | learning rate 4.949E-06 | lm loss 1.261758E+00 | loss scale 8192.0 |
```
`nvidia-smi` output:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:CE:00.0 Off |                  N/A |
| 30%   36C    P2   139W / 350W |  21952MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:D1:00.0 Off |                  N/A |
| 30%   39C    P2   140W / 350W |  21952MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:D5:00.0 Off |                  N/A |
| 30%   39C    P2   136W / 350W |  21952MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  On   | 00000000:D6:00.0 Off |                  N/A |
| 30%   40C    P2   141W / 350W |  21952MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```
Making the batch size smaller doesn't help; it still hangs at iteration 1000.
Hi, may I ask what format your fine-tuning dataset is in?
`test.source test.target train.source train.target val.source val.target`
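Each `.source` file holds one example input per line, and the matching `.target` file holds the corresponding output on the same line number; that is how I prepared the text2sql pairs. A made-up example of one pair:

```
train.source:  查询所有学生的姓名
train.target:  SELECT name FROM student
```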
The format also follows the one you designed earlier; the content is mainly a Chinese text2sql dataset. I've found the problem appears to be the `eval-interval` parameter in config_tasks/seq_customization.sh: with it set to 1000, training hangs exactly at step 1000. How should this be resolved?
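For reference, this is roughly the kind of block the option sits in inside the GLM `config_tasks/*.sh` scripts; the surrounding option names and values here are from memory and only illustrative:

```sh
# Illustrative excerpt of a config_tasks/*.sh script -- only --eval-interval
# is the option under discussion; the other entries may differ in the real file.
COMMON_ARGS="--save-interval 10000 \
             --log-interval 50 \
             --eval-interval 1000 \
             --eval-iters 100"
```

As a blunt stopgap, raising `--eval-interval` above the total number of iterations (14080 in my run) should keep training from ever reaching an evaluation step, assuming the hang really is triggered only by evaluation.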
Same problem here: training hangs at exactly whatever step `eval-interval` is set to, stuck inside `model.backward(loss)`.
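For what it's worth, here is a minimal sketch of the kind of rank-divergent branching that produces exactly this symptom. This is not GLM's actual code: `evaluate` is a hypothetical helper, and `model` stands in for a DeepSpeed engine.

```python
# Minimal sketch of a collective mismatch -- NOT GLM's actual code.
# In data-/model-parallel training, every rank must issue the same
# sequence of NCCL collectives.

def evaluate(model, dataloader):
    """Hypothetical eval loop; issues collectives on every rank that enters
    (broadcasting eval batches, all-reducing metrics)."""
    for batch in dataloader:
        model(batch)

def train(model, train_dataloader, valid_dataloader, eval_interval):
    for step, batch in enumerate(train_dataloader, start=1):
        loss = model(batch)
        model.backward(loss)   # gradient all-reduce happens inside here
        model.step()

        # BUGGY pattern: if valid_dataloader is None on some ranks (data is
        # only read on rank 0, other ranks get a fake loader or nothing),
        # only part of the ranks enter evaluate(). Their collectives then
        # pair up with the other ranks' *next* backward all-reduce, and
        # every rank blocks forever -- precisely at step == eval_interval.
        if step % eval_interval == 0 and valid_dataloader is not None:
            evaluate(model, valid_dataloader)

        # SAFE pattern: branch only on values identical across ranks (e.g.
        # the step counter alone) and make sure every rank enters
        # evaluate() with matching collectives.
```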
A similar issue: https://github.com/THUDM/GLM/issues/17. I additionally set the validation data to None on ranks other than 0, after which the hang no longer occurs, but this is not a root-cause fix for the bug:

```python
if mpu.get_model_parallel_rank() != 0:
    args.train_iters_per_epoch = train_iters[0].item()
    args.train_iters = args.epochs * args.train_iters_per_epoch

    train_dataloader = FakeDataloader(args.train_iters_per_epoch)
    # Workaround: skip validation entirely on non-zero model-parallel ranks.
    valid_dataloader = None
    # if args.no_validation:
    #     valid_dataloader = None
    # else:
    #     valid_dataloader = FakeDataloader(None)
```
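If that reading is right, the underlying problem is mismatched collectives: some ranks go through the eval loop while others do not, or issue a different sequence of operations inside it. Setting `valid_dataloader = None` sidesteps whatever goes wrong in the eval path on those ranks, but a real fix would have to make the collectives issued during evaluation line up across all ranks, which the original `FakeDataloader(None)` path apparently fails to guarantee.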
But doesn't that mean no evaluation is performed during training at all?