
Training finishes in about 5 hours, but the loss stays at 0 the whole time. Is that normal?

Open TccccD opened this issue 1 year ago • 9 comments

Training finishes in about 5 hours, but the loss stays at 0 the whole time. Is that normal?

{'loss': 0.0, 'learning_rate': 1.9230769230769234e-07, 'epoch': 0.99}
{'loss': 0.0, 'learning_rate': 1.730769230769231e-07, 'epoch': 0.99}
{'loss': 0.0, 'learning_rate': 1.5384615384615387e-07, 'epoch': 0.99}
{'loss': 0.0, 'learning_rate': 1.3461538461538464e-07, 'epoch': 0.99}
{'loss': 0.0, 'learning_rate': 1.153846153846154e-07, 'epoch': 0.99}
{'loss': 0.0, 'learning_rate': 9.615384615384617e-08, 'epoch': 1.0}
{'loss': 0.0, 'learning_rate': 7.692307692307694e-08, 'epoch': 1.0}
{'loss': 0.0, 'learning_rate': 5.76923076923077e-08, 'epoch': 1.0}
{'loss': 0.0, 'learning_rate': 3.846153846153847e-08, 'epoch': 1.0}
{'loss': 0.0, 'learning_rate': 1.9230769230769234e-08, 'epoch': 1.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 1.0}

@mymusise

TccccD avatar Mar 20 '23 01:03 TccccD

On my side the loss ends up around 1 after training finishes. My bs is 4.

mymusise avatar Mar 20 '23 01:03 mymusise

On my side the loss ends up around 1 after training finishes. My bs is 4.

I'm on the version where bs can only be 1. Besides allowing bs > 1, did the newer version change anything else?

TccccD avatar Mar 20 '23 01:03 TccccD

Also, about the --fp16 flag: if I add it, I get a half-precision error; if I drop it, training runs fine. Could that be why loss = 0?

Traceback (most recent call last):
  File "finetune.py", line 93, in <module>
    main()
  File "finetune.py", line 85, in main
    trainer.train()
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2655, in training_step
    self.scaler.scale(loss).backward()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 456, in backward
    grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype_A)
RuntimeError: expected scalar type Half but found Float
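
Looking at the bottom frame, the RuntimeError is a plain dtype mismatch inside bitsandbytes' backward: the two operands of that torch.matmul reach it with different dtypes (float32 vs float16). A tiny standalone repro of the same class of error, plain PyTorch only, needs a CUDA device, and the exact message wording can vary across torch versions:

import torch

# One operand float32, the other float16: matmul does not promote them
# and raises the same kind of dtype-mismatch RuntimeError.
a = torch.randn(2, 3, device="cuda", dtype=torch.float32)  # stand-in for grad_output
b = torch.randn(3, 4, device="cuda", dtype=torch.float16)  # stand-in for CB
try:
    torch.matmul(a, b)
except RuntimeError as e:
    print(e)  # e.g. "expected scalar type Half but found Float"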

TccccD avatar Mar 20 '23 02:03 TccccD

Could it be the GPU? On a V100 I get the half-precision error, but a 3090 works fine. Really strange; I suspect it's a compute-capability issue. Which GPU model are you using?

archwolf118 avatar Mar 20 '23 05:03 archwolf118

Could it be the GPU? On a V100 I get the half-precision error, but a 3090 works fine. Really strange; I suspect it's a compute-capability issue. Which GPU model are you using?

Mine is a V100 too.

TccccD avatar Mar 20 '23 07:03 TccccD

@TccccD It does look like a compute-capability issue: the V100 can't load the large model in int8 for training. You need to change load_in_8bit to False on line 60 of finetune.py.
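
A minimal sketch of what that change looks like around the model-loading call; the model id, trust_remote_code, and the .half().cuda() fallback are assumptions, so the exact code in finetune.py may differ:

from transformers import AutoModel

# Sketch only: load ChatGLM without bitsandbytes int8 quantization so the
# int8 matmul path (problematic on Volta cards like the V100) is never hit.
model = AutoModel.from_pretrained(
    "THUDM/chatglm-6b",        # assumed model id
    trust_remote_code=True,    # ChatGLM ships custom modeling code
    load_in_8bit=False,        # was True; turn it off for the V100
)
model = model.half().cuda()    # fall back to plain fp16 on the GPU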

archwolf118 avatar Mar 20 '23 07:03 archwolf118

@TccccD It does look like a compute-capability issue: the V100 can't load the large model in int8 for training. You need to change load_in_8bit to False on line 60 of finetune.py.

[screenshot of a GPU spec table]

Is it the "INT8 Tensor Cores" field I should be looking at?
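
A related check is the CUDA compute capability the card reports; a small sketch with plain PyTorch (the 7.5 cutoff, i.e. Turing and newer, is an assumption based on where int8 tensor cores were introduced):

import torch

# Query what the installed GPU actually reports.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")

# V100 (Volta) reports 7.0; int8 tensor cores start with Turing (7.5).
if (major, minor) < (7, 5):
    print("No int8 tensor cores; the bitsandbytes int8 path may be slow or unsupported here.")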

TccccD avatar Mar 20 '23 08:03 TccccD

It's the bitsandbytes library that causes it. (https://github.com/TimDettmers/bitsandbytes/issues/100)

archwolf118 avatar Mar 20 '23 08:03 archwolf118

It's the bitsandbytes library that causes it. (TimDettmers/bitsandbytes#100)

The comments there say every GPU is supported? But isn't the latest bitsandbytes release 0.37.1... that's exactly the version I'm on.

TccccD avatar Mar 20 '23 08:03 TccccD

I'm training on a P40, and with batch_size = 1 the loss is also 0. Did you manage to solve it?
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50}
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}
{"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}

Update: with batch_size = 2, the loss at step 50 is non-zero, but it is 0 for every step after that. Feels like a bug.
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 1.6446, "step": 50}
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}
{"epoch": 0.01, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}
{"epoch": 0.01, "learning_rate": 1.9923076923076926e-05, "loss": 0.0, "step": 200}

Adherer avatar Mar 23 '23 12:03 Adherer

I'm training on a P40, and with batch_size = 1 the loss is also 0. Did you manage to solve it?
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50}
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}
{"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}

Update: with batch_size = 2, the loss at step 50 is non-zero, but it is 0 for every step after that. Feels like a bug.
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 1.6446, "step": 50}
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}
{"epoch": 0.01, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}
{"epoch": 0.01, "learning_rate": 1.9923076923076926e-05, "loss": 0.0, "step": 200}

Solved: as soon as fp16 is enabled the loss is normal; with fp16=False the loss stays at 0.
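
For anyone else hitting this, a minimal sketch of where that flag lives in the HuggingFace Trainer arguments; these are standard transformers options, but the repo's own argument parsing may differ, and output_dir here is a placeholder:

from transformers import TrainingArguments

# Sketch: the reports above show loss == 0 with fp16=False and a normal
# loss curve once fp16=True is set.
training_args = TrainingArguments(
    output_dir="output",            # placeholder path
    per_device_train_batch_size=1,
    learning_rate=2e-5,             # matches the schedule in the logs above
    fp16=True,                      # the flag that made the loss non-zero
)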

Adherer avatar Mar 23 '23 14:03 Adherer

Single V100 (32 GB): with fp16 enabled, load_in_8bit=False when loading the model, and batch_size < 3, it runs.
{'loss': 2.6075, 'learning_rate': 9.78e-05, 'epoch': 0.0}
{'loss': 1.9953, 'learning_rate': 9.53e-05, 'epoch': 0.0}
{'loss': 1.9127, 'learning_rate': 9.28e-05, 'epoch': 0.01}
{'loss': 1.8311, 'learning_rate': 9.03e-05, 'epoch': 0.01}
{'loss': 1.7649, 'learning_rate': 8.78e-05, 'epoch': 0.01}

xyzanonymous666 avatar Mar 29 '23 11:03 xyzanonymous666

I'm also on a V100 (16 GB). With fp16 I can't get training to run; with int8 enabled the memory usage comes down, but the loss is just 0. bitsandbytes 0.37.1, and the linked issue does say all GPUs are supported.

chuckhope avatar Mar 30 '23 10:03 chuckhope

I'm training on a P40, and with batch_size = 1 the loss is also 0. Did you manage to solve it?
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 0.0, "step": 50}
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}
{"epoch": 0.0, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}
Update: with batch_size = 2, the loss at step 50 is non-zero, but it is 0 for every step after that. Feels like a bug.
{"epoch": 0.0, "learning_rate": 1.9980769230769233e-05, "loss": 1.6446, "step": 50}
{"epoch": 0.0, "learning_rate": 1.9961538461538464e-05, "loss": 0.0, "step": 100}
{"epoch": 0.01, "learning_rate": 1.9942307692307695e-05, "loss": 0.0, "step": 150}
{"epoch": 0.01, "learning_rate": 1.9923076923076926e-05, "loss": 0.0, "step": 200}

Solved: as soon as fp16 is enabled the loss is normal; with fp16=False the loss stays at 0.

Hi, doesn't the P40 lack support for fp16, though?

yuhp-zts avatar Feb 26 '24 10:02 yuhp-zts