ChatGLM-6B

[BUG/Help] Has anyone tried fine-tuning with a large batch size?

Open white-wolf-tech opened this issue 2 years ago • 1 comment

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

4 GPUs, 1M training samples in total. Parameter settings: per_device_train_batch_size=4, gradient_accumulation_steps=128

That gives an effective batch size of roughly 2048, so the 1M samples can be finished in three to four days.

In practice, though, the loss barely moves; it keeps hovering around 230. Has anyone else run into this?
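For reference, a minimal sketch of how these settings translate into a global batch size under the standard HuggingFace Trainer setup (the output path, epoch count, and GPU count variable are placeholders I've assumed, not taken from the actual run):

```python
from transformers import TrainingArguments

num_gpus = 4  # assumed number of data-parallel processes

args = TrainingArguments(
    output_dir="./output",            # hypothetical output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=128,
    num_train_epochs=1,               # placeholder value
)

# Effective (global) batch size the optimizer sees per update step:
effective_batch = (args.per_device_train_batch_size
                   * args.gradient_accumulation_steps
                   * num_gpus)
print(effective_batch)  # 4 * 128 * 4 = 2048
```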

Expected Behavior

No response

Steps To Reproduce

no

Environment

- OS:
- Python:3.9.16
- Transformers:4.29.dev0
- PyTorch:2.0.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :11.8

Anything else?

No response

white-wolf-tech avatar Apr 20 '23 03:04 white-wolf-tech

per_device_train_batch_size is too small; try raising it to 100. How much VRAM does each of your cards have?
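If it helps, a quick generic way to check per-card memory with PyTorch (not tied to the training script, just a sketch):

```python
import torch

# Print the name and total memory of each visible GPU.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```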

Dragonkingpan avatar Apr 23 '23 09:04 Dragonkingpan

per_device_train_batch_size is too small; try raising it to 100. How much VRAM does each of your cards have?

The cards can't handle it; they're A10s.... They can't fit a batch of 100. With FP16 the max is 4. I've stacked up 8 cards now and will just run it distributed.
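For what it's worth, a sketch of keeping the same global batch of 2048 when moving to 8 cards with FP16: since the device count doubles, gradient_accumulation_steps can be halved. The exact values here are assumptions for illustration, not from the actual run.

```python
from transformers import TrainingArguments

num_gpus = 8  # assumed 8-way data parallelism (e.g. launched via torchrun)

args = TrainingArguments(
    output_dir="./output",            # hypothetical output path
    per_device_train_batch_size=4,    # the A10 FP16 limit mentioned above
    gradient_accumulation_steps=64,   # halved so the global batch stays at 2048
    fp16=True,                        # mixed-precision training
)

print(args.per_device_train_batch_size
      * args.gradient_accumulation_steps
      * num_gpus)  # 4 * 64 * 8 = 2048
```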

white-wolf-tech avatar Apr 24 '23 10:04 white-wolf-tech