ChatGLM-6B
[BUG/Help] Has anyone tried fine-tuning with a large batch size?
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
4 GPUs, about 1M training samples in total. Settings: per_device_train_batch_size=4, gradient_accumulation_steps=128
That gives an effective batch size of roughly 2048, and one pass over the 1M samples takes about three to four days.
In practice, though, the loss won't go down; it just hovers around 230. Has anyone else run into this?
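For reference, a minimal sketch (plain Python; the variable names are mine, not from the training script) of how the reported settings combine into the effective batch size and how many optimizer steps one pass over the data takes:

```python
# Effective (global) batch size under the reported settings.
num_gpus = 4
per_device_train_batch_size = 4
gradient_accumulation_steps = 128

effective_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 4 * 4 * 128 = 2048

# With ~1,000,000 samples, one epoch is only about 488 optimizer updates,
# which may be too few steps for the loss to move much.
steps_per_epoch = 1_000_000 // effective_batch_size
print(steps_per_epoch)  # 488
```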
Expected Behavior
No response
Steps To Reproduce
no
Environment
- OS:
- Python:3.9.16
- Transformers:4.29.dev0
- PyTorch:2.0.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :11.8
Anything else?
No response
per_device_train_batch_size is too small; try raising it to around 100. How much VRAM does each of your GPUs have?
The GPUs can't handle it; they're A10s... With FP16 the most I can fit is 4, nowhere near 100. I've now stacked 8 cards and will just run it distributed.
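If the goal when moving to 8 GPUs is to keep the same effective batch of 2048 while the per-device batch stays at 4 (the A10/FP16 limit reported above), one option is to halve the accumulation steps; a small sketch of that arithmetic (the 2048 target comes from the numbers in this thread, not from the repo's defaults):

```python
# Keep the effective batch size fixed at 2048 when scaling from 4 to 8 GPUs.
target_effective_batch = 2048
num_gpus = 8
per_device_train_batch_size = 4  # A10 / FP16 memory limit reported above

gradient_accumulation_steps = target_effective_batch // (num_gpus * per_device_train_batch_size)
print(gradient_accumulation_steps)  # 2048 // 32 = 64
```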