Baichuan-7B [Question] Pre-training time and pre-training data

Required prerequisites

[X] I have read the documentation https://github.com/baichuan-inc/baichuan-7B/blob/HEAD/README.md.
[X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
[X] Consider asking first in a Discussion.

Questions

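How long did this model take to train on the thousand-GPU cluster?
The README mentions pre-training on roughly 1.2T tokens. What is the language distribution of that data? Thanks for your reply!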

Checklist

[X] I have provided all relevant and necessary information above.
[X] I have chosen a suitable title for this issue.

Jun 16 '23 05:06 coye01

A rough estimate: for a 7B model, 1.2 trillion tokens, 1,000 A800 GPUs, and 0.58 utilization, training for one epoch takes about 4 days.
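
For anyone who wants to reproduce this kind of estimate, here is a minimal sketch. It is not necessarily how the figure above was derived. It assumes the common approximation that training costs about 6 FLOPs per parameter per token, and a dense BF16 peak of 312 TFLOPS per A800. The function name is made up for illustration.

```python
# Back-of-the-envelope training-time estimate.
# Assumptions (not from the thread): total training FLOPs ~= 6 * params * tokens,
# and a dense BF16 peak of 312 TFLOPS per A800.

SECONDS_PER_DAY = 86_400


def estimate_training_days(params, tokens, num_gpus, peak_flops_per_gpu, utilization):
    """Return the estimated wall-clock days for one pass over `tokens`."""
    total_flops = 6 * params * tokens                       # forward + backward
    sustained_flops_per_second = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_second / SECONDS_PER_DAY


days = estimate_training_days(
    params=7e9,
    tokens=1.2e12,
    num_gpus=1000,
    peak_flops_per_gpu=312e12,
    utilization=0.58,
)
print(f"{days:.1f} days")
```

Under these assumptions the result is a little over 3 days, which is in the same ballpark as the "about 4 days" figure. The gap could come from the exact parameter count, checkpointing and restart overhead, or a different FLOPs formula.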

Jun 16 '23 09:06 formath

From the config it looks like pure data parallel. Was tensor parallel not enabled?

Jun 19 '23 06:06 miraclezqc

My guess is that both tensor and pipeline parallelism were enabled; otherwise it would be hard to reach 0.58 utilization.

Jun 19 '23 06:06 formath

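Pipeline parallelism shouldn't be necessary for 7B, and if TP is enabled it is probably only 2. The seq length is 4096, so assuming a global batch size of 4M tokens, the micro batch size and gradient accumulation steps would both be 1. With a thousand GPUs that means pure DP, unless there were actually 2,000 GPUs.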
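
The batch-size reasoning above can be checked with a few lines of arithmetic. This is only a sketch: it reads "4M" as 4 * 2^20 tokens, and the helper names are hypothetical.

```python
# How many GPUs does a 4M-token global batch imply at seq length 4096?
# Assumption: "4M" means 4 * 2**20 tokens; helper names are illustrative only.


def sequences_per_step(global_batch_tokens, seq_length):
    """Number of sequences consumed in one optimizer step."""
    return global_batch_tokens // seq_length


def gpus_needed(seqs_per_step, micro_batch_size, grad_accum_steps, tensor_parallel):
    """GPUs required when each data-parallel replica spans `tensor_parallel` GPUs."""
    per_replica = micro_batch_size * grad_accum_steps
    replicas, remainder = divmod(seqs_per_step, per_replica)
    if remainder:
        raise ValueError("global batch is not divisible by the per-replica batch")
    return replicas * tensor_parallel


seqs = sequences_per_step(4 * 2**20, 4096)
pure_dp = gpus_needed(seqs, micro_batch_size=1, grad_accum_steps=1, tensor_parallel=1)
with_tp2 = gpus_needed(seqs, micro_batch_size=1, grad_accum_steps=1, tensor_parallel=2)
print(seqs, pure_dp, with_tp2)
```

With micro batch size 1 and no gradient accumulation, one step needs 1024 replicas. That fits on roughly a thousand GPUs only with pure DP, while TP=2 would need about 2,000.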

Jun 19 '23 06:06 miraclezqc

On 80 GB GPUs, 7B does not need TP. Setting sharding=8 is enough, with pure DP across nodes. That should give the best training speed and throughput.
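
A rough memory estimate shows why sharding across 8 GPUs could be enough. This is a sketch under stated assumptions, not the project's actual configuration: it assumes mixed-precision Adam at about 16 bytes of model state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights, momentum, and variance), and it assumes weights, gradients, and optimizer states are all sharded. The thread does not say what "sharding=8" actually shards, and activations are ignored here.

```python
# Per-GPU model-state memory for a 7B model, with and without 8-way sharding.
# Assumptions (not from the thread): mixed-precision Adam, ~16 bytes per parameter,
# and all model states sharded evenly. Activation memory is not included.

BYTES_PER_PARAM = 2 + 2 + 12  # fp16 weights + fp16 grads + fp32 optimizer states
GIB = 2**30


def model_state_gib(params, shard_degree):
    """Model-state memory per GPU in GiB when split evenly across `shard_degree` GPUs."""
    return params * BYTES_PER_PARAM / shard_degree / GIB


unsharded = model_state_gib(7e9, shard_degree=1)
sharded = model_state_gib(7e9, shard_degree=8)
print(f"unsharded: {unsharded:.1f} GiB, sharded x8: {sharded:.1f} GiB")
```

Under these assumptions the unsharded model states are over 100 GiB and would not fit on one 80 GB card, while 8-way sharding brings them to about 13 GiB per GPU, leaving the rest for activations.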

Jun 21 '23 03:06 Luoyingfeng8

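I'd like to ask: this code loads all the data into memory at once. If the dataset is very large, say 1.4T tokens or roughly 5T of data, won't it be too big to fit in memory?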

Aug 14 '23 14:08 mynewstart
