PatrickStar
PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP and democratizes AI for everyone.
The installation instructions in the README seem wrong. Could you clarify the correct way to install? Thanks
Is there a way to keep layers such as BatchNorm2d in float32 during training? Using half may make the loss converge poorly.
Currently, training starts whether or not the configs of the 2 nodes are the same. This may cause weird results during benchmarking. We should consider communicating the config...
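One way to catch mismatched configs before a benchmark starts is to have every node compute a digest of its config and compare the digests. The following is a minimal sketch, not PatrickStar's implementation: `config_digest` and `assert_configs_match` are hypothetical helper names, the config is assumed to be a JSON-serializable dict, the keys shown are illustrative only, and the digests are assumed to have already been collected from all nodes (for example with `torch.distributed.all_gather_object`).

```python
import hashlib
import json


def config_digest(config: dict) -> str:
    """Return a SHA-256 digest of a JSON-serializable config dict.

    Keys are sorted so that two dicts with the same content but a
    different insertion order produce the same digest.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_configs_match(digests: list) -> None:
    """Raise ValueError if the per-node digests are not all identical.

    `digests` is assumed to hold one digest per node, in rank order.
    """
    mismatched = [rank for rank, d in enumerate(digests) if d != digests[0]]
    if mismatched:
        raise ValueError(f"config differs from rank 0 on ranks: {mismatched}")


# Hypothetical configs; the keys are illustrative, not real PatrickStar options.
node0 = {"chunk_size": 1024, "fp16": True}
node1 = {"fp16": True, "chunk_size": 1024}   # same content, different key order
node2 = {"chunk_size": 2048, "fp16": True}   # different chunk_size

# Passes silently because node0 and node1 carry the same settings.
assert_configs_match([config_digest(node0), config_digest(node1)])
```

Comparing digests rather than full configs keeps the exchanged payload small. Passing the digests of `node0` and `node2` instead would raise a `ValueError` naming rank 1.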
Hi! I am a newbie in this field. DeepSpeed provides a tutorial on GAN (https://www.deepspeed.ai/tutorials/gan/). I am curious about PatrickStar's performance in models like GANs or other CV models. I...
Add CI
We would like to have a CI that runs unit tests each time an MR is proposed to the develop and master branches. However, we currently have no idea how to find a...
The current profiler is messy and we have to reorganize this code. The goal is a memory and speed profiler for both PatrickStar and PyTorch.
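TencentPretrain is a repo from the TEG Data Security Center; we can make use of its model architectures and data: https://git.woa.com/TencentNLP/TencentPretrain/merge_requests/61 TencentPretrain also has an unofficial open-source counterpart: https://github.com/dbiir/UER-py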
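The MP (model parallelism) trend was brought into PTM training by Megatron-LM, which splits the model by inserting custom collective communication operations into the transformer implementation. Model parallelism has drawn several criticisms:

1. Both FWD and BWD involve heavy global communication of activations, and the communication volume is proportional to the batch size. The volume is not only larger than DP's; it also limits the batch size and therefore the computational load of MP training, which hurts compute performance (the larger the batch, the higher the compute efficiency).
2. MP requires custom modifications to the model definition code. This is why the DeepSpeed examples are also built on top of Megatron-LM. Some work tries to simplify these modifications, such as Mesh-TensorFlow and Alibaba's [Whale](https://arxiv.org/pdf/2011.09208.pdf); PyTorch does not seem to have comparable work. If the goal is only to chase benchmark numbers, this is not a big problem. From a usability perspective, algorithm engineers will not accept it, because the inference-side code still has to convert the custom parallel operators back into serial PyTorch ones.
3. With HP (heterogeneous parallelism), MP, PP, and DP used in combination, MP's role has become very limited and it appears to be on its way to being replaced. DeepSpeed places MP within a node and uses PP and DP across nodes. The introduction of HP+DP breaks through the GPU memory wall further, and the main advantage of model parallelism is being taken over by HP and ZeroDP. It is not even certain that MP will continue to be used within a node.

**MP and PatrickStar**

In PatrickStar, peak GPU memory consumption is related to the chunk size. Even without heterogeneous memory space, with all chunks placed on the GPU, the model data size is 1/N of the original, which is similar to MP's consumption. PatrickStar only needs to be compatible with PP; it does not need to be compatible with MP.

Zero-Offload previously went out of its way to be compatible with MP, which is odd. From reading the code, I think the reason is that Zero3's communication uses a very poor design: it has to temporarily allocate a buffer of size world_size*tensor_numel on the GPU, and with prefetching in place, several such buffers may be allocated at the same time. For layers with large parameters in particular, such as the embedding layer, this can blow up memory, so MP is needed to reduce the per-process tensor_numel.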
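To make the buffer concern above concrete, here is a back-of-the-envelope sketch that takes the world_size*tensor_numel formula at face value. The embedding shape, the fp16 element size, the number of buffers alive at once, and the MP degree are all assumed values for illustration, and `temp_buffer_bytes` is a hypothetical helper, not a DeepSpeed or PatrickStar API.

```python
def temp_buffer_bytes(world_size: int, tensor_numel: int,
                      bytes_per_elem: int = 2, num_buffers: int = 1) -> int:
    """Bytes of temporary GPU buffer under the world_size * tensor_numel formula.

    num_buffers models several buffers being alive at once due to prefetching.
    """
    return world_size * tensor_numel * bytes_per_elem * num_buffers


# Assumed example: an embedding layer with a 50,000-token vocabulary and
# hidden size 4096, stored in fp16 (2 bytes per element), on 8 processes.
vocab_size, hidden_size = 50_000, 4096
world_size = 8
embedding_numel = vocab_size * hidden_size

one_buffer = temp_buffer_bytes(world_size, embedding_numel)
two_buffers = temp_buffer_bytes(world_size, embedding_numel, num_buffers=2)

# Assumed MP degree of 4: each process holds 1/4 of the embedding, so the
# per-process tensor_numel shrinks by that factor. world_size is held fixed
# here to isolate the effect of the smaller tensor_numel.
mp_degree = 4
with_mp = temp_buffer_bytes(world_size, embedding_numel // mp_degree)

GIB = 1024 ** 3
print(f"one buffer:       {one_buffer / GIB:.2f} GiB")
print(f"two buffers:      {two_buffers / GIB:.2f} GiB")
print(f"one buffer, MP=4: {with_mp / GIB:.2f} GiB")
```

Under these assumptions a single temporary buffer for the embedding layer already takes several GiB, and prefetching multiplies that. Shrinking the per-process tensor_numel through MP shrinks the buffer proportionally.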