Jiarui Fang (方佳瑞)

Results: 63 issues by Jiarui Fang (方佳瑞)

### Describe the feature

In the file `colossalai/utils/activation_checkpoint.py`:

```python
def checkpoint(function, activation_offload, *args):
    """Checkpoint the computation while preserve the rng states, modified from Pytorch torch.utils.checkpoint.

    Args:
        function: Describe the forward...
```

enhancement
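For context, a minimal sketch of the activation checkpointing pattern this interface extends, using plain `torch.utils.checkpoint`; the `activation_offload` flag discussed above is ColossalAI-specific and is not reproduced here, and the module and shapes are made up for illustration:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Toy feed-forward block standing in for a real transformer layer."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

block = Block()
x = torch.randn(8, 256, requires_grad=True)

# Drop the block's intermediate activations after the forward pass and
# recompute them during backward; RNG states are saved and restored so
# dropout-style ops replay identically.
y = checkpoint(block, x)
y.sum().backward()
```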

### Describe the feature

I suggest adding CPU/Hybrid Adafactor support for heterogeneous training. Adafactor consumes less memory than Adam; it was proposed in the paper `Adafactor: Adaptive Learning Rates with Sublinear Memory...`

enhancement
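To make the memory argument concrete, here is a simplified sketch of Adafactor's factored second moment for a 2-D weight, following the cited paper: it keeps O(m + n) statistics instead of Adam's O(m·n) buffer. The function and variable names are illustrative only, not a proposed API:

```python
import torch

def factored_second_moment(grad, row_acc, col_acc, beta2=0.999, eps=1e-30):
    """One Adafactor-style update of the factored second-moment statistics.

    For a grad of shape (m, n), row_acc has shape (m,) and col_acc has shape
    (n,), so the optimizer state is O(m + n) rather than Adam's O(m * n).
    """
    sq = grad.pow(2) + eps
    row_acc.mul_(beta2).add_(sq.mean(dim=1), alpha=1 - beta2)
    col_acc.mul_(beta2).add_(sq.mean(dim=0), alpha=1 - beta2)
    # Rank-1 reconstruction of the full second moment used to scale the step.
    v_hat = torch.outer(row_acc, col_acc) / row_acc.mean().clamp_min(eps)
    return grad / v_hat.sqrt().clamp_min(eps)

w = torch.randn(1024, 1024)
row, col = torch.zeros(w.shape[0]), torch.zeros(w.shape[1])
scaled_update = factored_second_moment(torch.randn_like(w), row, col)
```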

Hello, I noticed you fixed the lazy label bug and getting-started.py is able to run. But it cannot pass the assertion. The grad diff is quite large! assert...
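The failing check is presumably of this shape; the tensor names and tolerances below are placeholders, not the script's actual values:

```python
import torch

def assert_grad_close(grad, grad_ref, rtol=1e-3, atol=1e-3):
    """Fail loudly when the parallel run's gradient drifts from the reference."""
    max_diff = (grad - grad_ref).abs().max().item()
    assert torch.allclose(grad, grad_ref, rtol=rtol, atol=atol), \
        f"grad diff is quite large: max abs diff = {max_diff:.3e}"
```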

We would like to have CI that runs the unit tests each time an MR is proposed against the develop and master branches. However, we currently have no idea how to find a...

help wanted

The current profiler is messy and we have to reorganize this code. We need memory and speed profilers for both PatrickStar and PyTorch.

enhancement
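As a starting point, a minimal sketch of a combined speed/peak-memory probe for the PyTorch side; PatrickStar would need its own hooks, and all names here are illustrative rather than the existing profiler's API:

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def profile_region(name: str, use_cuda: bool = True):
    """Print wall-clock time and peak GPU memory for the wrapped region."""
    if use_cuda:
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
    start = time.time()
    try:
        yield
    finally:
        if use_cuda:
            torch.cuda.synchronize()
            peak_mb = torch.cuda.max_memory_allocated() / 2**20
        else:
            peak_mb = float("nan")
        print(f"[{name}] {time.time() - start:.3f}s elapsed, "
              f"{peak_mb:.1f} MB peak GPU memory")

# Usage (model/batch are whatever the training loop already has):
# with profile_region("fwd+bwd"):
#     loss = model(batch).sum()
#     loss.backward()
```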

TencentPretrain is a repo from the TEG Data Security Center; we can make use of its model structures and data: https://git.woa.com/TencentNLP/TencentPretrain/merge_requests/61. TencentPretrain also has a public open-source counterpart: https://github.com/dbiir/UER-py

documentation

The MP (model parallelism) trend was brought into PTM training by Megatron-LM, which splits the model by inserting custom collective communication operations into the transformer implementation. Model parallelism has several well-known drawbacks:

1. Both FWD and BWD involve heavy global communication of activations, and the communication volume is proportional to the batch size. Not only is the traffic larger than DP's, it also limits the batch size and therefore the computational workload of MP training, hurting compute performance (larger batches compute more efficiently).
2. MP requires customized modifications to the model definition code, which is why DeepSpeed's examples are also built on top of Megatron-LM. Some works try to simplify this modification, e.g. Mesh-TensorFlow and Alibaba's [Whale](https://arxiv.org/pdf/2011.09208.pdf); PyTorch does not seem to have comparable work. If the goal is only benchmark performance, this is not a big deal. From a usability perspective, however, algorithm engineers will not accept it, because the inference-side code still has to convert the custom parallel operators back into serial PyTorch.
3. Under combinations of HP (heterogeneous parallelism), MP, PP, and DP, MP's role has become very limited and it is on the way to being replaced. DeepSpeed places MP inside a node and uses PP and DP across nodes. The introduction of HP + DP pushes the GPU memory wall back even further, and MP's main advantage is being taken over by HP and ZeroDP; it is not even certain that MP will continue to be used within a node.

**MP and PatrickStar**

In PatrickStar, the peak GPU memory consumption is determined by the chunk size. Even without using heterogeneous storage, i.e. keeping all chunks on the GPU, the model data per process is already 1/N of the original, similar to MP's consumption. It is enough for PatrickStar to be compatible with PP; compatibility with MP is not needed.

It is strange that Zero-Offload went out of its way to stay compatible with MP. Reading the code, I think the reason is that Zero3's communication uses a very poor design: it needs to temporarily allocate a GPU buffer of size world_size * tensor_numel, and with prefetching several such buffers may be allocated at the same time. For large-parameter layers such as the embedding layer this can blow up memory, so MP is needed to reduce the per-process tensor_numel, as the back-of-envelope sketch below illustrates.

documentation
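A back-of-envelope sketch of the memory comparison above, contrasting the temporary all-gather buffer described for Zero3 with chunk-based residency; the sizes and helper names are illustrative, not measurements of either system:

```python
def zero3_temp_buffer_bytes(tensor_numel: int, world_size: int, elem_size: int = 2) -> int:
    # Per the description above, the gather allocates a temporary GPU buffer
    # of world_size * tensor_numel elements (several may coexist with prefetch).
    return world_size * tensor_numel * elem_size

def chunk_resident_bytes(total_model_numel: int, world_size: int,
                         chunk_numel: int, elem_size: int = 2) -> int:
    # Chunk-based placement: each rank keeps ~1/N of the model data,
    # rounded up to whole chunks, regardless of the largest single tensor.
    shard_numel = total_model_numel // world_size
    n_chunks = -(-shard_numel // chunk_numel)  # ceiling division
    return n_chunks * chunk_numel * elem_size

# GPT-2-like embedding (50257 x 1600) in fp16 on 8 GPUs.
emb_numel = 50_257 * 1600
print(zero3_temp_buffer_bytes(emb_numel, 8) / 2**20, "MB of temporary gather buffer")

# A 1.5B-parameter model with 64M-element chunks, fp16, 8 GPUs.
print(chunk_resident_bytes(1_500_000_000, 8, 64 * 2**20) / 2**30, "GB of model data per rank")
```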

### 🐛 Describe the bug

Just running `examples/language/opt/run_clm.py` reproduces the error: the program crashes with no error information. After I replace placement_policy with 'cuda', it is OK.

```
...
```
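For reference, a rough sketch of the kind of decision an adaptive placement policy has to make per chunk, as opposed to 'cuda', which pins everything on the GPU; this is purely illustrative and not the library's implementation:

```python
import torch

def choose_device(chunk_bytes: int, policy: str, gpu_margin_bytes: int = 2 * 2**30) -> str:
    """Pick where to keep a chunk under a given placement policy (illustrative)."""
    if policy == "cuda":
        return "cuda"   # pin everything on the GPU, as in the workaround above
    if policy == "cpu":
        return "cpu"
    # Adaptive policy: keep the chunk on the GPU only while enough headroom remains.
    free_bytes, _total = torch.cuda.mem_get_info()
    return "cuda" if free_bytes - chunk_bytes > gpu_margin_bytes else "cpu"
```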

### 🐛 Describe the bug

I hit overflow using the official scripts for GPT2. Is that a normal case?

```
cd XXX/ColossalAI/examples/language/gpt
export DATA=/data/scratch/gpt_data/small-gpt-dataset.json
torchrun --standalone --nproc_per_node=1 train_gpt.py --config=gpt2_configs/gpt2_zero3.py --from_torch...
```
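For context, a generic sketch of the dynamic loss scaling that typically produces such overflow messages in fp16 training: a few skipped steps while the scale calibrates at start-up are normal, persistent overflow is not. This is an illustration of the general technique, not ColossalAI's implementation:

```python
import torch

class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after stable steps."""
    def __init__(self, init_scale=2.0 ** 16, growth_interval=1000, backoff=0.5):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.backoff = backoff
        self.good_steps = 0

    def found_overflow(self, params) -> bool:
        # Overflow means at least one gradient contains inf/NaN after unscaling.
        return any(p.grad is not None and not torch.isfinite(p.grad).all()
                   for p in params)

    def update(self, overflowed: bool) -> None:
        if overflowed:
            self.scale *= self.backoff   # shrink the scale; the step is skipped
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.growth_interval == 0:
                self.scale *= 2.0        # cautiously grow the scale back
```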