Xiao
> I tried the command `torchrun --nproc_per_node=4 train.py --synthetic 2>&1 | tee run.log` to run [train.py](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/sequence_parallel/train.py), but it doesn't work.
> ### 🐛 Describe the bug
> I try to run a config using [train_gpt.py](https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/train_gpt.py). I add a model in [gpt.py](https://github.com/hpcaitech/Titans/blob/main/titans/model/gpt/gpt.py).
>
> ...
my config is below.

```
from colossalai.amp import AMP_TYPE
from titans.loss.lm_loss import GPTLMLoss
from titans.model.gpt import gpt2_1_3B, gpt2_test4gpu350M
from torch.optim import Adam

BATCH_SIZE = 4
SEQ_LEN = 2048
# here the...
```
> If I want to add parallelism and sequence parallelism (SP) to the GPT model, how should I run the code? I am confused by the different code and different documentation.
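For what it's worth, a minimal sketch of a ColossalAI config that enables sequence parallelism (with pipeline parallelism left at a single stage) might look like the following. This is an assumption based on the sequence_parallel tutorial's conventions, not a config taken from this thread; the `parallel` keys, the `size=4` value, and `mode='sequence'` are all illustrative.

```
# Hypothetical ColossalAI config sketch (assumed keys, not from the thread).
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 4
SEQ_LEN = 2048
NUM_EPOCHS = 1

# Assumption: 4 GPUs in total, all assigned to sequence parallelism,
# with pipeline parallelism effectively disabled (1 stage).
parallel = dict(
    pipeline=1,
    tensor=dict(size=4, mode='sequence'),
)

# Mixed precision, matching the AMP_TYPE import in the quoted config.
fp16 = dict(mode=AMP_TYPE.NAIVE)
```

Increasing `pipeline` above 1 splits the model into that many stages, and the product of `pipeline` and `tensor['size']` has to match the number of GPUs passed to `torchrun`.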
> Any help would be appreciated @tjruwase @stas00

It seems this is OOM. Your memory used is 513497.
> > > Any help would be appreciated @tjruwase @stas00
> >
> > It seems this is OOM. Your memory used is 513497.
>
> Yes, I know....
I do not use DeepSpeed to run the 15B model. I use Alpa to run the 15B model on 32 GPUs.
> My DeepSpeed version is 0.8.1, my torch version is 1.13.1, and my transformers version is transformers==4.21.2. My CPU memory is 500GB.
>
> I follow the [document](https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/text-generation) to...
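For context, the core of that text-generation example is wrapping a Hugging Face model with DeepSpeed's inference engine. A minimal sketch, assuming a small GPT-style checkpoint (the model name, `mp_size`, and prompt below are placeholders, not values from this thread):

```
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Kernel-injection inference: the model is sharded across mp_size GPUs,
# but each GPU must still hold its full shard in device memory.
engine = deepspeed.init_inference(
    model,
    mp_size=1,  # tensor-parallel degree (placeholder)
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = engine.module

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(torch.cuda.current_device())
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```

This path keeps all weights on the GPUs, so it is distinct from ZeRO-based inference, which the next replies discuss.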
> Thanks. Btw, do you use ZeRO to run DeepSpeed inference?
> @lambda7xx, please see the example [bloom-ds-zero-inference.py](https://github.com/huggingface/transformers-bloom-inference/blob/main/bloom-inference-scripts/bloom-ds-zero-inference.py). I use this code to run inference with a BLOOM model, which is a 176B model, on 8 V100-32GB GPUs. The e2e time is 2000s; I think it's...
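Roughly, the ZeRO-inference approach in that script initializes a ZeRO stage 3 engine with parameters offloaded to CPU, so each GPU only materializes the layers it needs on demand. A minimal sketch under those assumptions (the model name and generation arguments are placeholders, not taken from the thread):

```
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must exist before from_pretrained() so weights load straight into ZeRO-3 partitions.
dschf = HfDeepSpeedConfig(ds_config)

model_name = "bigscience/bloom"  # placeholder; any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

inputs = tokenizer("Hello", return_tensors="pt").to(torch.cuda.current_device())
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```

Offloading parameters to CPU this way trades latency for memory, which is one reason the end-to-end time for a 176B model can run into thousands of seconds.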