PaLM-colossalai

Scalable PaLM implementation in PyTorch

9 PaLM-colossalai issues

On a multi-GPU A100 system:

```
$ cat CONFIG_FILE.py
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

parallel = dict(
    tensor=dict(mode="1d", size=4),
...
```
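For context, a complete Colossal-AI-style config of this shape might look roughly like the sketch below. Everything beyond the fields quoted in the issue (the `fp16`, `clip_grad_norm`, and `gradient_accumulation` entries) is an assumption based on common Colossal-AI configs, not taken from the truncated preview.

```python
# Hypothetical CONFIG_FILE.py sketch; fields not shown in the issue are illustrative assumptions.
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

# 1D tensor parallelism across 4 GPUs; remaining ranks are used for data parallelism.
parallel = dict(
    tensor=dict(mode="1d", size=4),
)

# Mixed precision via Colossal-AI's naive AMP wrapper (assumed, not in the issue).
fp16 = dict(mode=AMP_TYPE.NAIVE)

# Optional knobs commonly seen in Colossal-AI configs (assumed).
clip_grad_norm = 1.0
gradient_accumulation = 4
```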

Using MR #41, the launch command is as follows.

```
env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=configs/palm_8b_zero_2p5d_badcase.py
```

It failed after a few iterations. I prefer to attribute the...

![PaLM error screenshot 1](https://user-images.githubusercontent.com/81227322/233546210-b182b1f6-43ec-45e4-80b2-6bfd32d60a36.png) ![PaLM error screenshot 2](https://user-images.githubusercontent.com/81227322/233546221-cf7f4c7b-7a49-4ec0-ab8a-baa51f92aa43.png) Above is the program's run log; it reports torch.distributed.elastic.multiprocessing.errors.ChildFailedError. Does anyone know why this happens? Thanks!
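ChildFailedError is usually just the elastic launcher's wrapper around whatever exception killed a worker rank; the actual cause is the per-rank traceback. One way to surface it (a sketch, not something shown in the issue or necessarily how this repo's train.py is written) is to decorate the training entrypoint with PyTorch's `record` helper so the failing rank's traceback is included in the ChildFailedError report:

```python
# Hypothetical train.py entrypoint sketch; the decorator is the real
# torch.distributed.elastic API, the body is illustrative.
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    # ... initialize colossalai, build the model, run the training loop ...
    raise RuntimeError("example failure so the traceback is recorded by torchrun")


if __name__ == "__main__":
    main()
```

With this in place, the ChildFailedError summary printed by torchrun includes the root-cause traceback instead of only the exit code.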

Why does this problem happen? Is it the wrong version of torch? ![image](https://user-images.githubusercontent.com/81227322/233245566-991a8f69-f571-4164-8209-977112253098.png)
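To rule out a torch/CUDA mismatch, a quick environment check like the following (a generic sketch, not taken from the issue) prints the versions that matter before launching training:

```python
# Minimal environment sanity check (illustrative).
import torch

print("torch:", torch.__version__)             # installed PyTorch version
print("built for CUDA:", torch.version.cuda)   # CUDA version torch was compiled against
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())

try:
    import colossalai
    print("colossalai:", colossalai.__version__)
except ImportError:
    print("colossalai is not installed")
```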

Or do I need to use PaLM_PyTorch by lucidrains to run it efficiently?