PaLM-colossalai
Scalable PaLM implementation in PyTorch
On a multi-GPU A100 system:

```
$ cat CONFIG_FILE.py
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

parallel = dict(
    tensor=dict(mode="1d", size=4),
    ...
```
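The config above is cut off; for reference, below is a minimal sketch of what a full Colossal-AI config of this shape might look like. The `pipeline`, `fp16`, and `clip_grad_norm` entries are assumptions for illustration, not the actual contents of CONFIG_FILE.py.

```python
# Hypothetical, fuller version of a Colossal-AI training config.
# The values beyond those shown in the issue are assumptions.
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

# Split each weight matrix across 4 GPUs with 1D tensor parallelism;
# no pipeline parallelism in this sketch (assumption).
parallel = dict(
    pipeline=1,
    tensor=dict(mode="1d", size=4),
)

# Mixed precision via Colossal-AI's naive AMP implementation (assumption).
fp16 = dict(mode=AMP_TYPE.NAIVE)

# Gradient clipping threshold (assumption).
clip_grad_norm = 1.0
```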
Using MR #41, the launching script is as follows.

```
env OMP_NUM_THREADS=12 torchrun --standalone --nproc_per_node=4 train.py --from_torch --config=configs/palm_8b_zero_2p5d_badcase.py
```

It failed after a few iterations. I prefer to attribute the...
Above is the program's output log; it says `torch.distributed.elastic.multiprocessing.errors.ChildFailedError`. Does anybody know why this happens? Thanks!
Why did this problem happen? Is it the wrong version of torch?
Or do I need to use PaLM_PyTorch by lucidrains to run it efficiently?
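For context, here is a minimal distributed smoke test (a hypothetical standalone script, not part of PaLM-colossalai) that can be launched with the same torchrun flags to check whether basic NCCL communication across the 4 GPUs works at all. If this also fails, the ChildFailedError probably points at the environment (driver, NCCL, torch install) rather than the model code.

```python
# smoke_test.py - hypothetical check, not part of the repo.
# Launch with: torchrun --standalone --nproc_per_node=4 smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each worker,
    # so the default env:// init works here.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One all-reduce across all ranks; if NCCL or the driver setup is broken,
    # this is typically where it fails.
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```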