
FMS-FSDP running on A100 8GPU machine failed with NCCL error messages

Open • htang2012 opened this issue on Jun 11, 2024 • 0 comments

Environment: A100 8-GPU machine with NVLink connections; Docker image: nvcr.io/nvidia/pytorch:23.12-py3.
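The docker run command used to start the container is not included in the report. Purely for context, a typical interactive launch of this image might look like the lines below; every flag here is an illustrative assumption rather than something taken from the report (--shm-size is worth noting, since the error at the end of the log concerns a shared memory segment in /dev/shm):

# Hypothetical container launch; flags are assumptions, not from the original report
docker run --gpus all --rm -it \
  --shm-size=16g \
  -v /root:/root \
  nvcr.io/nvidia/pytorch:23.12-py3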

git clone https://github.com/foundation-model-stack/fms-fsdp.git
git clone https://github.com/foundation-model-stack/foundation-model-stack.git
git clone https://github.com/huggingface/optimum-nvidia.git

cd foundation-model-stack
pip install -e .
cd ../fms-fsdp/
pip install -r requirements.txt
cd ../optimum-nvidia
pip install -e .
cd ../fms-fsdp

export datastore_path=/root

export MODEL_ARGS="--use_dummy_dataset=True \
--ckpt_load_path=$datastore_path/pretrain/ckpt \
--ckpt_save_path=$datastore_path/pretrain/ckpt \
--fsdp_activation_checkpointing=False \
--selective_checkpointing=1 \
--low_cpu_fsdp=False \
--batch_size=1 \
--report_interval=200 \
--checkpoint_interval=20000 \
--use_torch_compile=False \
--use_profiler=False \
--model_variant=llama2_7b"

python -m torch.distributed.launch --nproc_per_node=8 main_training.py ${MODEL_ARGS}
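As a side note, torch.distributed.launch is deprecated in the PyTorch versions shipped with recent NGC containers; an equivalent launch with torchrun (not part of the original report) would be:

# Equivalent launch with torchrun; shown for reference only
torchrun --nproc_per_node=8 main_training.py ${MODEL_ARGS}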

No valid checkpoint detected at /root/pretrain/ckpt/checkpoints/, starting from scratch.
Training for 1000000 steps
[rank5]: Traceback (most recent call last):
[rank5]:   File "/workspace/fms-fsdp/main_training.py", line 164, in <module>
[rank5]:     fire.Fire(main)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
[rank5]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
[rank5]:     component, remaining_args = _CallAndUpdateTrace(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
[rank5]:     component = fn(*varargs, **kwargs)
[rank5]:   File "/workspace/fms-fsdp/main_training.py", line 145, in main
[rank5]:     train(
[rank5]:   File "/workspace/fms-fsdp/fms_fsdp/utils/train_utils.py", line 92, in train
[rank5]:     loss.backward()
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 525, in backward
[rank5]:     torch.autograd.backward(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 267, in backward
[rank5]:     _engine_run_backward(
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank5]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank5]:     return func(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 749, in _post_backward_hook
[rank5]:     _reduce_grad(state, handle)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 855, in _reduce_grad
[rank5]:     dist.all_reduce(new_sharded_grad, group=state._inter_node_pg)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank5]:     return func(*args, **kwargs)
[rank5]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank5]:     work = group.allreduce([tensor], opts)
[rank5]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank5]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank5]: Last error:
[rank5]: Error while creating shared memory segment /dev/shm/nccl-pamwAh (size 5767520)
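The traceback ends with NCCL failing to create a shared memory segment under /dev/shm on rank 5. As the error message itself suggests, re-running with NCCL debug logging enabled, and checking how much space the container actually has in /dev/shm, would narrow this down; a minimal sketch of those two checks (assuming the same shell session and MODEL_ARGS as above):

# Check available shared memory inside the container;
# Docker's default /dev/shm is only 64 MB, which is often too small for NCCL
df -h /dev/shm

# Re-run with NCCL debug output, as recommended in the error message
export NCCL_DEBUG=INFO
python -m torch.distributed.launch --nproc_per_node=8 main_training.py ${MODEL_ARGS}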
