An 8-GPU A100 machine with NVLink interconnect;
Docker image: nvcr.io/nvidia/pytorch:23.12-py3;
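The exact container launch command is not recorded here; a typical invocation for this image (an assumption, not necessarily the command actually used) would be:

docker run --gpus all -it -v /root:/root nvcr.io/nvidia/pytorch:23.12-py3 bash

Note that without --shm-size or --ipc=host, Docker caps /dev/shm at 64 MB by default, which is relevant to the NCCL error shown further down. The NVLink topology can be confirmed inside the container with nvidia-smi topo -m.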
git clone https://github.com/foundation-model-stack/fms-fsdp.git
git clone https://github.com/foundation-model-stack/foundation-model-stack.git
git clone https://github.com/huggingface/optimum-nvidia.git
cd foundation-model-stack
pip install -e .
cd ../fms-fsdp/
pip install -r requirements.txt
cd ../optimum-nvidia
pip install -e .
cd ../fms-fsdp
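As an optional sanity check (an addition, not part of the original steps), confirm the PyTorch, CUDA, and NCCL versions inside the container; the NCCL version printed here should match the 2.20.5 reported in the error trace below:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"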
export datastore_path=/root
export MODEL_ARGS="\
--use_dummy_dataset=True
--ckpt_load_path=$datastore_path/pretrain/ckpt
--ckpt_save_path=$datastore_path/pretrain/ckpt
--fsdp_activation_checkpointing=False
--selective_checkpointing=1
--low_cpu_fsdp=False
--batch_size=1
--report_interval=200
--checkpoint_interval=20000
--use_torch_compile=False
--use_profiler=False
--model_variant=llama2_7b
"
python -m torch.distributed.launch --nproc_per_node=8 main_training.py ${MODEL_ARGS}
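Note that python -m torch.distributed.launch is deprecated in recent PyTorch releases; the torchrun equivalent (a suggestion, not the command that produced the log below) would be:

torchrun --nproc_per_node=8 main_training.py ${MODEL_ARGS}

The run starts on 8 ranks, then fails in loss.backward():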
No valid checkpoint detected at /root/pretrain/ckpt/checkpoints/, starting from scratch.
Training for 1000000 steps
[rank5]: Traceback (most recent call last):
[rank5]: File "/workspace/fms-fsdp/main_training.py", line 164, in <module>
[rank5]: fire.Fire(main)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
[rank5]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
[rank5]: component, remaining_args = _CallAndUpdateTrace(
[rank5]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
[rank5]: component = fn(*varargs, **kwargs)
[rank5]: File "/workspace/fms-fsdp/main_training.py", line 145, in main
[rank5]: train(
[rank5]: File "/workspace/fms-fsdp/fms_fsdp/utils/train_utils.py", line 92, in train
[rank5]: loss.backward()
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 525, in backward
[rank5]: torch.autograd.backward(
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 267, in backward
[rank5]: _engine_run_backward(
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank5]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank5]: return func(*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 749, in _post_backward_hook
[rank5]: _reduce_grad(state, handle)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_runtime_utils.py", line 855, in _reduce_grad
[rank5]: dist.all_reduce(new_sharded_grad, group=state._inter_node_pg)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank5]: return func(*args, **kwargs)
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2219, in all_reduce
[rank5]: work = group.allreduce([tensor], opts)
[rank5]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank5]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank5]: Last error:
[rank5]: Error while creating shared memory segment /dev/shm/nccl-pamwAh (size 5767520)
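The last line points at shared memory: NCCL could not create a segment under /dev/shm. Inside a container this usually means /dev/shm is too small (Docker's 64 MB default) or already full once all 8 ranks allocate their segments. A sketch of the usual checks and workarounds, assuming the container can be relaunched:

# get more detail from NCCL, as the error message itself suggests
export NCCL_DEBUG=INFO

# check how much shared memory the container actually has
df -h /dev/shm

# relaunch the container with a larger /dev/shm, e.g.
docker run --gpus all -it --shm-size=8g nvcr.io/nvidia/pytorch:23.12-py3 bash
# or share the host IPC namespace instead
docker run --gpus all -it --ipc=host nvcr.io/nvidia/pytorch:23.12-py3 bash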