How to run FSDP on multiple nodes?
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
The README here shows how to run fairseq in FSDP mode on a single node. I expected that if I ran the same command, but with different node ranks and the total number of nodes, each node would be a replica. However, what this actually does is shard the model over all GPUs on all nodes. What I want instead is for the model to be sharded within one node, with every node doing the same, and then to run data-parallel distributed training across the nodes. How would I go about achieving this?
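For reference, the layout being asked for (shard within a node, replicate across nodes) is what newer PyTorch releases call hybrid sharding. The snippet below is not a fairseq flag; it is only a minimal sketch of the concept, assuming a recent PyTorch whose FSDP wrapper ships ShardingStrategy.HYBRID_SHARD. The toy layer and the torchrun launch line are placeholders of my own:

```python
# Minimal hybrid-sharding sketch (assumption: recent PyTorch; not a fairseq option).
# Launch on each node with e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#       --master_addr=<addr> --master_port=12345 hybrid_fsdp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy module standing in for the language model.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()

# HYBRID_SHARD shards parameters across the GPUs of each node and keeps one
# full replica per node; gradients are all-reduced across nodes (data parallel).
model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)

# One toy training step so the sketch runs end to end.
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(32, 8, 1024, device="cuda")  # (seq, batch, d_model)
loss = model(x).pow(2).mean()
loss.backward()
optim.step()
```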
Code
On node 1:
OMP_NUM_THREADS=20 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=2 --node_rank=0 --master_addr=127.123.1.1 \
    --master_port=12345 \
    fairseq-train data-bin/wikitext-103-roberta-bpe-bin \
    --ddp-backend fully_sharded --fp16 --fp16-init-scale 4 \
    --cpu-offload --checkpoint-activations \
    --task language_modeling --tokens-per-sample 2048 --batch-size 8 \
    --arch transformer_lm_gpt3_13 \
    --optimizer cpu_adam --adam-betas "(0.9,0.98)" \
    --lr 0.0001 --lr-scheduler polynomial_decay --warmup-updates 5 --total-num-update 10 \
    --max-update 10 --no-save --log-format json --log-interval 1
On node 2:
OMP_NUM_THREADS=20 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=2 --node_rank=1 --master_addr=127.123.1.1 \
    --master_port=12345 \
    fairseq-train data-bin/wikitext-103-roberta-bpe-bin \
    --ddp-backend fully_sharded --fp16 --fp16-init-scale 4 \
    --cpu-offload --checkpoint-activations \
    --task language_modeling --tokens-per-sample 2048 --batch-size 8 \
    --arch transformer_lm_gpt3_13 \
    --optimizer cpu_adam --adam-betas "(0.9,0.98)" \
    --lr 0.0001 --lr-scheduler polynomial_decay --warmup-updates 5 --total-num-update 10 \
    --max-update 10 --no-save --log-format json --log-interval 1
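As an aside, with --nnodes=2 --nproc_per_node=8 the launcher creates one global process group of 2 × 8 = 16 ranks, and the fully_sharded backend shards the model over that whole group, which matches the behaviour described in the question. A hypothetical sanity-check script (not part of fairseq), launched the same way as fairseq-train above, would just print that group layout:

```python
# Hypothetical check: print the size of the default process group created by
# torch.distributed.launch; this is the group the model gets sharded over.
import torch.distributed as dist

# torch.distributed.launch exports RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT, so the default env:// initialization works here.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} ranks in the default group")
```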
What have you tried?
I don't know what to try.
What's your environment?
- fairseq Version (e.g., 1.0 or main): Latest
- PyTorch Version (e.g., 1.0): Latest
- OS (e.g., Linux): CentOS 8
- How you installed fairseq (pip, source): from source (local editable install)
- Build command you used (if compiling from source): N/A
- Python version: N/A
- CUDA/cuDNN version: N/A
- GPU models and configuration: 8 V100 gpus, 2 nodes = 16 GPUs
- Any other relevant information: N/A
I believe there is no option for "sharding the model within one node and then doing data-parallel training across nodes."
That said, I would argue that sharding the model over all nodes is the right approach, because that is how you can train a model with several hundred billion parameters.
Sorry for my ignorance, but what would the advantage of data parallelism be here?
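To make the trade-off concrete, here is a rough back-of-the-envelope estimate (my own assumptions: a 13B-parameter model trained with fp16 weights, fp16 gradients, and fp32 Adam state, i.e. roughly 2 + 2 + 4 + 4 + 4 = 16 bytes per parameter as in the ZeRO paper, ignoring activations and buffers):

```python
# Rough per-GPU memory for parameters + gradients + optimizer state of a
# 13B-parameter model under full sharding, comparing 8 GPUs (one node)
# against 16 GPUs (two nodes). All numbers are approximations.
params = 13e9
bytes_per_param = 2 + 2 + 4 + 4 + 4  # fp16 weights + fp16 grads + fp32 master copy + Adam exp_avg + exp_avg_sq

total_gb = params * bytes_per_param / 1e9
for gpus in (8, 16):
    print(f"{gpus} GPUs: ~{total_gb / gpus:.1f} GB per GPU of ~{total_gb:.0f} GB total")
```

This prints roughly 26 GB per GPU when sharding over one node versus 13 GB per GPU over two nodes, which is why sharding over every available GPU is attractive for very large models. Note that --cpu-offload in the commands above moves optimizer state to CPU memory, so the real numbers differ; the calculation is only meant to illustrate the scaling argument, not to settle the per-node-sharding-plus-data-parallel question.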