How to run FSDP on multiple nodes?
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
The README here shows how to run fairseq in FSDP mode on a single node. I expected that if I ran the same command, but with different node ranks and the total number of nodes, each node would be a replica. However, what this actually does is shard the model over all GPUs on all nodes. What I want instead is for the model to be sharded within one node, with every node doing the same, and then to run data-parallel distributed training across the nodes. How would I go about achieving this?
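For reference, the layout being asked for (shard within a node, replicate across nodes) is what newer PyTorch releases call hybrid sharding. The snippet below is not a fairseq flag; it is only a minimal sketch of the concept, assuming a recent PyTorch whose FSDP wrapper ships ShardingStrategy.HYBRID_SHARD. The toy layer and the torchrun launch line are placeholders of my own:

```python
# Minimal hybrid-sharding sketch (assumption: recent PyTorch; not a fairseq option).
# Launch on each node with e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#       --master_addr=<addr> --master_port=12345 hybrid_fsdp_sketch.py
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy module standing in for the language model.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()

# HYBRID_SHARD shards parameters across the GPUs of each node and keeps one
# full replica per node; gradients are all-reduced across nodes (data parallel).
model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)

# One toy training step so the sketch runs end to end.
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(32, 8, 1024, device="cuda")  # (seq, batch, d_model)
loss = model(x).pow(2).mean()
loss.backward()
optim.step()
```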
Code
On node 1:
OMP_NUM_THREADS=20 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=2 --node_rank=0 --master_addr=127.123.1.1 \
    --master_port=12345 \
    fairseq-train data-bin/wikitext-103-roberta-bpe-bin \
    --ddp-backend fully_sharded --fp16 --fp16-init-scale 4 \
    --cpu-offload --checkpoint-activations \
    --task language_modeling --tokens-per-sample 2048 --batch-size 8 \
    --arch transformer_lm_gpt3_13 \
    --optimizer cpu_adam --adam-betas "(0.9,0.98)" \
    --lr 0.0001 --lr-scheduler polynomial_decay --warmup-updates 5 --total-num-update 10 \
    --max-update 10 --no-save --log-format json --log-interval 1
On node 2:
OMP_NUM_THREADS=20 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=2 --node_rank=1 --master_addr=127.123.1.1 \
    --master_port=12345 \
    fairseq-train data-bin/wikitext-103-roberta-bpe-bin \
    --ddp-backend fully_sharded --fp16 --fp16-init-scale 4 \
    --cpu-offload --checkpoint-activations \
    --task language_modeling --tokens-per-sample 2048 --batch-size 8 \
    --arch transformer_lm_gpt3_13 \
    --optimizer cpu_adam --adam-betas "(0.9,0.98)" \
    --lr 0.0001 --lr-scheduler polynomial_decay --warmup-updates 5 --total-num-update 10 \
    --max-update 10 --no-save --log-format json --log-interval 1
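As an aside, with --nnodes=2 --nproc_per_node=8 the launcher creates one global process group of 2 × 8 = 16 ranks, and the fully_sharded backend shards the model over that whole group, which matches the behaviour described in the question. A hypothetical sanity-check script (not part of fairseq), launched the same way as fairseq-train above, would just print that group layout:

```python
# Hypothetical check: print the size of the default process group created by
# torch.distributed.launch; this is the group the model gets sharded over.
import torch.distributed as dist

# torch.distributed.launch exports RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT, so the default env:// initialization works here.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} ranks in the default group")
```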
What have you tried?
I don't know what to try.
What's your environment?
- fairseq Version (e.g., 1.0 or main): Latest
- PyTorch Version (e.g., 1.0): Latest
- OS (e.g., Linux): CentOS 8
- How you installed fairseq (pip, source): from source (local editable install)
- Build command you used (if compiling from source): N/A
- Python version: N/A
- CUDA/cuDNN version: N/A
- GPU models and configuration: 8 V100 gpus, 2 nodes = 16 GPUs
- Any other relevant information: N/A
I believe there is no option for "sharding the model within one node and then doing data-parallel training across nodes."
That said, I would argue that sharding the model over all nodes is the right approach, because that is how you can train a model with several hundred billion parameters.
Sorry for my ignorance, but what would the advantage of data parallelism be here?
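To make the trade-off concrete, here is a rough back-of-the-envelope estimate (my own assumptions: a 13B-parameter model trained with fp16 weights, fp16 gradients, and fp32 Adam state, i.e. roughly 2 + 2 + 4 + 4 + 4 = 16 bytes per parameter as in the ZeRO paper, ignoring activations and buffers):

```python
# Rough per-GPU memory for parameters + gradients + optimizer state of a
# 13B-parameter model under full sharding, comparing 8 GPUs (one node)
# against 16 GPUs (two nodes). All numbers are approximations.
params = 13e9
bytes_per_param = 2 + 2 + 4 + 4 + 4  # fp16 weights + fp16 grads + fp32 master copy + Adam exp_avg + exp_avg_sq

total_gb = params * bytes_per_param / 1e9
for gpus in (8, 16):
    print(f"{gpus} GPUs: ~{total_gb / gpus:.1f} GB per GPU of ~{total_gb:.0f} GB total")
```

This prints roughly 26 GB per GPU when sharding over one node versus 13 GB per GPU over two nodes, which is why sharding over every available GPU is attractive for very large models. Note that --cpu-offload in the commands above moves optimizer state to CPU memory, so the real numbers differ; the calculation is only meant to illustrate the scaling argument, not to settle the per-node-sharding-plus-data-parallel question.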