Example 10.FSDP reports 35b model created instead of 70b
The README recommends these hyperparameters to train a 70b model:
--num_key_value_heads=8
--llama_intermediate_size=28672
--hidden_width=8192
--num_layers=80
--num_heads=64
but the train script reports that it created a 35B model instead:
0: 2024-04-16 11:50:01 I [train.py:155] Creating Model
0: 2024-04-16 11:58:16 I [train.py:162] Created model with total parameters: 34549800960 (34.55 B)
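For reference, a back-of-the-envelope parameter count for a Llama-2-style decoder built from the flags above comes out to roughly 69B, so the reported 34.55B is about half of what these settings should produce. This sketch assumes the standard 32,000-token Llama 2 vocabulary, untied input/output embeddings, grouped-query attention, a SwiGLU MLP and RMSNorm; it is not taken from the repo's train.py:

```python
# Rough parameter count for a Llama-2-style decoder with the README's 70B settings.
hidden = 8192            # --hidden_width
layers = 80              # --num_layers
heads = 64               # --num_heads
kv_heads = 8             # --num_key_value_heads
intermediate = 28672     # --llama_intermediate_size
vocab = 32000            # assumed Llama 2 tokenizer vocabulary size

head_dim = hidden // heads
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)  # q/o plus k/v projections (GQA)
mlp = 3 * hidden * intermediate                                  # gate, up, down projections (SwiGLU)
norms = 2 * hidden                                               # two RMSNorm weights per layer
per_layer = attn + mlp + norms

embeddings = 2 * vocab * hidden + hidden                         # token embeddings + lm_head + final norm
total = layers * per_layer + embeddings
print(f"{total:,} parameters (~{total / 1e9:.2f} B)")            # ~68,976,648,192 (~68.98 B)
```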
Full command:
srun -l ./pt_fsdp_haha/bin/torchrun \
    --nproc_per_node=8 --nnodes=2 \
    --rdzv_id=324 --rdzv_backend=c10d --rdzv_endpoint=p4de-st-p4de-1 \
    ./train.py \
    --num_key_value_heads=8 \
    --llama_intermediate_size=28672 \
    --hidden_width=8192 \
    --num_layers=80 \
    --num_heads=64 \
    --checkpoint_dir=/fsx/marcverd/awsome-distributed-training/3.test_cases/10.FSDP/chkpts \
    --max_context_width=4096 \
    --model_type=llama_v2 \
    --tokenizer=hf-internal-testing/llama-tokenizer \
    --checkpoint_freq=1 \
    --validation_freq=500 \
    --max_steps=4 \
    --epochs=1 \
    --dataset=c4 \
    --dataset_config_name=en \
    --train_batch_size=1 \
    --val_batch_size=1 \
    --sharding_strategy=full \
    --offload_activations=1