llm-foundry
Multi-node SLURM training
I want to ask about training with the sbatch command. I'm training a 7B-parameter model, but when I submit the job with more than one node, training apparently only sees a single node. I used the default parameters.
sbatch file
```bash
#!/bin/bash
#SBATCH --job-name=train_llm
#SBATCH --output=../output/train_llm.log
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2

composer train/train.py train/yamls/mpt/7b.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train \
  eval_loader.dataset.split=val \
  max_duration=300ep \
  eval_interval=0 \
  save_folder=mpt-7b-Data-my-copy
```
7b.yaml
```yaml
data_local: ./my-copy-c4
data_remote: # If blank, files must be present in data_local
max_seq_len: 2048
global_seed: 17

# Run Name
run_name: # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
model:
  name: mpt_causal_lm
  init_device: meta
  d_model: 4096
  n_heads: 32
  n_layers: 32
  expansion_ratio: 4
  max_seq_len: ${max_seq_len}
  vocab_size: 50368
  attn_config:
    attn_impl: triton

tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 1.2e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 63900ba # ~ 134B tokens
eval_interval: 150ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 124

# System
seed: ${global_seed}
device_eval_batch_size: 8
device_train_microbatch_size: 8
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
# save_interval: 5000ba
# save_num_checkpoints_to_keep: 1 # Important, this cleans up checkpoints saved to DISK
# save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from local filesystem or remote object store
# load_path: ./gpt-7b/checkpoints/latest-rank{rank}.pt
# load_path: s3://my-bucket/my-folder/gpt-7b/checkpoints/latest-rank{rank}.pt
```
Hello @j-Gaow, for multi-node training, Composer expects several environment variables to be set on each node. With the sbatch script above they are never set, so each node starts its own single-node run. For more details, see this notebook on submitting a multi-node job with Composer on SLURM using submitit: https://github.com/mosaicml/composer/blob/dev/examples/training_with_submitit.ipynb
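For reference, here is a minimal sketch of an sbatch script that derives those values from SLURM and passes them through the `composer` launcher's CLI flags (`-n`, `--world_size`, `--node_rank`, `--master_addr`, `--master_port`). It assumes 8 GPUs per node and that port 29500 is free on every node; adapt it to your cluster before use.

```bash
#!/bin/bash
#SBATCH --job-name=train_llm
#SBATCH --output=../output/train_llm.log
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2

# Rendezvous point shared by all nodes: the first node's hostname
# and a port assumed to be free on every node (adapt as needed).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
GPUS_PER_NODE=8  # assumed; must match --gres above
WORLD_SIZE=$((SLURM_NNODES * GPUS_PER_NODE))

# srun launches one task per node; SLURM_NODEID is 0, 1, ... in each
# task, which maps directly onto the launcher's --node_rank. It is
# escaped so it expands per task, not once at submission time.
srun bash -c "composer \
  -n ${GPUS_PER_NODE} \
  --world_size ${WORLD_SIZE} \
  --node_rank \${SLURM_NODEID} \
  --master_addr ${MASTER_ADDR} \
  --master_port ${MASTER_PORT} \
  train/train.py train/yamls/mpt/7b.yaml \
  data_local=my-copy-c4 \
  train_loader.dataset.split=train \
  eval_loader.dataset.split=val \
  max_duration=300ep eval_interval=0 \
  save_folder=mpt-7b-Data-my-copy"
```

The linked submitit notebook accomplishes the same thing from Python and is the supported reference; the script above only illustrates which values have to agree across nodes.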
Closing as stale -- please re-open if there are more issues.