llm-foundry

Multi-node SLURM training

Open j-Gaow opened this issue 2 years ago • 1 comments

I'd like to ask about training with SLURM. I'm training a 7B-parameter model, but when I request more than one node, the job only sees one node. I used the default parameters.

sbatch file

#!/bin/bash

#SBATCH --job-name=train_llm
#SBATCH --output=../output/train_llm.log
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
composer train/train.py train/yamls/mpt/7b.yaml data_local=my-copy-c4  train_loader.dataset.split=train eval_loader.dataset.split=val max_duration=300ep eval_interval=0 save_folder=mpt-7b-Data-my-copy

7b.yaml

data_local: ./my-copy-c4
data_remote: # If blank, files must be present in data_local
max_seq_len: 2048
global_seed: 17

# Run Name
run_name: # If left blank, will be read from env var $COMPOSER_RUN_NAME

# Model
model:
  name: mpt_causal_lm
  init_device: meta
  d_model: 4096
  n_heads: 32
  n_layers: 32
  expansion_ratio: 4
  max_seq_len: ${max_seq_len}
  vocab_size: 50368
  attn_config:
    attn_impl: triton


tokenizer:
  name: EleutherAI/gpt-neox-20b
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train
    shuffle: true
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: true
  num_workers: 8

eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: val
    shuffle: false
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
  drop_last: false
  num_workers: 8

# Optimization
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1

optimizer:
  name: decoupled_adamw
  lr: 1.2e-4
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 63900ba # ~ 134B tokens
eval_interval: 150ba
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 124

# System
seed: ${global_seed}
device_eval_batch_size: 8
device_train_microbatch_size: 8
# device_train_microbatch_size: auto
precision: amp_bf16

# FSDP
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: PURE
  activation_checkpointing: true
  activation_checkpointing_reentrant: false
  activation_cpu_offload: false
  limit_all_gathers: true
  verbose: false

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
# save_interval: 5000ba
# save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
# save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoints

# Load from local filesystem or remote object store
# load_path: ./gpt-7b/checkpoints/latest-rank{rank}.pt
# load_path: s3://my-bucket/my-folder/gpt-7b/checkpoints/latest-rank{rank}.pt

j-Gaow, May 29 '23 08:05

Hello @j-Gaow, for multi-node training, Composer expects several environment variables to be set on every node (such as WORLD_SIZE, NODE_RANK, MASTER_ADDR, and MASTER_PORT). For more details, see this notebook on submitting a multi-node job with Composer on SLURM using submitit: https://github.com/mosaicml/composer/blob/dev/examples/training_with_submitit.ipynb
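As a rough illustration of what that setup involves, here is a minimal sketch of a per-node launcher that derives the Composer launcher arguments from SLURM's own variables. It is not taken from the linked notebook. The file name `launch_node.py`, the `GPUS_PER_NODE` variable, and port 7501 are assumptions made for the example, and it assumes every node has the same number of GPUs. It relies on the `composer` launcher's `--nproc`, `--world_size`, `--node_rank`, `--master_addr`, and `--master_port` options and on SLURM setting `SLURM_NNODES`, `SLURM_NODEID`, and `SLURM_JOB_NODELIST` for tasks started by `srun`.

```python
import os
import subprocess
import sys


def composer_command(env, master_addr, gpus_per_node, train_args, master_port=7501):
    """Build the `composer` command for the current node from SLURM variables.

    Assumes every node has `gpus_per_node` GPUs. The port is an arbitrary choice.
    """
    num_nodes = int(env["SLURM_NNODES"])   # number of nodes in the allocation
    node_rank = int(env["SLURM_NODEID"])   # 0-based index of this node
    return [
        "composer",
        "--nproc", str(gpus_per_node),
        "--world_size", str(num_nodes * gpus_per_node),
        "--node_rank", str(node_rank),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        *train_args,
    ]


def first_hostname(nodelist):
    """Expand a compact SLURM node list and return its first hostname."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()[0]


# Only launch when running inside a SLURM task (one task per node).
if __name__ == "__main__" and "SLURM_NODEID" in os.environ:
    master = first_hostname(os.environ["SLURM_JOB_NODELIST"])
    # GPUS_PER_NODE is a hypothetical variable; 8 matches --gres=gpu:8 above.
    gpus = int(os.environ.get("GPUS_PER_NODE", "8"))
    cmd = composer_command(os.environ, master, gpus, sys.argv[1:])
    sys.exit(subprocess.call(cmd))
```

With a script like this, the last line of the sbatch file would start with `srun python launch_node.py train/train.py train/yamls/mpt/7b.yaml` followed by the same overrides, instead of calling `composer` directly. Because the batch script itself runs only on the first allocated node, a bare `composer` call never reaches the second node, while `srun` with `--ntasks-per-node=1` starts one launcher per node.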

hanlint, May 29 '23 16:05

Closing as stale. Please re-open if there are more issues.

hanlint, Jul 23 '23 21:07