[Chatllama] Supervised Finetune on llama-7B
Hi! I downloaded the SHP dataset and was trying to run the actor training. I ran into several issues with vanilla python, torchrun, and deepspeed.

For python artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, I got ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set, and then the same error in turn for the other ENV variables RANK, WORLD_SIZE, and MASTER_ADDR. Why would setup_model_parallel() be called at all when the training runs on a single GPU?
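For a single-GPU run, the env:// rendezvous just needs those variables to be set before the process group is initialized, so they can also be set by hand. This is only a minimal sketch and assumes setup_model_parallel() initializes torch.distributed with init_method="env://", like the original LLaMA example code:

import os

# Single-process defaults for the env:// rendezvous (illustrative values).
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

The same variables can also be exported in the shell, which is effectively what torchrun does for you.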
Then I tried torchrun artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, which does set those ENV variables, but the training loss was nan.
I also tried deepspeed artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, but got the assertion error below. Since there is no sharding for the llama-7B checkpoint, does that mean world_size can only be 1?
Traceback (most recent call last):
File "artifacts/main.py", line 54, in <module>
actor_trainer = ActorTrainer(config.actor)
File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/rlhf/actor.py", line 292, in __init__
self.model = ActorModel(config)
File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/rlhf/actor.py", line 54, in __init__
self.model, self.tokenizer = load_model(
File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/llama_model.py", line 598, in load_model
checkpoint, params = load_checkpoints(ckpt_dir, local_rank, world_size)
File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/llama_model.py", line 576, in load_checkpoints
assert world_size == len(checkpoints), (
AssertionError: Loading a checkpoint for MP=1 but world size is 8
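If I read the assertion right, it compares the launcher's world size with the number of checkpoint shards found on disk: the 7B release ships a single consolidated.00.pth, while deepspeed on this 8-GPU box spawns 8 processes. A rough, illustrative sketch of that check (assuming load_checkpoints globs the *.pth files like the original LLaMA loading code does):

from pathlib import Path

def count_shards(ckpt_dir: str) -> int:
    # One consolidated.*.pth file per model-parallel rank; 7B has exactly one.
    return len(sorted(Path(ckpt_dir).glob("*.pth")))

# Mirrors the library's check: 8 launched processes vs. 1 shard on disk -> AssertionError
mp_size = count_shards("/persist/hzhang/llama_ckpt/7B/")
world_size = 8
assert world_size == mp_size, f"Loading a checkpoint for MP={mp_size} but world size is {world_size}"

So unless the checkpoint is resharded, launching a single process (e.g. torchrun --nproc_per_node 1 or deepspeed --num_gpus 1) is presumably what keeps the world size at 1.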
Hello @TonyZhanghm, thank you very much for reaching out. I'll investigate the errors you are getting. Could you please share with us the config.yaml
file you are currently using?
I also ran into the issue that the training loss stayed nan... I am really looking forward to your solutions ;)
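In case it helps narrow this down, here is a generic PyTorch sketch for catching where the nan first shows up (not chatllama-specific; check_finite and the places it would be called from are illustrative):

import torch

# Raise inside backward() at the first op that produces nan/inf.
torch.autograd.set_detect_anomaly(True)

def check_finite(name: str, tensor: torch.Tensor) -> None:
    # Flag the first tensor that goes non-finite (logits, loss, gradients, ...).
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains nan/inf values")

# In the training loop one would call, for example:
#   check_finite("logits", logits)
#   check_finite("loss", loss)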
Then I tried torchrun artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, which does set those ENV variables, but the training loss was nan.
May I ask what your GPU memory specifications are? I tested on an A10 and ran into a CUDA out-of-memory error.
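For comparing setups, PyTorch can report the peak memory the run actually used; a minimal sketch (device index 0 assumed):

import torch

# Peak GPU memory allocated by tensors since the start of the program (or last reset).
peak_gib = torch.cuda.max_memory_allocated(device=0) / 1024**3
print(f"peak allocated: {peak_gib:.1f} GiB")

# torch.cuda.memory_summary(device=0) gives a more detailed breakdown.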
@diegofiori Here's the config. I didn't change much besides filling in the weight paths and the downloaded data:
---
trainer_config:
  # learning rates
  actor_lr: 0.00001
  critic_lr: 0.00001
  # PPO Hyperparameters
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.1
  # path to examples to be sampled (training dataset) see rlhf_dataset.json
  examples_path: "./SHP_datasets/rlhf_training_data.json"
  # number of episodes and generation performed for each episode
  # in the train() method
  num_episodes: 100
  max_timesteps: 32
  # number of timesteps after which the learn() method is called
  # (to update the weights)
  update_timesteps: 32
  # number of example sampled at each timestep
  num_examples: 32
  # batch and epochs for the training
  batch_size: 1
  epochs: 1
  # number of learning steps (i.e. learn()) after which a checkpoint is saved
  update_checkpoint: 8
  checkpoint_folder: "./models/checkpoints"

actor_config:
  model: "llama-7B"
  model_path: "/persist/hzhang/llama_ckpt/7B/"
  checkpoint_folder: "./models"
  tokenizer_folder: "/persist/hzhang/llama_ckpt/tokenizer.model"
  train_dataset_path: "./SHP_datasets/actor_training_data.json"
  validation_dataset_path: null
  # froze model embedding during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  use_fairscale: False
  # max sequence length for the actor (i.e. prompt + completion) it depends on
  # the model used.
  max_sequence_length: 1024
  # max tokens generated by the actor (completion only)
  max_tokens: 512
  # temperature for the actor
  temperature: 0.9
  batch_size: 1
  # number iteration after print
  iteration_per_print: 100
  lr: 0.0001
  epochs: 32
  # deepspeed settings
  deepspeed_enable: False
  deepspeed_config_path: "/persist/hzhang/nebullvm/apps/accelerate/chatllama/artifacts/config/ds_config.json"
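One easy thing to double-check when filling this in is the paths; here is a small sketch (PyYAML assumed, key names taken from the config above) that loads the YAML and flags anything that does not exist on disk:

import yaml
from pathlib import Path

with open("artifacts/config/config_new.yaml") as f:
    cfg = yaml.safe_load(f)

actor = cfg["actor_config"]
for key in ("model_path", "tokenizer_folder", "train_dataset_path"):
    path = actor.get(key)
    # validation_dataset_path is null in this config, so only check what is set
    if path is not None and not Path(path).exists():
        print(f"missing: {key} -> {path}")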
May I ask what your GPU memory specifications are? I tested on an A10 and ran into a CUDA out-of-memory error.
I was on an A100 80GB, with the default batch size of 1.
Hi @TonyZhanghm, thanks for your input. We are debugging all the issues you reported, and a more stable version will be out soon. We are currently working hard to support all models, both LLaMA and HF.
I have the same questions: