
[Chatllama] Supervised Finetune on llama-7B

Open TonyZhanghm opened this issue 1 year ago • 10 comments

Hi! I downloaded the SHP dataset and was trying to run the actor training, but I ran into several issues with vanilla python, torchrun, and deepspeed.

TonyZhanghm · Mar 10 '23 23:03

For python artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, I got ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set (and the same for WORLD_SIZE and MASTER_ADDR). Why would setup_model_parallel() be called at all when running the training on a single GPU?
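
As a workaround sketch (not from the chatllama codebase), defaulting the variables that torchrun would normally export lets a single-process run satisfy the env:// rendezvous; the address and port below are assumptions for a one-GPU job:

import os

import torch.distributed as dist

# Hypothetical workaround: default the rendezvous variables that torchrun
# would normally export, so a plain `python artifacts/main.py ...` run
# behaves like a single-process distributed job.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

if not dist.is_initialized():
    # nccl needs a GPU; gloo also works for a CPU-only smoke test
    dist.init_process_group(backend="nccl")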

TonyZhanghm · Mar 10 '23 23:03

Then I tried torchrun artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, which does set the env variables, but got a NaN training loss (screenshot attached).
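
A generic PyTorch debugging sketch (not chatllama-specific) that can help narrow down where the NaN first appears:

import torch

# Raise an error at the first backward op that produces NaN/inf,
# instead of letting it propagate silently into the loss.
torch.autograd.set_detect_anomaly(True)

def check_loss(loss: torch.Tensor, step: int) -> None:
    # Stop early and report the step where the loss degenerates.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss {loss.item()} at step {step}")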

TonyZhanghm · Mar 10 '23 23:03

I also tried deepspeed artifacts/main.py artifacts/config/config_new.yaml --type ACTOR but got the assertion error below. Since the llama-7B checkpoint is not sharded, does that mean world_size can only be 1?

Traceback (most recent call last):
  File "artifacts/main.py", line 54, in <module>
    actor_trainer = ActorTrainer(config.actor)
  File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/rlhf/actor.py", line 292, in __init__
    self.model = ActorModel(config)
  File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/rlhf/actor.py", line 54, in __init__
    self.model, self.tokenizer = load_model(
  File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/llama_model.py", line 598, in load_model
    checkpoint, params = load_checkpoints(ckpt_dir, local_rank, world_size)
  File "/var/lib/docker/persist/hzhang/nebullvm/apps/accelerate/chatllama/chatllama/llama_model.py", line 576, in load_checkpoints
    assert world_size == len(checkpoints), (
AssertionError: Loading a checkpoint for MP=1 but world size is 8
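
If I read load_checkpoints correctly, the assertion fires because the loader expects one process per checkpoint shard, and llama-7B ships a single consolidated shard while the deepspeed launcher started 8 processes. A rough illustration of the constraint (the checkpoint directory is my local path; the launch flags are the standard torchrun/deepspeed ones):

from pathlib import Path

ckpt_dir = Path("/persist/hzhang/llama_ckpt/7B")  # my local checkpoint dir
shards = sorted(ckpt_dir.glob("*.pth"))

# llama-7B has a single consolidated.00.pth shard, so the model-parallel
# world size must be 1. Launching with `deepspeed --num_gpus 1 ...` or
# `torchrun --nproc_per_node 1 ...` keeps world_size == len(shards).
print(f"{len(shards)} shard(s) found -> launch with world_size={len(shards)}")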

TonyZhanghm · Mar 10 '23 23:03

Hello @TonyZhanghm, thank you very much for reaching out. I'll investigate the errors you are getting. Could you please share with us the config.yaml file you are currently using?

diegofiori · Mar 11 '23 10:03

I also ran into the issue that the training loss stayed NaN... I am really looking forward to your solutions ;)

cmnfriend · Mar 11 '23 12:03

Then I tried torchrun artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, which does set the env variables, but got a NaN training loss (screenshot attached).

May I ask your GPU memory specifications? I tested on an A10 and ran into a CUDA out-of-memory problem.
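
For context, a back-of-the-envelope estimate of why a 24 GB A10 runs out of memory when fully fine-tuning a 7B model (rough numbers, assuming fp16 weights/gradients and fp32 Adam states; activations and framework overhead come on top):

# Rough memory footprint of fully fine-tuning a 7B-parameter model.
params = 7e9
fp16_weights_gb = params * 2 / 1e9   # ~14 GB just for the weights
fp16_grads_gb = params * 2 / 1e9     # ~14 GB for fp16 gradients
adam_states_gb = params * 8 / 1e9    # ~56 GB for fp32 Adam m and v

total_gb = fp16_weights_gb + fp16_grads_gb + adam_states_gb
print(f"~{total_gb:.0f} GB before activations -> far beyond a 24 GB A10")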

young-chao · Mar 12 '23 12:03

@diegofiori Here's the config; I didn't change much beyond filling in the weight and downloaded data paths.

---
trainer_config:
  # learning rates
  actor_lr: 0.00001
  critic_lr: 0.00001
  # PPO Hyperparameters
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.1
  # path to examples to be sampled (training dataset) see rlhf_dataset.json
  examples_path: "./SHP_datasets/rlhf_training_data.json"
  # number of episodes and generation performed for each episode
  # in the train() method
  num_episodes: 100
  max_timesteps: 32
  # number of timesteps after which the learn() method is called 
  # (to update the weights)
  update_timesteps: 32
  # number of examples sampled at each timestep
  num_examples: 32
  # batch and epochs for the training
  batch_size: 1
  epochs: 1
  # number of learning steps (i.e. learn()) after which a checkpoint is saved
  update_checkpoint: 8
  checkpoint_folder: "./models/checkpoints"

actor_config:
  model: "llama-7B"
  model_path: "/persist/hzhang/llama_ckpt/7B/"
  checkpoint_folder: "./models"
  tokenizer_folder: "/persist/hzhang/llama_ckpt/tokenizer.model"
  train_dataset_path: "./SHP_datasets/actor_training_data.json"
  validation_dataset_path: null
  # freeze the model embeddings during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  use_fairscale: False
  # max sequence length for the actor (i.e. prompt + completion); it depends on
  # the model used.
  max_sequence_length: 1024
  # max tokens generated by the actor (completion only)
  max_tokens: 512
  # temperature for the actor
  temperature: 0.9
  batch_size: 1
  # number of iterations between prints
  iteration_per_print: 100
  lr: 0.0001
  epochs: 32
  # deepspeed settings
  deepspeed_enable: False
  deepspeed_config_path: "/persist/hzhang/nebullvm/apps/accelerate/chatllama/artifacts/config/ds_config.json"
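
Not part of chatllama, just a quick sanity check one can run over this config before training (assumes PyYAML and that the file is saved as artifacts/config/config_new.yaml):

import sys
from pathlib import Path

import yaml

cfg = yaml.safe_load(Path("artifacts/config/config_new.yaml").read_text())
actor = cfg["actor_config"]

# The paths in the config must exist on the training machine, and the
# completion length must fit inside the actor's context window.
for key in ("model_path", "tokenizer_folder", "train_dataset_path"):
    if not Path(actor[key]).exists():
        sys.exit(f"missing path for {key}: {actor[key]}")

assert actor["max_tokens"] <= actor["max_sequence_length"], (
    "max_tokens (completion) cannot exceed max_sequence_length"
)
print("config looks consistent")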

TonyZhanghm · Mar 13 '23 18:03

Then I tried torchrun artifacts/main.py artifacts/config/config_new.yaml --type ACTOR, which does set the env variables, but got a NaN training loss (screenshot attached).

May I ask your GPU memory specifications? I tested on an A10 and ran into a CUDA out-of-memory problem.

I was on an A100 80GB, with the default batch size of 1.

TonyZhanghm · Mar 13 '23 18:03

Hi @TonyZhanghm, thanks for your input. We are debugging all the issues you reported, and a more stable version will be out soon. We are currently working to support all models, LLaMA + HF.

PierpaoloSorbellini · Mar 14 '23 08:03

I have the same questions (screenshot attached).

Ageliss · Mar 16 '23 11:03