
Medusa Training Loss

TomYang-TZ opened this issue 1 year ago • 9 comments

When training with Axolotl, the training loss drops to 0 after the gradient accumulation steps. Is this expected behaviour?

[image: Axolotl training loss curve dropping to 0]

With torchrun, the training loss consistently remains NaN.

[image: torchrun training loss showing NaN]
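One quick way to narrow down where a NaN first appears, assuming a standard PyTorch forward/backward pass, is to arm anomaly detection plus a forward hook. This is a generic debugging sketch, not part of the original report; `arm_nan_checks` and `model` are illustrative names for whatever module the trainer builds:

```python
import torch

def arm_nan_checks(model: torch.nn.Module) -> None:
    """Fail fast at the first non-finite tensor instead of logging NaN losses."""
    # Raises on the backward op that produces NaN/Inf (slows training; debug only).
    torch.autograd.set_detect_anomaly(True)

    def nan_hook(module, inputs, output):
        # Flag the first module whose forward output contains NaN/Inf.
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            raise RuntimeError(f"non-finite forward output from {module.__class__.__name__}")

    for m in model.modules():
        m.register_forward_hook(nan_hook)
```

Calling `arm_nan_checks(model)` before the first training step turns a silent NaN loss into an immediate error naming the offending module.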

Thanks for the help!! Here is the training configuration:

```yaml
base_model: teknium/OpenHermes-2.5-Mistral-7B
base_model_config: teknium/OpenHermes-2.5-Mistral-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: false

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ShareGPT_Vicuna_unfiltered/ShareGPT_V4.3_unfiltered_cleaned_split.json
    type: sharegpt
dataset_prepared_path:
val_set_size: 0.1
output_dir: ./openhermes7B_medusa_stage1

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0005

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
use_reentrant: True

warmup_steps: 40
eval_steps: 0.01
evaluation_strategy: steps
save_strategy: steps
save_steps:
save_total_limit: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"

medusa_num_heads: 5
medusa_num_layers: 1
medusa_heads_coefficient: 0.2
medusa_decay_coefficient: 0.8
medusa_logging: true
medusa_scheduler: constant
medusa_lr_multiplier: 4.0
medusa_only_heads: true
ddp_find_unused_parameters: true
```
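For context on what this training loss actually is: `medusa_heads_coefficient` and `medusa_decay_coefficient` combine the per-head losses roughly as in the sketch below. This is a simplified reconstruction of the weighting scheme described in the Medusa paper, not the exact Axolotl/Medusa code; `medusa_logits` and the exact shift offsets are illustrative (the base LM head covers the next token, so head k is scored further ahead):

```python
import torch.nn.functional as F

def medusa_loss(medusa_logits, labels, heads_coefficient=0.2, decay_coefficient=0.8):
    # medusa_logits: list of K tensors, each (batch, seq_len, vocab)
    # labels: (batch, seq_len), with -100 marking masked (non-target) positions
    loss = 0.0
    for k, logits in enumerate(medusa_logits):
        # Head k is scored against labels shifted (k + 2) positions ahead
        # (the base LM head already handles the shift-by-1 target).
        shift = k + 2
        shifted_logits = logits[:, :-shift, :].reshape(-1, logits.size(-1))
        shifted_labels = labels[:, shift:].reshape(-1)
        head_loss = F.cross_entropy(shifted_logits, shifted_labels, ignore_index=-100)
        # Later heads are geometrically down-weighted.
        loss = loss + heads_coefficient * (decay_coefficient ** k) * head_loss
    return loss
```

Note the `ignore_index` corner case: if every label in a head's shifted window is -100, `F.cross_entropy` with mean reduction returns NaN (0/0), which is one plausible way a run ends up reporting NaN loss.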

TomYang-TZ · Apr 07 '24 15:04

I am also facing the same issue with the Mistral example listed in the repo.

vivekmadan2 · Apr 08 '24 19:04

Same issue.

FatPigeorz · Apr 10 '24 12:04

Have you solved this problem?

xiaoruirui356 · Apr 12 '24 04:04

Unfortunately, no.

TomYang-TZ · Apr 12 '24 12:04

I found some problems with the data; you should check yours.
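For example, a minimal scan like the following (a sketch; the path comes from the config above, and the field names assume the standard ShareGPT schema) flags samples with empty or whitespace-only turns, which can leave every label masked and zero out, or NaN out, the loss:

```python
import json

# Path taken from the config above; adjust for your setup.
path = "ShareGPT_Vicuna_unfiltered/ShareGPT_V4.3_unfiltered_cleaned_split.json"
with open(path) as f:
    data = json.load(f)

bad = []
for idx, sample in enumerate(data):
    convs = sample.get("conversations", [])
    # Flag empty conversations and empty/whitespace-only turns:
    # fully-masked samples contribute no valid target tokens.
    if not convs or any(not turn.get("value", "").strip() for turn in convs):
        bad.append(idx)

print(f"{len(bad)} suspicious samples out of {len(data)}")
```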

xiaoruirui356 · May 06 '24 07:05

Have you solved this problem? I'm facing the same issue.

YuanlinChu · Feb 26 '25 07:02