
Axolotl has significantly higher train loss and longer train time compared with my training script.

Open hengjiUSTC opened this issue 1 year ago • 6 comments

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

The two training runs should produce nearly identical results.

Current behavior

I am testing Axolotl's performance against trl's SFTTrainer on a small dataset (300 rows, HenryJJ/tangshi). My SFTTrainer testing script is very simple; it's just a refactor of the official trl SFTTrainer example (https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py). I ran two trainings with essentially the same setup:

  1. Axolotl: python3 -m axolotl.cli.train llama2.yml with the config https://github.com/hengjiUSTC/learn-llm/blob/main/axolotl_configs/llama7b_tangshi.yml
  2. My SFTTrainer script (https://github.com/hengjiUSTC/learn-llm/blob/main/trl_finetune.py#L492): python3 trl_finetune.py --config configs/llama2_tangshi.yml, where the config parameters match the Axolotl config (https://github.com/hengjiUSTC/learn-llm/blob/main/configs/llama2_tangshi.yml)

The training results differ a lot: Axolotl is slower and converges to a higher loss. Axolotl result: train loss 2.8, eval loss 2.67 for 1 epoch, train time 468 s. (Screenshots attached.)

My script's result: train loss 2.09, eval loss 1.7 for 1 epoch, train time 301 s. (Screenshots attached.)

On the same dataset with the same parameter settings, Axolotl produces a different train loss and train time than my SFTTrainer script. The loss makes me suspect the trained LoRA is broken.

Am I not using Axolotl correctly? What is causing this difference?

Steps to reproduce

  1. Axolotl: python3 -m axolotl.cli.train llama2.yml with the config https://github.com/hengjiUSTC/learn-llm/blob/main/axolotl_configs/llama7b_tangshi.yml

  2. My SFTTrainer script (https://github.com/hengjiUSTC/learn-llm/blob/main/trl_finetune.py#L492): python3 trl_finetune.py --config configs/llama2_tangshi.yml, where the config parameters match the Axolotl config (https://github.com/hengjiUSTC/learn-llm/blob/main/configs/llama2_tangshi.yml)

Config yaml

base_model: NousResearch/Llama-2-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: HenryJJ/tangshi
    type: alpaca

# dataset_prepared_path: tangshi
val_set_size: 0.1
output_dir: tangshi-llama-2

sequence_len: 1024
sample_packing: false  
pad_to_sequence_len: true

adapter: qlora
lora_model_dir:
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
# lora_modules_to_save:
#   - embed_tokens
#   - lm_head

wandb_project: llama2-axolotl-tangshi
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
max_grad_norm: 0.3
lr_scheduler: cosine
learning_rate: 1e-4
warmup_steps: 30
weight_decay: 0.05

train_on_inputs: false
group_by_length:
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 10
xformers_attention:
flash_attention: true

evals_per_epoch: 5
save_steps:
save_safetensors: false
save_total_limit: 2
debug: true
deepspeed:
fsdp:
fsdp_config:
# resize_token_embeddings_to_32x: true
special_tokens:
#   eos_token: "<|im_end|>"
  pad_token: "<unk>"
# tokens:
#   - "<|im_start|>"

Possible solution

Not sure

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

hengjiUSTC avatar Jan 09 '24 11:01 hengjiUSTC

It's hard to say. With 300 rows and 10% held out for the eval split, randomness in a dataset that small could lead to train-loss differences. Also, I don't know enough about the TRL trainer, but with your configuration, does trl calculate loss across all the tokens? When I look at this, I'm not sure how trl is masking out the input. If it isn't masking it, I would definitely expect the loss to be different. https://github.com/hengjiUSTC/learn-llm/blob/main/configs/llama2_tangshi.yml#L48-L60
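To illustrate what masking the input means here: roughly (this is a toy sketch, not axolotl's actual code), with train_on_inputs: false the prompt tokens get label -100, which CrossEntropyLoss ignores by default, so only the completion contributes to the reported loss; a trainer that copies input_ids straight into labels averages over every token instead.

```python
# Toy illustration (not axolotl's actual code) of why label masking changes the reported loss.
# Prompt tokens labeled -100 are skipped by CrossEntropyLoss (its default ignore_index),
# so the averaged loss covers only the completion. The causal shift-by-one is omitted here.
import torch

vocab_size = 8
logits = torch.randn(1, 6, vocab_size)          # [batch, seq_len, vocab]
input_ids = torch.tensor([[3, 5, 1, 4, 2, 7]])  # toy token ids
prompt_len = 3                                  # pretend the first 3 tokens are the instruction

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)

# Loss over all tokens (labels are just a copy of input_ids):
labels_all = input_ids.clone()
loss_all = loss_fn(logits.view(-1, vocab_size), labels_all.view(-1))

# Completion-only loss (prompt labels masked to -100):
labels_masked = input_ids.clone()
labels_masked[:, :prompt_len] = -100
loss_masked = loss_fn(logits.view(-1, vocab_size), labels_masked.view(-1))

print(loss_all.item(), loss_masked.item())  # the two averages generally differ
```

If trl is averaging over every token while axolotl only scores the completion, the two runs aren't even reporting the same quantity, so the loss curves wouldn't be comparable.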

winglian avatar Jan 09 '24 14:01 winglian

Just another insight.

sample_packing: true

Try enabling this to see whether it improves things; a sketch of the config change is below.
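In the config posted above, that section would become something like:

```yaml
sequence_len: 1024
sample_packing: true
pad_to_sequence_len: true
```

With only ~300 short rows, packing several examples into each 1024-token sequence mainly reduces the number of optimizer steps, so I'd expect it to help wall-clock time more than loss.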

NanoCode012 avatar Jan 10 '24 05:01 NanoCode012

@hengjiUSTC are you able to compare with the SFT trainer with proper label masking for instruct tuning?

winglian avatar Jan 11 '24 13:01 winglian

Glad to make the comparison.

Some follow-up questions to help me understand; I didn't fully follow your previous response:

  1. What do you mean by proper label masking? I think you are suspecting that trl's SFTTrainer calculates loss with a different method; I don't recall any SFTTrainer parameter related to this setting.
  2. Can you point me to the relevant code or docs in axolotl?
  3. Can you point me to parameters in SFTTrainer (https://github.com/huggingface/trl/blob/v0.7.9/trl/trainer/sft_trainer.py#L53) or any docs I can check that might explain the differences?

hengjiUSTC avatar Jan 11 '24 14:01 hengjiUSTC

You have completion-only set to false with trl. You should start there; it probably needs to be true for that trainer to set the labels properly.
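For reference, a minimal sketch of completion-only training with trl (the rough equivalent of axolotl's train_on_inputs: false) uses DataCollatorForCompletionOnlyLM. This assumes an Alpaca-style "### Response:" template and the instruction/output field names of the dataset, and omits the QLoRA/PEFT and TrainingArguments setup; adjust to your script.

```python
# Minimal sketch of completion-only loss with trl's SFTTrainer.
# Assumptions: Alpaca-style "### Response:" marker, instruction/output columns,
# and default training arguments. QLoRA/PEFT setup is omitted for brevity.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model_name = "NousResearch/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.unk_token  # mirrors pad_token: "<unk>" in the config above
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("HenryJJ/tangshi", split="train")

def formatting_prompts_func(example):
    # Alpaca-style template; the exact field names and template are an assumption.
    texts = []
    for i in range(len(example["instruction"])):
        texts.append(
            f"### Instruction:\n{example['instruction'][i]}\n\n"
            f"### Response:\n{example['output'][i]}"
        )
    return texts

# Tokens before "### Response:" get label -100, so only the completion enters the loss.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### Response:",
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    max_seq_length=1024,
    packing=False,          # required with the completion-only collator
    data_collator=collator,
)
trainer.train()
```

Note that with Llama tokenizers the response template can tokenize differently depending on context, so the trl docs suggest passing the template as token ids if the collator fails to find it.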

winglian avatar Jan 14 '24 19:01 winglian

Test after setting completion-only to true. SFTTrainer: (screenshots attached)

Axolotl: (screenshots attached)

Loss is almost equal!

Training time still differs, but I guess that might be related to packing.

hengjiUSTC avatar Jan 15 '24 14:01 hengjiUSTC

Closing, as the loss issue is solved.

NanoCode012 avatar Mar 30 '24 18:03 NanoCode012