Axolotl has significantly higher train loss and longer train time compared with my training script.
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
The two training runs should produce almost the same results.
Current behavior
I am running on a small dataset (300 rows of HenryJJ/tangshi) to test Axolotl's performance against trl's SFTTrainer. My SFTTrainer test script is very simple; it's just a refactor of the official trl SFTTrainer example (https://github.com/huggingface/trl/blob/main/examples/scripts/sft.py). I ran two training runs with basically the same setup:
- Axolotl:
  `python3 -m axolotl.cli.train llama2.yml` with the config https://github.com/hengjiUSTC/learn-llm/blob/main/axolotl_configs/llama7b_tangshi.yml
- My SFTTrainer script (https://github.com/hengjiUSTC/learn-llm/blob/main/trl_finetune.py#L492):
  `python3 trl_finetune.py --config configs/llama2_tangshi.yml` with the same config parameters as the axolotl config (https://github.com/hengjiUSTC/learn-llm/blob/main/configs/llama2_tangshi.yml)
The training results differ a lot: Axolotl is slower and converges to a higher loss.
Axolotl result:
train loss: 2.8, eval loss: 2.67 for 1 epoch; train time: 468 s
My script result:
train loss: 2.09, eval loss: 1.7 for 1 epoch; train time: 301 s
On the same dataset with the same parameter settings, axolotl produces a different train loss and training time than my SFTTrainer script. The loss makes me suspect the trained LoRA is broken.
Am I not using Axolotl correctly? What is causing this result?
Steps to reproduce
- Axolotl:
  `python3 -m axolotl.cli.train llama2.yml` with the config https://github.com/hengjiUSTC/learn-llm/blob/main/axolotl_configs/llama7b_tangshi.yml
- My SFTTrainer script (https://github.com/hengjiUSTC/learn-llm/blob/main/trl_finetune.py#L492):
  `python3 trl_finetune.py --config configs/llama2_tangshi.yml` with the same config parameters as the axolotl config (https://github.com/hengjiUSTC/learn-llm/blob/main/configs/llama2_tangshi.yml)
Config yaml
base_model: NousResearch/Llama-2-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true
trust_remote_code: true
load_in_8bit: false
load_in_4bit: true
strict: false
datasets:
- path: HenryJJ/tangshi
type: alpaca
# dataset_prepared_path: tangshi
val_set_size: 0.1
output_dir: tangshi-llama-2
sequence_len: 1024
sample_packing: false
pad_to_sequence_len: true
adapter: qlora
lora_model_dir:
lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
# lora_modules_to_save:
# - embed_tokens
# - lm_head
wandb_project: llama2-axolotl-tangshi
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
max_grad_norm: 0.3
lr_scheduler: cosine
learning_rate: 1e-4
warmup_steps: 30
weight_decay: 0.05
train_on_inputs: false
group_by_length:
bf16: true
fp16: false
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 10
xformers_attention:
flash_attention: true
evals_per_epoch: 5
save_steps:
save_safetensors: false
save_total_limit: 2
debug: true
deepspeed:
fsdp:
fsdp_config:
# resize_token_embeddings_to_32x: true
special_tokens:
# eos_token: "<|im_end|>"
pad_token: "<unk>"
# tokens:
# - "<|im_start|>"
Possible solution
Not sure.
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
It's hard to say. With only 300 rows and 10% held out for the eval split, a dataset that small could produce train loss differences just from randomness. Also, I don't know enough about the TRL trainer, but with your configuration, does trl calculate the loss across all of the tokens? Looking at this, I'm not sure how trl is masking out the input. If it isn't, I would definitely expect the loss to be different. https://github.com/hengjiUSTC/learn-llm/blob/main/configs/llama2_tangshi.yml#L48-L60
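To make the masking point concrete, here is a minimal, illustrative sketch (not axolotl's or trl's actual code) of what prompt/label masking means: prompt tokens get the label -100, which torch.nn.CrossEntropyLoss ignores, so the loss is averaged over the response tokens only. axolotl does this when train_on_inputs: false; if trl computes the loss over every token instead, the two train losses are measuring different things.

```python
# Illustrative sketch of prompt-label masking (not library code).
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(input_ids, prompt_len, mask_prompt=True):
    """Return causal-LM labels, optionally masking the prompt tokens."""
    labels = list(input_ids)
    if mask_prompt:
        # Masked positions contribute nothing to the cross-entropy loss.
        labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Hypothetical token ids: 5 prompt tokens followed by 3 response tokens.
input_ids = [101, 2023, 2003, 1037, 3291, 7099, 3437, 102]
print(build_labels(input_ids, prompt_len=5, mask_prompt=True))
# [-100, -100, -100, -100, -100, 7099, 3437, 102]  -> loss over the response only
print(build_labels(input_ids, prompt_len=5, mask_prompt=False))
# [101, 2023, 2003, 1037, 3291, 7099, 3437, 102]   -> loss over all tokens
```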
Just another insight.
sample_packing: true
Try enabling this to see if it improves things.
@hengjiUSTC are you able to re-run the comparison with the SFT trainer using proper label masking for instruct tuning?
Glad to make the comparison.
Some follow-up questions to help me grasp the picture here; I didn't fully understand your previous response:
- What do you mean by proper label masking? I think you are suspecting trl's SFTTrainer calculates the loss with a different method? I don't recall any SFTTrainer parameters related to this setting.
- Can you point me to the relevant code or docs in axolotl?
- Can you point me to parameters in SFTTrainer (https://github.com/huggingface/trl/blob/v0.7.9/trl/trainer/sft_trainer.py#L53) or any docs I can check that might explain the differences?
You have completion-only set to false with trl. You should start there; that should probably be true for that trainer to set the labels properly.
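For reference, here is a minimal sketch of how completion-only loss can be wired up with trl so that labels are masked the same way as axolotl with train_on_inputs: false. This is not the exact code in trl_finetune.py; the alpaca prompt format and the HenryJJ/tangshi field names (instruction/input/output) are assumptions, and the hyperparameters are trimmed to the essentials.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model_name = "NousResearch/Llama-2-7b-hf"       # base_model from the config above
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.unk_token       # mirrors pad_token: "<unk>" in the config
model = AutoModelForCausalLM.from_pretrained(model_name)

def to_text(example):
    # Alpaca-style prompt; everything after "### Response:" is the completion.
    prompt = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example.get('input', '')}\n\n"
        "### Response:\n"
    )
    return {"text": prompt + example["output"]}

dataset = load_dataset("HenryJJ/tangshi", split="train").map(to_text)

# The collator sets the labels of all tokens before "### Response:" to -100,
# so the loss is computed on response tokens only.
collator = DataCollatorForCompletionOnlyLM("### Response:", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=1024,                        # matches sequence_len: 1024
    data_collator=collator,
    packing=False,                              # the completion-only collator needs packing off
    args=TrainingArguments(output_dir="trl-completion-only", num_train_epochs=1),
)
trainer.train()
```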
Results after setting completion-only to true:
SFTTrainer:
Axolotl:
Loss is almost equal!
Although the training time still differs, I guess that might be related to packing.
Closing as the loss issue is solved.
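If packing is what is left, one way to check would be a packed run on the trl side as well; below is a hedged sketch that reuses model, tokenizer and the mapped dataset from the completion-only sketch above. As far as I know, trl's completion-only collator cannot be combined with packing, so this variant computes the loss over all tokens again and is only useful for the speed comparison.

```python
from transformers import TrainingArguments
from trl import SFTTrainer

packed_trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=1024,   # matches sequence_len: 1024 in the axolotl config
    packing=True,          # concatenate short samples into full-length sequences
    args=TrainingArguments(output_dir="trl-packed", num_train_epochs=1),
)
packed_trainer.train()
```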