FlagEmbedding
long-llm run for more than 1 epoch
If we follow the script settings of long-llm, the parameter num_train_epochs is set to 1, and training yields a really significant improvement over the original model. However, if we change the parameter to anything larger than 1 (I've tried 2 and 3), the resulting model is total garbage. The first picture shows the prediction for a prompt using the model trained for only 1 epoch; the second picture shows the prediction for the same prompt using the model trained for 3 epochs. Something is not right here, as I don't believe more epochs should lead to dramatically worse results.
In addition, I've tried the following:
- train a LoRA adapter on the original model for 1 epoch
- merge the adapter back into the original model; call it model 1
- train another LoRA adapter on model 1, also for 1 epoch
- merge that adapter back into model 1; call it model 2
- evaluate both model 1 and model 2: model 1 shows really good results compared to the original base model, whereas model 2 is also garbage, spitting out nonsensical, repetitive output (picture below)
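For reference, the merge step in the loop above can be sketched with the standard peft `merge_and_unload` API (the paths are placeholders, and this is only a sketch of how I did the merge, not the long-llm training code itself):

```python
def merge_lora_adapter(base_model_path, adapter_path, output_path):
    """Load a trained LoRA adapter onto a base model, fold the LoRA
    weights into the base weights, and save the merged model."""
    # Imported lazily so this sketch can be read/loaded without peft installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype="auto")
    model = PeftModel.from_pretrained(base, adapter_path)
    merged = model.merge_and_unload()  # folds LoRA deltas into the base weights
    merged.save_pretrained(output_path)
    AutoTokenizer.from_pretrained(base_model_path).save_pretrained(output_path)
```

Model 2 is produced by calling this twice: once on the original base model, then again with the merged output ("model 1") as the new base.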
Does anyone know why this happens? All other parameters remained the same across all experiments. It seems like this long-llm script only works in the 1-epoch setting, which is weird.
I'd really appreciate it if someone could provide some insight on this. Thanks in advance.
This is a weird issue. I'll try it myself very soon. Could you please share your training log? Is the loss for more than 1 epoch normal?
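Since the original log was overwritten (see below), one cheap way to recover the loss curve is to read it back out of the checkpoint's `trainer_state.json`, which the HF Trainer writes alongside each saved checkpoint. A minimal sketch (the checkpoint path is an assumption about your output layout):

```python
import json

def loss_curve(trainer_state_path):
    """Return (step, loss) pairs from a HF Trainer state file.
    Entries without a "loss" key (e.g. epoch markers, eval logs) are skipped."""
    with open(trainer_state_path) as f:
        state = json.load(f)
    return [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]

# e.g. loss_curve("model/l3_8b_1epoch_8gpu_t1/checkpoint-100/trainer_state.json")
```

If the loss spikes or collapses right after the epoch boundary, that would point at something in the second-epoch data pass rather than the model itself.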
Hi, I overwrote the log with the 1-epoch model's training log, but if you try the following setup, I believe you can reproduce the error:
```shell
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NCCL_BLOCKING_WAIT=0
export NCCL_DEBUG=INFO
export OMP_NUM_THREADS=1

output_name=l3_8b_1epoch_8gpu_t1

unsloth/bin/torchrun \
    --master_addr localhost \
    --master_port 6667 \
    --nnodes 1 \
    --node_rank 0 \
    --nproc_per_node 8 \
    train.py \
    --data_root data/long-llm \
    --output_dir model/$output_name \
    --model_name_or_path /mnt/cpfs-data/mashiyao/MODEL/LLaMa3-Instruct \
    --train_data "long-llm:gpt/one_detail_book.train.64K.json long-llm:gpt/one_detail_paper.train.64K.json long-llm:gpt/multi_detail_book.train.json long-llm:gpt/multi_detail_paper_short.train.json long-llm:gpt/multi_detail_paper_long.train.json long-llm:gpt/bio_book.train.json long-llm:longalpaca/train.json long-llm:redpajama/train.json[5000]" \
    --max_length 81920 \
    --group_by_length \
    --rope_theta 200e6 \
    --attn_impl flash_attention_2 \
    --gradient_checkpointing \
    --use_reentrant True \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --save_only_model \
    --save_strategy epoch \
    --logging_steps 5 \
    --bf16 \
    --lora_tune \
    --lora_extra_params embed_tokens \
    --load_in_4_bit \
    --chat_template llama-3
```
Also, I want to ask whether this training method can scale to the 70B LLaMA-3 Instruct model with the same hyperparameter settings?
I think it would work. Please report your results here if you'd like to give it a try :)
> This is a weird issue. I'll try it myself very soon. Could you please share your training log? Is the loss for more than 1 epoch normal?

@namespace-Pt Hi, have you tried finetuning this way and encountered the same issue?