LLaVA
[Usage] tokenization mismatch when finetuning v1.5-7b
Describe the issue
Issue: I have found some threads reporting the tokenization mismatch problem, but I am still confused. I downloaded the v1.5-7b weights from https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main and finetuned on the datasets from the paper. I adapted the command line to run on V100 GPUs. tokenizers.__version__ == '0.14.1'
Command:
WANDB_MODE=disabled deepspeed llava/train/train.py \
--deepspeed ./scripts/zero3.json \
--model_name_or_path /path/to/llm_weights/llava-v1.5-7b \
--version v1 \
--data_path ./playground/data/llava_v1_5_mix665k.json \
--image_folder ./playground/data \
--vision_tower /path/to/llm_weights/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter /path/to/llm_weights/llava-v1.5-7b/mm_projector.bin \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 False \
--fp16 True \
--output_dir ./checkpoints/llava-v1.5-7b \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True
Screenshots: (screenshot of the tokenization mismatch warning omitted)
Same problem. I found that </s> was not included when calculating round_len, since it is used to split the rounds. Could it be that the eos token is not automatically added?
The truth is that "USER" is tokenized as [11889] in the middle of the prompt, but as [1, 3148, 1001] at the head (with an automatically added bos token).
I tried to fix this WARNING by:
cur_len = 1 + 1 # 1 for bos, and 1 for compensating in the first round
...
round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1 # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
...
round_len = len(tokenizer(rou).input_ids) - 2 + 1
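To see the difference concretely, here is a minimal sketch (the checkpoint path is just the placeholder from the command above; the ids in the comments are the ones reported in this thread and depend on the installed tokenizers version):

# Minimal sketch of the "USER" tokenization difference described above.
# The path is a placeholder; the ids in the comments are those reported in this
# thread and depend on the tokenizers version in use.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/path/to/llm_weights/llava-v1.5-7b", use_fast=False
)

# At the head of a prompt a bos token is prepended and "USER" is split into two
# pieces, reported above as [1, 3148, 1001].
print(tokenizer("USER: Hi").input_ids)

# Right after the "</s>" round separator, "USER" is reported to become a single
# piece, [11889].
print(tokenizer("</s>USER: Hi").input_ids)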
I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added?
You can check my last modification; the mismatch is due to the different tokenization results of "USER" and the missing </s>.
Awesome! I tested the change from @yuyq96, and it worked. So the problem is caused by both "USER" and </s>. To clarify, the change should be made in the preprocess_v1 method in llava/train/train.py.
@haotian-liu Please double-check.
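To make the suggested change concrete, here is a rough, self-contained sketch of just the masking arithmetic. It is not the upstream preprocess_v1; the separators assume the vicuna_v1 conversation template (" ASSISTANT: " and "</s>"), a pad token is assumed to be set, and the offsets are exactly the ones proposed above:

# Hypothetical helper illustrating the offsets proposed above. This is NOT the
# upstream preprocess_v1 in llava/train/train.py, only the masking arithmetic;
# separators assume the vicuna_v1 conversation template.
import torch

IGNORE_INDEX = -100
SEP = " ASSISTANT: "   # instruction/answer separator in vicuna_v1
SEP2 = "</s>"          # round separator in vicuna_v1

def mask_targets_v1(conversation: str, input_ids: torch.Tensor, tokenizer) -> torch.Tensor:
    target = input_ids.clone()
    total_len = int(input_ids.ne(tokenizer.pad_token_id).sum())  # assumes a pad token is set
    cur_len = 1 + 1                      # 1 for bos, +1 compensating the first round
    target[:cur_len] = IGNORE_INDEX
    for rou in conversation.split(SEP2):
        if rou == "":
            break
        parts = rou.split(SEP)
        if len(parts) != 2:
            break
        parts[0] += SEP
        # -2: "USER" right after "</s>" re-tokenizes into fewer pieces than at the head
        # +1: the "</s>" stripped off by the split still occupies one position in input_ids
        round_len = len(tokenizer(rou).input_ids) - 2 + 1
        instruction_len = len(tokenizer(parts[0]).input_ids) - 2
        target[cur_len:cur_len + instruction_len] = IGNORE_INDEX  # mask the user turn
        cur_len += round_len
    target[cur_len:] = IGNORE_INDEX      # mask anything past the last round (padding)
    if cur_len != total_len:
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}")
    return target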
@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head".
So it seems the problem is caused by missing space before and after </s>
?
@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after
</s>
?
Yes, this will lead to different tokenization results with the LLaMA tokenizer.
@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after
</s>
?Yes, this will lead to different tokenization results with LLaMA tokenizer.
For the above case, can the tokenizer correctly separate "No" (or other words) before </s>? If not, the training would be harmed, so the better solution would be to modify the prompt. Also, I tried inserting a space before and after </s>, but the mismatch showed up again with the original code.
Hi, we have temporarily pinned the tokenizer version to "tokenizers>=0.12.1,<0.14" until we figure out what has changed in 0.14. You may run pip install "tokenizers>=0.12.1,<0.14" and try again. Thanks.
@yuyq96 Thanks for the fix, I'll take a look into this issue. Could this fix cause issues with earlier tokenizer versions? I feel that there have been some behavioral changes in the tokenizer.
Thanks, downgrading tokenizers to 0.12.1 and transformers to 4.31.0 solved the problem. I also tried inserting spaces before and after </s>, and the warning showed up again; I don't know why the extra spaces don't work.
@haotian-liu In my experiment, setting use_fast=True for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1. But I don't know why there is a mismatch when use_fast=False.
@zzzzzzrc I tried setting use_fast=True and it works. But I'm not sure whether it will affect the final performance. Do you have any suggestions?
Has this been fixed?
Setting use_fast=True works for my case.
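For reference, the flag under discussion is the one passed when the tokenizer is loaded in train.py. A sketch of the reported workaround (arguments mirror the finetuning command above; whether use_fast=True changes downstream behavior is exactly the open question here):

# Sketch of the reported workaround: load the tokenizer with use_fast=True instead
# of the slow (sentencepiece) tokenizer that the training script uses by default.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/path/to/llm_weights/llava-v1.5-7b",
    model_max_length=2048,
    padding_side="right",
    use_fast=True,   # reported to avoid the mismatch with transformers==4.34.1, tokenizers==0.14.1
)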
@haotian-liu This is a similar issue to the one FastChat met. The root cause is that Hugging Face introduced some bugs when dealing with added tokens. Please refer to the fix here.
round_len = len(tokenizer(rou).input_ids): for each round, the tokenizer will add a bos token (the bos of Vicuna), so I wonder whether the round_len calculation is right? Thanks.
I encountered the "tokenization mismatch" issue during fine-tuning as well. Upon investigation, I found that it was primarily caused by empty strings in the "value" field of QA turns, e.g. {"from": "human", "value": ""}, in the dataset. As a result, the prompt ended up containing the string "xxx USER:ASSISTANT: xxxx", which triggered the tokenization mismatch during preprocessing. I'm not sure if this experience is useful, but I thought I'd share it.
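If you suspect this failure mode, a quick sanity check over the dataset might look like the following (hypothetical snippet; it assumes the llava_v1_5_mix665k.json-style schema with a "conversations" list of {"from", "value"} turns per sample):

# Hypothetical sanity check for the failure mode described above: human turns with
# an empty "value" collapse the prompt into "... USER:ASSISTANT: ...".
import json

with open("./playground/data/llava_v1_5_mix665k.json") as f:
    data = json.load(f)

bad = [
    sample for sample in data
    if any(turn["from"] == "human" and not turn["value"].strip()
           for turn in sample.get("conversations", []))
]
print(f"{len(bad)} samples contain an empty human turn")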
Hi, I am training LLaVA with Qwen2 and get the same mismatch. Setting use_fast=True does not work. I'm just wondering, will it affect the training? How can this be fixed for arbitrary tokenizers, not just for LLaMA?
Hi, I have the same issue. Have you solved it?
Same when using LoRA to finetune v1.6-34b.
I have fixed the issue. You just need to make sure the inputs and targets are properly masked.
Can you share your tokenizer settings?
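One hypothetical way to verify that masking, independent of tokenizer version: decode only the label positions left unmasked and check that they read as exactly the assistant answers (and their closing </s>).

# Hypothetical debugging helper (not from this repo): decode only the positions
# that remain unmasked in the labels; they should cover the assistant answers.
IGNORE_INDEX = -100

def show_supervised_tokens(input_ids, labels, tokenizer) -> str:
    kept = [tok for tok, lab in zip(input_ids.tolist(), labels.tolist())
            if lab != IGNORE_INDEX]
    return tokenizer.decode(kept)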
Same when finetuning 1.5b.