
[Usage] tokenization mismatch when finetuning v1.5-7b

Open · Liu0329 opened this issue 1 year ago · 23 comments

Describe the issue

Issue: I have found some threads reporting the tokenization mismatch problem, but I am still confused. I downloaded the v1.5-7b weights from https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main and finetuned on the datasets from the paper. I adapted the command line to make it run on a V100. tokenizers.__version__ == '0.14.1'

Command:

WANDB_MODE=disabled deepspeed llava/train/train.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path /path/to/llm_weights/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower /path/to/llm_weights/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /path/to/llm_weights/llava-v1.5-7b/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 False \
    --fp16 True \
    --output_dir ./checkpoints/llava-v1.5-7b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

Screenshots: [WeCom screenshot of the warning]

Liu0329 · Oct 25 '23

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?

yuyq96 · Oct 25 '23

> Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?

The truth is that "USER" is tokenized as [11889] in the middle of the prompt, but as [1, 3148, 1001] at the head (with an automatically added bos token).

I tried to fix this WARNING by:

cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
...
round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
...
round_len = len(tokenizer(rou).input_ids) - 2 + 1
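
For reference, a quick way to reproduce the discrepancy (a minimal sketch; the model path is a placeholder, and the exact ids depend on the transformers/tokenizers versions, which is the crux of this thread):

from transformers import AutoTokenizer

# Placeholder path; load the same slow tokenizer that train.py uses (use_fast=False).
tokenizer = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)

print(tokenizer("USER: hi").input_ids)        # at the head: bos + "USER" split in two, e.g. [1, 3148, 1001, ...]
print(tokenizer("hello USER: hi").input_ids)  # mid-prompt: "USER" can come out as the single id 11889
print(tokenizer("hi</s>").input_ids)          # whether the literal </s> maps to eos (id 2) is version-dependent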

yuyq96 · Oct 25 '23

> Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?

I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added? [WeCom screenshot]

Liu0329 · Oct 25 '23

> Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?
>
> I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added? [WeCom screenshot]

You can check my last modification; the mismatch is due to the different tokenization results for "USER" and the missing </s>.

yuyq96 · Oct 25 '23

> Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?
>
> The truth is that "USER" is tokenized as [11889] in the middle of the prompt, but as [1, 3148, 1001] at the head (with an automatically added bos token).
>
> I tried to fix this WARNING by:
>
> cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
> ...
> round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
> ...
> round_len = len(tokenizer(rou).input_ids) - 2 + 1

Awesome! I tested your change, and it worked. So the problem is caused by both USER and </s>. To clarify, the change should be made in the preprocess_v1 method in llava/train/train.py. @haotian-liu please double-check.
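
For anyone applying this, a rough sketch of how the patched masking loop in preprocess_v1 might look (paraphrased, text-only branch without <image>; IGNORE_INDEX is -100 as in llava.constants; the offsets assume the tokenizers 0.14.x slow-tokenizer behavior discussed above):

import torch

IGNORE_INDEX = -100  # llava.constants.IGNORE_INDEX

def mask_targets_v1(conversation: str, target: torch.Tensor, tokenizer) -> None:
    # Rounds are split on "</s>" (conv.sep2), which is why the eos token itself
    # never shows up in len(tokenizer(rou).input_ids).
    rounds = conversation.split("</s>")
    sep = " ASSISTANT: "  # conv.sep + conv.roles[1] + ": "
    cur_len = 1 + 1  # 1 for bos, 1 compensating the first round (yuyq96's fix)
    target[:cur_len] = IGNORE_INDEX
    for rou in rounds:
        if rou == "":
            break
        parts = rou.split(sep)
        if len(parts) != 2:
            break
        parts[0] += sep
        # -2: "USER" tokenizes differently mid-prompt; +1: the "</s>" removed by split()
        round_len = len(tokenizer(rou).input_ids) - 2 + 1
        instruction_len = len(tokenizer(parts[0]).input_ids) - 2
        target[cur_len : cur_len + instruction_len] = IGNORE_INDEX  # mask the human turn
        cur_len += round_len
    target[cur_len:] = IGNORE_INDEX
    # train.py compares cur_len with the true sequence length here and prints
    # the "tokenization mismatch" warning when they disagree.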

Liu0329 · Oct 25 '23

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Liu0329 · Oct 25 '23

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Yes, this leads to different tokenization results with the LLaMA tokenizer.

yuyq96 · Oct 25 '23

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Yes, this will lead to different tokenization results with LLaMA tokenizer.

For the above case, can the tokenizer correctly separate "No" (or other words) from the following </s>? If not, training would be harmed, so the better solution would be to modify the prompt. I tried inserting spaces before and after </s>, but the mismatch appeared again with the original code.
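
For reference, the effect of the extra spaces can be inspected directly (a minimal sketch; the exact ids differ between tokenizers <0.14 and >=0.14, which is precisely the behavior change being chased here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)  # placeholder path

# Compare how text around a literal "</s>" is split with and without spaces.
for text in ("No</s>", "No </s>", "No </s> "):
    print(repr(text), "->", tokenizer(text).input_ids)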

Liu0329 · Oct 25 '23

Hi, we have temporarily pinned the tokenizer version to "tokenizers>=0.12.1,<0.14" until we figure out what changed in 0.14.

You may run pip install "tokenizers>=0.12.1,<0.14" and try again. Thanks.

haotian-liu · Oct 25 '23

@yuyq96 Thanks for the fix; I'll look into this issue. Might this fix cause issues with earlier tokenizer versions? I feel there were some behavioral changes in the tokenizer.

haotian-liu · Oct 25 '23

> Hi, we have temporarily pinned the tokenizer version to "tokenizers>=0.12.1,<0.14" until we figure out what changed in 0.14.
>
> You may run pip install "tokenizers>=0.12.1,<0.14" and try again. Thanks.

Thanks, downgrading tokenizers to 0.12.1 and transformers to 4.31.0 solved the problem. I also tried inserting spaces before and after </s>, and the warning appeared again; I don't know why the extra spaces don't work.

Liu0329 · Oct 26 '23

@haotian-liu In my experiment, setting "use_fast=True" for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1. But I don't know why there is a mismatch when setting "use_fast=False".
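
For anyone trying this: the flag is the use_fast argument where train.py builds the tokenizer. A sketch (the path is a placeholder; the other kwargs mirror the repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/path/to/llm_weights/llava-v1.5-7b",  # placeholder path
    model_max_length=2048,
    padding_side="right",
    use_fast=True,  # train.py passes use_fast=False by default
)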

zzzzzzrc · Nov 01 '23

> @haotian-liu In my experiment, setting "use_fast=True" for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1. But I don't know why there is a mismatch when setting "use_fast=False".

@zzzzzzrc I tried setting "use_fast=True" and it works, but I'm not sure whether it will affect the final performance. Do you have any suggestions?

GuoQiushan · Nov 07 '23

Is this fixed yet?

xiechengmude · Nov 18 '23

> Is this fixed yet?

Setting use_fast=True works for my case.

GuoQiushan · Nov 19 '23

@haotian-liu This is a similar issue to the one FastChat met. The root cause is that Hugging Face introduced some bugs when dealing with added tokens. Please refer to the fix here.

ryusaeba · Dec 01 '23

In round_len = len(tokenizer(rou).input_ids), the tokenizer adds a bos (the bos of vicuna) for each round, so I wonder whether the round_len calculation is right? Thanks
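
For what it's worth, the per-round bos can be checked directly (a minimal sketch; placeholder path, slow tokenizer as in train.py):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)  # placeholder path

ids = tokenizer("USER: hi ASSISTANT: hello").input_ids
print(ids[0] == tokenizer.bos_token_id)  # True: each call prepends bos by default,
                                         # which is what the round_len offsets compensate for
print(tokenizer("USER: hi", add_special_tokens=False).input_ids)  # no bos with add_special_tokens=False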

liuhaogeng · Dec 03 '23

I encountered the "tokenization mismatch" issue during fine-tuning as well. Upon investigation, I found that it was primarily caused by empty strings in the "value" field of QA turns, e.g. {"from": "human", "value": ""}, in the dataset. As a result, the prompt ended up containing the string "xxx USER:ASSISTANT: xxxx", which led to the "tokenization mismatch" issue during the tokenization process. I'm not sure if this experience is useful, but I thought I'd share it.
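
If you suspect the same cause, a quick scan of the data file can surface such records (a sketch; assumes the mix665k layout where each sample carries a "conversations" list of {"from", "value"} turns):

import json

# Flag samples with an empty "value" turn, which can yield prompts like
# "... USER: ASSISTANT: ..." and trip the tokenization-mismatch warning.
with open("./playground/data/llava_v1_5_mix665k.json") as f:
    data = json.load(f)

bad = [
    sample.get("id", idx)
    for idx, sample in enumerate(data)
    for turn in sample.get("conversations", [])
    if not turn.get("value", "").strip()
]
print(f"{len(bad)} empty turns found; first few sample ids: {bad[:10]}")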

xxxwuwq · Feb 02 '24

Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.

I am just wondering: will it affect training? How can it be fixed for arbitrary tokenizers, not just LLaMA?

lucasjinreal · Feb 20 '24

> Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.
>
> I am just wondering: will it affect training? How can it be fixed for arbitrary tokenizers, not just LLaMA?

Hi, I have the same issue. Have you solved it?

20191864218 · Feb 22 '24

> Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work. I am just wondering: will it affect training? How can it be fixed for arbitrary tokenizers, not just LLaMA?
>
> Hi, I have the same issue. Have you solved it?

Same when using LoRA to finetune v1.6-34b.

charismaticchiu · Feb 23 '24

I have fixed the issue. You just need to make sure the inputs and targets are properly masked.
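
One way to verify the masking is to decode the tokens that survive the mask and confirm they cover exactly the assistant replies (a sketch; IGNORE_INDEX as in llava.constants, and the debug helper name is mine):

import torch

IGNORE_INDEX = -100  # llava.constants.IGNORE_INDEX

def print_supervised_text(input_ids: torch.Tensor, targets: torch.Tensor, tokenizer) -> None:
    # Whatever decodes here is what the loss actually supervises; it should
    # contain only the assistant answers (plus eos), never the prompts.
    kept = input_ids[targets != IGNORE_INDEX]
    print(tokenizer.decode(kept))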

lucasjinreal · Feb 25 '24

> I have fixed the issue. You just need to make sure the inputs and targets are properly masked.

Can you share your tokenizer settings?

BlueBlueFF · Feb 26 '24

> Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work. I am just wondering: will it affect training? How can it be fixed for arbitrary tokenizers, not just LLaMA?
>
> Hi, I have the same issue. Have you solved it?
>
> Same when using LoRA to finetune v1.6-34b.

Same when finetuning 1.5b.

gujiaqivadin · Apr 15 '24