
[Usage] tokenization mismatch when finetuning v1.5-7b

Open · Liu0329 opened this issue 1 year ago · 23 comments

Describe the issue

Issue: I have found some threads reporting the tokenization mismatch problem, but I am still confused. I downloaded the v1.5-7b weights from https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main and finetuned on the datasets from the paper. I adapted the command line to make it run on a V100. tokenizers.__version__ == '0.14.1'

Command:

WANDB_MODE=disabled deepspeed llava/train/train.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path /path/to/llm_weights/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower /path/to/llm_weights/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /path/to/llm_weights/llava-v1.5-7b/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 False \
    --fp16 True \
    --output_dir ./checkpoints/llava-v1.5-7b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True

Screenshots: [WeCom screenshot of the warning]

Liu0329 · Oct 25 '23

Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?

yuyq96 · Oct 25 '23

> Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?

The truth is that "USER" is tokenized as [11889] in the middle of the prompt, but as [1, 3148, 1001] at the head (with an automatically added bos token).

I tried to fix this WARNING by:

cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
...
round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
...
round_len = len(tokenizer(rou).input_ids) - 2 + 1
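
For reference, a quick way to reproduce the discrepancy (a minimal sketch; the model path is a placeholder, and the exact ids depend on the transformers/tokenizers versions, which is the crux of this thread):

from transformers import AutoTokenizer

# Placeholder path; load the same slow tokenizer that train.py uses (use_fast=False).
tokenizer = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)

print(tokenizer("USER: hi").input_ids)        # at the head: bos + "USER" split in two, e.g. [1, 3148, 1001, ...]
print(tokenizer("hello USER: hi").input_ids)  # mid-prompt: "USER" can come out as the single id 11889
print(tokenizer("hi</s>").input_ids)          # whether the literal </s> maps to eos (id 2) is version-dependent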

yuyq96 · Oct 25 '23

> Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?

I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added? [WeCom screenshot]

Liu0329 · Oct 25 '23

> Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?
>
> I did some debugging. In this case, 136 is the actual length of input_ids, and there are three 2s in input_ids, which should be </s>. So what does 138 mean? Should two more </s> be added? [WeCom screenshot]

You can check my last modification; the mismatch is due to the different tokenization results for "USER" and the missing </s>.

yuyq96 · Oct 25 '23

> Same problem. I found that </s> is not included when calculating round_len, since it is used to split the rounds. Could the problem be that the eos token is not added automatically?
>
> The truth is that "USER" is tokenized as [11889] in the middle of the prompt, but as [1, 3148, 1001] at the head (with an automatically added bos token).
>
> I tried to fix this WARNING by:
>
> cur_len = 1 + 1  # 1 for bos, and 1 for compensating in the first round
> ...
> round_len = len(tokenizer_image_token(rou, tokenizer)) - 2 + 1  # -2 for the extra tokens in tokenizing "USER", +1 for the missing "</s>"
> ...
> round_len = len(tokenizer(rou).input_ids) - 2 + 1

Awesome! I tested your change, and it worked. So the problem is caused by both USER and </s>. To clarify, the change should be made in the preprocess_v1 method in llava/train/train.py. @haotian-liu please double-check.
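
For anyone applying this, a rough sketch of how the patched masking loop in preprocess_v1 might look (paraphrased, text-only branch without <image>; IGNORE_INDEX is -100 as in llava.constants; the offsets assume the tokenizers 0.14.x slow-tokenizer behavior discussed above):

import torch

IGNORE_INDEX = -100  # llava.constants.IGNORE_INDEX

def mask_targets_v1(conversation: str, target: torch.Tensor, tokenizer) -> None:
    # Rounds are split on "</s>" (conv.sep2), which is why the eos token itself
    # never shows up in len(tokenizer(rou).input_ids).
    rounds = conversation.split("</s>")
    sep = " ASSISTANT: "  # conv.sep + conv.roles[1] + ": "
    cur_len = 1 + 1  # 1 for bos, 1 compensating the first round (yuyq96's fix)
    target[:cur_len] = IGNORE_INDEX
    for rou in rounds:
        if rou == "":
            break
        parts = rou.split(sep)
        if len(parts) != 2:
            break
        parts[0] += sep
        # -2: "USER" tokenizes differently mid-prompt; +1: the "</s>" removed by split()
        round_len = len(tokenizer(rou).input_ids) - 2 + 1
        instruction_len = len(tokenizer(parts[0]).input_ids) - 2
        target[cur_len : cur_len + instruction_len] = IGNORE_INDEX  # mask the human turn
        cur_len += round_len
    target[cur_len:] = IGNORE_INDEX
    # train.py compares cur_len with the true sequence length here and prints
    # the "tokenization mismatch" warning when they disagree.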

Liu0329 · Oct 25 '23

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Liu0329 · Oct 25 '23

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Yes, this leads to different tokenization results with the LLaMA tokenizer.

yuyq96 · Oct 25 '23

@yuyq96 "The truth is "USER" will be tokenized as [11889] in the middle of the prompt, but tokenized as [1, 3148, 1001] in the head". So it seems the problem is caused by missing space before and after </s> ? 企业微信截图_16982203487382

Yes, this will lead to different tokenization results with LLaMA tokenizer.

For the above case, can the tokenizer correctly separate "No" (or other words) from the following </s>? If not, training would be harmed, so the better solution would be to modify the prompt. I tried inserting spaces before and after </s>, but the mismatch appeared again with the original code.
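
For reference, the effect of the extra spaces can be inspected directly (a minimal sketch; the exact ids differ between tokenizers <0.14 and >=0.14, which is precisely the behavior change being chased here):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)  # placeholder path

# Compare how text around a literal "</s>" is split with and without spaces.
for text in ("No</s>", "No </s>", "No </s> "):
    print(repr(text), "->", tokenizer(text).input_ids)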

Liu0329 · Oct 25 '23

Hi, we have temporarily pinned the tokenizer version to "tokenizers>=0.12.1,<0.14" until we figure out what changed in 0.14.

You may run pip install "tokenizers>=0.12.1,<0.14" and try again. Thanks.

haotian-liu · Oct 25 '23

@yuyq96 Thanks for the fix; I'll look into this issue. Might this fix cause issues with earlier tokenizer versions? I feel there were some behavioral changes in the tokenizer.

haotian-liu · Oct 25 '23

> Hi, we have temporarily pinned the tokenizer version to "tokenizers>=0.12.1,<0.14" until we figure out what changed in 0.14.
>
> You may run pip install "tokenizers>=0.12.1,<0.14" and try again. Thanks.

Thanks, downgrading tokenizers to 0.12.1 and transformers to 4.31.0 solved the problem. I also tried inserting spaces before and after </s>, and the warning appeared again; I don't know why the extra spaces don't work.

Liu0329 · Oct 26 '23

@haotian-liu In my experiment, setting "use_fast=True" for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1. But I don't know why there is a mismatch when setting "use_fast=False".
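
For anyone trying this: the flag is the use_fast argument where train.py builds the tokenizer. A sketch (the path is a placeholder; the other kwargs mirror the repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/path/to/llm_weights/llava-v1.5-7b",  # placeholder path
    model_max_length=2048,
    padding_side="right",
    use_fast=True,  # train.py passes use_fast=False by default
)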

zzzzzzrc · Nov 01 '23

> @haotian-liu In my experiment, setting "use_fast=True" for the tokenizer works, with transformers==4.34.1 and tokenizers==0.14.1. But I don't know why there is a mismatch when setting "use_fast=False".

@zzzzzzrc I tried setting "use_fast=True" and it works, but I'm not sure whether it will affect the final performance. Do you have any suggestions?

GuoQiushan · Nov 07 '23

Is this fixed yet?

xiechengmude · Nov 18 '23

> Is this fixed yet?

Setting use_fast=True works for my case.

GuoQiushan · Nov 19 '23

@haotian-liu This is a similar issue to the one FastChat met. The root cause is that Hugging Face introduced some bugs when dealing with added tokens. Please refer to the fix here.

ryusaeba · Dec 01 '23

In round_len = len(tokenizer(rou).input_ids), the tokenizer adds a bos (the bos of vicuna) for each round, so I wonder whether the round_len calculation is right? Thanks
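
For what it's worth, the per-round bos can be checked directly (a minimal sketch; placeholder path, slow tokenizer as in train.py):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/llm_weights/llava-v1.5-7b", use_fast=False)  # placeholder path

ids = tokenizer("USER: hi ASSISTANT: hello").input_ids
print(ids[0] == tokenizer.bos_token_id)  # True: each call prepends bos by default,
                                         # which is what the round_len offsets compensate for
print(tokenizer("USER: hi", add_special_tokens=False).input_ids)  # no bos with add_special_tokens=False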

liuhaogeng · Dec 03 '23

I encountered the "tokenization mismatch" issue during fine-tuning as well. Upon investigation, I found that it was primarily caused by empty strings in the "value" field of QA turns, e.g. {"from": "human", "value": ""}, in the dataset. As a result, the prompt ended up containing the string "xxx USER:ASSISTANT: xxxx", which led to the "tokenization mismatch" issue during the tokenization process. I'm not sure if this experience is useful, but I thought I'd share it.
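
If you suspect the same cause, a quick scan of the data file can surface such records (a sketch; assumes the mix665k layout where each sample carries a "conversations" list of {"from", "value"} turns):

import json

# Flag samples with an empty "value" turn, which can yield prompts like
# "... USER: ASSISTANT: ..." and trip the tokenization-mismatch warning.
with open("./playground/data/llava_v1_5_mix665k.json") as f:
    data = json.load(f)

bad = [
    sample.get("id", idx)
    for idx, sample in enumerate(data)
    for turn in sample.get("conversations", [])
    if not turn.get("value", "").strip()
]
print(f"{len(bad)} empty turns found; first few sample ids: {bad[:10]}")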

xxxwuwq · Feb 02 '24

Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.

I am just wondering: will it affect training? How can it be fixed for arbitrary tokenizers, not just LLaMA?

lucasjinreal · Feb 20 '24

> Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work.
>
> I am just wondering: will it affect training? How can it be fixed for arbitrary tokenizers, not just LLaMA?

Hi, I have the same issue. Have you solved it?

20191864218 · Feb 22 '24

> Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work. I am just wondering: will it affect training? How can it be fixed for arbitrary tokenizers, not just LLaMA?
>
> Hi, I have the same issue. Have you solved it?

Same when using LoRA to finetune v1.6-34b.

charismaticchiu · Feb 23 '24

I have fixed the issue. You just need to make sure the inputs and targets are properly masked.
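
One way to verify the masking is to decode the tokens that survive the mask and confirm they cover exactly the assistant replies (a sketch; IGNORE_INDEX as in llava.constants, and the debug helper name is mine):

import torch

IGNORE_INDEX = -100  # llava.constants.IGNORE_INDEX

def print_supervised_text(input_ids: torch.Tensor, targets: torch.Tensor, tokenizer) -> None:
    # Whatever decodes here is what the loss actually supervises; it should
    # contain only the assistant answers (plus eos), never the prompts.
    kept = input_ids[targets != IGNORE_INDEX]
    print(tokenizer.decode(kept))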

lucasjinreal · Feb 25 '24

> I have fixed the issue. You just need to make sure the inputs and targets are properly masked.

Can you share your tokenizer settings?

BlueBlueFF · Feb 26 '24

> Hi, I am training LLaVA with Qwen2 and got the same mismatch. Setting use_fast=True does not work. I am just wondering: will it affect training? How can it be fixed for arbitrary tokenizers, not just LLaMA?
>
> Hi, I have the same issue. Have you solved it?
>
> Same when using LoRA to finetune v1.6-34b.

Same when finetuning 1.5b.

gujiaqivadin · Apr 15 '24