
Extraneous newlines in lmsys/fastchat-t5-3b-v1.0 tokenizer

Open · bradfox2 opened this issue 2 years ago · 2 comments

The Vicuna tokenizer round-trips the text with no extra '\n' characters, while the FastChat-T5 tokenizer inserts one after every word when decoding.

Reproduce:

from transformers import T5TokenizerFast, LlamaTokenizer

text = 'I am a dog and i dont like cats'

# FastChat-T5 tokenizer (fast variant)
t = T5TokenizerFast.from_pretrained('lmsys/fastchat-t5-3b-v1.0')
print(t(text)['input_ids'])               # token IDs
print(t.decode(t(text)['input_ids']))     # decoded text contains stray '\n'

# Vicuna (LLaMA) tokenizer for comparison
t2 = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-delta-v0')
print(t2.decode(t2.encode(text)))         # round-trips cleanly

From T5: 'I\n am\n a\n dog\n and\n i\n dont\n like\n cats'
From Vicuna-Llama: 'I am a dog and i dont like cats'

This seems like a bug. Is there a purpose for this?

bradfox2 · May 08 '23 04:05

@DachengLi1 can explain this better. I guess you can use this argument in decode: https://github.com/lm-sys/FastChat/blob/ea6c7b6da47d15d6e3264d0abba7b8d1090479a4/fastchat/serve/huggingface_api.py#L46
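
The argument on that line is spaces_between_special_tokens=False. A minimal sketch of passing it to decode for the example above, assuming the slow T5Tokenizer (where this flag is handled):

from transformers import T5Tokenizer

# Slow (SentencePiece-based) tokenizer for the FastChat-T5 checkpoint.
tok = T5Tokenizer.from_pretrained('lmsys/fastchat-t5-3b-v1.0')
ids = tok('I am a dog and i dont like cats')['input_ids']

# By default decode() puts a separator between special tokens (including the
# added whitespace token); disabling that should preserve the original spacing.
print(tok.decode(ids, spaces_between_special_tokens=False))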

merrymercy · May 08 '23 08:05

Thanks for the response. I'm more concerned about training a bunch of newlines into the model when using the provided tokenizer.

Removing the intermediate newlines from the output, or simply using the Flan series of tokenizers, works fine for decoding/inference.
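
For instance, a quick post-processing workaround for the fast-tokenizer output above (just a sketch, not how FastChat itself decodes):

from transformers import T5TokenizerFast

t = T5TokenizerFast.from_pretrained('lmsys/fastchat-t5-3b-v1.0')
decoded = t.decode(t('I am a dog and i dont like cats')['input_ids'])
print(decoded.replace('\n', ''))   # strips the newlines inserted between words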

bradfox2 · May 08 '23 14:05

Good point on the T5Tokenizer!

Firstly, we use T5Tokenizer instead of T5TokenizerFast, and there is a difference between the two (see the related HF discussion thread). If we encode the sentence with T5Tokenizer, we find:

[Screenshot: input_ids produced by T5Tokenizer for the example sentence]

where 32106 is actually the whitespace token, not a newline. [Screenshot: decoding ID 32106 yields whitespace]
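
A quick way to check this yourself (32106 is the ID from the screenshot; it may differ between tokenizer versions):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained('lmsys/fastchat-t5-3b-v1.0')
print(tok('I am a dog and i dont like cats')['input_ids'])   # the interleaved 32106 IDs
print(tok.convert_ids_to_tokens([32106]))                    # a whitespace token, not '\n'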

Lastly, we use T5Tokenizer to support decoding with special tokens. @merrymercy is right: we have to add spaces_between_special_tokens=False when decoding. In particular, we treat whitespace as a special token because the SentencePiece model for T5 collapses consecutive whitespaces into a single one. This is not ideal if we want to output indentation-sensitive text (e.g., code). There is an open issue on this.
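
As a sketch of that whitespace-collapsing behavior, using the stock t5-base tokenizer (which lacks the added whitespace token):

from transformers import T5Tokenizer

base = T5Tokenizer.from_pretrained('t5-base')
code = 'def f():\n    return 1'
# The T5 SentencePiece model collapses runs of whitespace on encode,
# so the indentation cannot be recovered on decode.
print(base.decode(base(code)['input_ids'], skip_special_tokens=True))
# expected: roughly 'def f(): return 1' (indentation lost)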

@bradfox2 Let me know if there are any further questions!

DachengLi1 · May 09 '23 18:05

@DachengLi1 Thank you for the answer. I was not aware of that difference between the standard and fast tokenizers. It makes sense now.

bradfox2 · May 09 '23 19:05