Tokenizer: setting lstrip to False for special tokens

Open JoaoLages opened this issue 2 years ago • 5 comments

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m-py")

code = """
    # this is a code comment
    <extra_id_0>
"""

print(tokenizer.decode(tokenizer(code)["input_ids"]))

output:

<s>
    # this is a code comment<extra_id_0>
</s>

It seems that \t\n is not being encoded (or decoded) properly :(

JoaoLages avatar Sep 01 '23 10:09 JoaoLages

I just found out that \n and \t have the exact same token id 😐

tokenizer.convert_tokens_to_ids(["\n", "\t"])
Out[35]: [3, 3]

Edit: yes, they are both the UNK id

tokenizer.unk_token_id
Out[39]: 3

JoaoLages avatar Sep 01 '23 10:09 JoaoLages
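One possible reading of the [3, 3] result above, assuming the CodeT5+ tokenizer is a GPT-2/RoBERTa-style byte-level BPE (an assumption, not something confirmed in this thread): such tokenizers remap every raw byte to a printable character before the vocabulary lookup, so the literal strings "\n" and "\t" would never be vocabulary keys, and convert_tokens_to_ids would fall back to the UNK id without that meaning whitespace is lost during encoding. The sketch below reproduces the standard byte-to-character table in plain Python; it does not load or inspect the real tokenizer.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte (control characters, space, ...) is moved to 256 + k.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = bytes_to_unicode()
newline_symbol = table[ord("\n")]  # symbol standing in for "\n" in the vocab
tab_symbol = table[ord("\t")]      # symbol standing in for "\t" in the vocab
print(repr(newline_symbol), repr(tab_symbol))
```

Under that assumption, the ids to compare would be those of the remapped symbols rather than of the raw "\n" and "\t" strings.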

It seems that the problem is with \n and \t before the special tokens:

aux
Out[58]: '\t\n# this is a code comment\n\t<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
Out[59]: '<s>\t\n# this is a code comment<extra_id_0></s>'
aux
Out[62]: '\n# this is a code comment\n<extra_id_0>'
tokenizer.decode(tokenizer(aux)["input_ids"], skip_special_tokens=False)
Out[63]: '<s>\n# this is a code comment<extra_id_0></s>'

JoaoLages avatar Sep 01 '23 10:09 JoaoLages

This is happening because all <extra_id_*> tokens have lstrip set to True. Any reason for this decision?

fillassuncao avatar Sep 01 '23 13:09 fillassuncao
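To illustrate what lstrip=True means for the examples above, here is a toy model in plain Python. It is only a sketch of the documented behaviour (the token match absorbs whitespace on its left), not the tokenizers library's actual implementation; the function name split_on_special is hypothetical.

```python
import re


def split_on_special(text, special, lstrip=False, rstrip=False):
    """Toy model: split text around a special token, optionally
    letting the match swallow whitespace on either side."""
    pattern = re.escape(special)
    if lstrip:
        pattern = r"\s*" + pattern  # whitespace to the left joins the match
    if rstrip:
        pattern = pattern + r"\s*"  # whitespace to the right joins the match
    pieces = []
    pos = 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        pieces.append(special)  # only the token itself survives
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


aux = "\t\n# this is a code comment\n\t<extra_id_0>"
with_strip = "".join(split_on_special(aux, "<extra_id_0>", lstrip=True))
without_strip = "".join(split_on_special(aux, "<extra_id_0>", lstrip=False))
print(repr(with_strip))
print(repr(without_strip))
```

In this model the leading "\t\n" survives because it is not adjacent to the special token, while the "\n\t" directly before <extra_id_0> disappears, which is the same pattern as in the decoded outputs earlier in the thread.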

> This is happening because all <extra_id_*> tokens have lstrip set to True. Any reason for this decision?

Indeed, this makes things work:

from transformers import AddedToken

# Re-register every additional special token with stripping disabled
tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken(at.content, rstrip=False, lstrip=False, single_word=False, normalized=True)
            for at in tokenizer.special_tokens_map_extended["additional_special_tokens"]
        ]
    },
    replace_additional_special_tokens=True,
)

JoaoLages avatar Sep 01 '23 14:09 JoaoLages

Hi both, thanks for identifying the issue and providing the solution! We did not intentionally set lstrip to True.

yuewang-cuhk avatar Sep 05 '23 04:09 yuewang-cuhk