Possible inconsistencies arising from the pattern matching of CLIP tokenizer

Open Mypathissional opened this issue 1 year ago • 0 comments

Hi there,

I have been experimenting with the CLIP tokenizer and have observed that the tokenizer produces identical outputs for the following cases:

1) "1233" and "12 33"
2) "'Medicare For All" and "'M edicare For All"

In both cases, the tokenizer applies a pattern matching step using the regular expression re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", re.IGNORECASE) before applying the BPE algorithm. Within each case, this pattern splits both inputs into the same pieces: whitespace is never matched and is therefore discarded, [\p{N}] matches one digit at a time, and 'm combined with re.IGNORECASE matches the leading 'M of "'Medicare". This seems to be a logical inconsistency, particularly for case (1).
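The behaviour can be reproduced with a small sketch of the pattern matching step alone (no BPE). It is an approximation under stated assumptions: \p{L} and \p{N} are not supported by the standard-library re module, so `[^\W\d_]` stands in for \p{L} and `\d` for \p{N}, and the helper name `pre_tokenize` is hypothetical, not part of CLIP.

```python
import re

# Approximation of the current pattern using only the standard library.
# Assumption: [^\W\d_] replaces \p{L}, \d replaces \p{N}, and
# (?:[^\s\w]|_)+ replaces [^\s\p{L}\p{N}]+.
CURRENT_PAT = re.compile(
    r"<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d"
    r"|[^\W\d_]+|\d|(?:[^\s\w]|_)+",
    re.IGNORECASE,
)


def pre_tokenize(text, pat=CURRENT_PAT):
    """Return the pieces the pattern extracts; unmatched whitespace is dropped."""
    return pat.findall(text)


print(pre_tokenize("1233"))                # ['1', '2', '3', '3']
print(pre_tokenize("12 33"))               # ['1', '2', '3', '3']
print(pre_tokenize("'Medicare For All"))   # ["'M", 'edicare', 'For', 'All']
print(pre_tokenize("'M edicare For All"))  # ["'M", 'edicare', 'For', 'All']
```

Since the pieces are identical within each pair, any BPE step applied afterwards receives the same input and cannot tell the two strings apart.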

I would suggest modifying the pattern to re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s |'t |'re |'ve |'m |'ll |'d |[\p{L}]+|[\p{N}]+|[^\s\p{L}\p{N}]+""", re.IGNORECASE) to resolve this issue. This modified pattern should ensure that the tokenizer correctly handles cases where input text includes whitespace characters between numbers and preserves the integrity of the original input.
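A rough check of the suggested pattern on the two cases, again as a sketch under assumptions: the standard-library re module is used, with `[^\W\d_]` standing in for \p{L} and `\d` for \p{N}, and the names `PROPOSED_PAT` and `pre_tokenize_proposed` are hypothetical.

```python
import re

# Approximation of the suggested pattern using only the standard library.
# Assumption: [^\W\d_] replaces \p{L} and \d replaces \p{N}.
# Contraction alternatives require a trailing space; digits match as runs.
PROPOSED_PAT = re.compile(
    r"<\|startoftext\|>|<\|endoftext\|>|'s |'t |'re |'ve |'m |'ll |'d "
    r"|[^\W\d_]+|\d+|(?:[^\s\w]|_)+",
    re.IGNORECASE,
)


def pre_tokenize_proposed(text):
    """Return the pieces the suggested pattern extracts from text."""
    return PROPOSED_PAT.findall(text)


print(pre_tokenize_proposed("1233"))                # ['1233']
print(pre_tokenize_proposed("12 33"))               # ['12', '33']
print(pre_tokenize_proposed("'Medicare For All"))   # ["'", 'Medicare', 'For', 'All']
print(pre_tokenize_proposed("'M edicare For All"))  # ["'M ", 'edicare', 'For', 'All']
```

In this sketch both pairs now yield different pieces. Two side effects are visible as well: a matched contraction keeps its trailing space, and a contraction at the very end of the input (for example "it's") is no longer matched as a unit and splits into 'it', "'", 's'.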

Best Regards, Maria

Mypathissional, Apr 03 '23 11:04