Bug in CLIPTokenizer input handling

🐛 Bug

Describe the bug

The output of OpenAI's CLIP tokenizer differs from the output of TorchText's CLIPTokenizer when both are given the same inputs and settings.

To Reproduce

Steps to reproduce the behavior:

Install the CLIP tokenizer from OpenAI's repo or copy the code from simple_tokenizer.py:

pip install git+https://github.com/openai/CLIP.git

Download the merge file from here: https://pytorch.s3.amazonaws.com/models/captum/clip_bpe_simple_vocab_48895.txt

I recreated the unicode setup from the bytes_to_unicode function:

from clip.simple_tokenizer import SimpleTokenizer
open_ai_tokenizer = SimpleTokenizer()

from torchtext.transforms import CLIPTokenizer as CLIPTokenizer_TorchText

# Setup test input
bpe_v = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
bpe_vocab = [chr(c) for c in bpe_v + [256 + n for n in list(range(0, 68))]]
bpe_vocab_str = " ".join(bpe_vocab) # removing the empty space makes both outputs even more different.


txt_output_open_ai = open_ai_tokenizer.encode(bpe_vocab_str)
print(txt_output_open_ai[-50:-25])

torchtext_module = CLIPTokenizer_TorchText(merges_path="clip_bpe_simple_vocab_48895.txt")
txt_output_torchtext = torchtext_module(bpe_vocab_str)
txt_output_torchtext = [int(i) for i in txt_output_torchtext]
print(txt_output_torchtext[-50:-25])

The above code outputs the following (OpenAI first, TorchText second):

[128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384, 41901, 72, 329, 72, 329, 128, 369, 128, 369, 128, 371, 128, 371]
[128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384, 41901, 128, 367, 128, 367, 128, 369, 128, 369, 128, 371, 128, 371]

Specifically, 4 of these values differ (indices -37 to -34 of the full outputs). They are shown here with one matching value on either side:

[41901, 72, 329, 72, 329, 128]
[41901, 128, 367, 128, 367, 128]
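
When both token lists have the same length, a position-by-position comparison is enough to pin down where they disagree. The following is a minimal sketch: `mismatched_positions` is a hypothetical helper, not part of either library, and the two lists are the slices printed above.

```python
from typing import List, Sequence


def mismatched_positions(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Return the positions at which two equal-length token lists differ."""
    if len(a) != len(b):
        raise ValueError("token lists must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# The two slices printed above (OpenAI first, TorchText second)
open_ai_slice = [128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384,
                 41901, 72, 329, 72, 329, 128, 369, 128, 369, 128, 371, 128, 371]
torchtext_slice = [128, 360, 128, 511, 128, 511, 128, 363, 128, 363, 328, 16384,
                   41901, 128, 367, 128, 367, 128, 369, 128, 369, 128, 371, 128, 371]

positions = mismatched_positions(open_ai_slice, torchtext_slice)
print(positions)

# The slices start at index -50, so shift to get indices into the full outputs
print([p - 50 for p in positions])
```

The length check is deliberate: if one tokenizer emits more tokens than the other, positional comparison stops being meaningful and an alignment-based diff is the better tool.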

Expected behavior

The outputs should be the same.

Environment

PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26

Python version: 3.7.13 (default, Apr 24 2022, 01:04:09)  [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.21.6
[pip3] torch==1.11.0+cu113
[pip3] torchaudio==0.11.0+cu113
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0+cu113
[conda] Could not collect
torchtext version is  0.12.0


Apr 28 '22 22:04 ProGamerGov

cc @abhinavarora

Apr 28 '22 22:04 ProGamerGov

If this is a bug, it seems to go beyond Unicode handling. I fed the contents of the BPE merges file to both the TorchText and OpenAI CLIP tokenizers as input, and their outputs deviate after a certain point.

# Same merge file as in the issue description
merges_path = "clip_bpe_simple_vocab_48895.txt"

with open(merges_path, "r", encoding="utf-8") as f:
    # Skip the first line, as OpenAI's loader does
    bpe_merges = f.read().split("\n")[1:]

# Join each merge pair into a single string to use as test input
bpe_vocab = ["".join(merge_pair.split()) for merge_pair in bpe_merges]
bpe_vocab_full = [s.replace("</w>", "").strip() for s in bpe_vocab]
txt_input = " ".join(bpe_vocab_full)

Indices 479 and 480 are different in the outputs, and then they completely diverge starting at index 1432.
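
If one tokenizer emits more tokens than the other for some word, every later position shifts, and a position-by-position comparison then reports everything after that point as different. Whether that is what happens at index 1432 is an assumption, not something confirmed here. A minimal sketch that uses Python's `difflib` to align the two outputs and list only the regions that really differ (`token_diff` is a hypothetical helper):

```python
import difflib
from typing import List, Sequence, Tuple


def token_diff(
    a: Sequence[int], b: Sequence[int]
) -> List[Tuple[str, int, int, int, int]]:
    """Align two token lists and return only the non-matching regions.

    Each entry is (tag, a_start, a_end, b_start, b_end), where tag is
    'replace', 'delete', or 'insert' as defined by difflib.
    """
    # autojunk=False keeps frequent token ids from being ignored as "junk"
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Toy example: the second list has two extra tokens in the middle
print(token_diff([1, 2, 3, 4, 5], [1, 2, 9, 9, 3, 4, 5]))
```

`SequenceMatcher` can be slow on very long sequences, so for the full merges-file input it may be worth diffing a window around the first mismatch rather than the whole output.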

Apr 29 '22 18:04 ProGamerGov

@ProGamerGov thanks for surfacing this issue. We will ask @abhinavarora to take a look, since he is the most familiar with the BPE merge logic implementation.

@ebsmothers, just wanted to make sure you are aware of this bug in our CLIPTokenizer implementation, as I believe TorchMM uses it in your model training pipeline. Is that correct?

May 03 '22 15:05 Nayef211

I looked only at the Unicode differences and found the following. OpenAI's tokenizer encodes the single character 'ĳ' with the same tokens as an i followed by a j, so the result decodes to the two-character string ij. Oddly enough, given the right tokens, it can correctly decode back to the single character. This might indicate a bug in OpenAI's code, or some difference that TorchText missed in its implementation.

out = torchtext_module("ĳ")
print(out) # [128,   367]

out = open_ai_tokenizer.encode("ĳ")
print(out) # [72, 329]

out = open_ai_tokenizer.decode(out)
print(out) # ij

out = open_ai_tokenizer.decode([128, 367])
print(out) # ĳ

Edit:

The use of the ftfy.fix_text function in OpenAI's CLIP Tokenizer's encode function seems to cause the above change:

def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()

Source: https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py#L123

ftfy.fix_text("ĳ") # outputs 'ij' instead of 'ĳ'

Specifically, the part that changes ĳ to ij is here: https://github.com/rspeer/python-ftfy/blob/main/ftfy/chardata.py#L216
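
The token ids reported above can be reproduced by hand, which helps confirm that the only difference is the ligature being expanded before tokenization. The sketch below rests on two assumptions: that no BPE merge applies to either word (consistent with the two-token outputs above), and that the first 512 vocabulary entries are the 256 bytes in `bytes_to_unicode` order followed by the same bytes with `</w>` appended. `unmerged_token_ids` is a hypothetical helper, and `unicodedata.normalize("NFKC", ...)` is used only as a standard-library stand-in that expands this ligature; it is not the mechanism ftfy uses.

```python
import unicodedata
from typing import List

# Byte order used by bytes_to_unicode in simple_tokenizer.py: the printable
# bytes first, then the remaining bytes in ascending order.
_printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
BYTE_ORDER = _printable + [b for b in range(256) if b not in _printable]


def unmerged_token_ids(word: str) -> List[int]:
    """Token ids for a single word, assuming no BPE merge applies to it.

    Ids 0-255 are single bytes; ids 256-511 are the same bytes followed by
    the end-of-word marker </w>, which only the last byte of a word gets.
    """
    ids = [BYTE_ORDER.index(b) for b in word.encode("utf-8")]
    if ids:
        ids[-1] += 256
    return ids


print("ĳ".encode("utf-8"))                 # b'\xc4\xb3'
print(unmerged_token_ids("ĳ"))             # [128, 367]
print(unmerged_token_ids("ij"))            # [72, 329]
print(unicodedata.normalize("NFKC", "ĳ"))  # ij
```

Under these assumptions, [128, 367] is simply the two UTF-8 bytes of 'ĳ', and [72, 329] is the ASCII pair i, j, which is what one would expect if the text is rewritten to ij before the BPE step runs.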

May 06 '22 19:05 ProGamerGov

The ftfy library function is also responsible for the divergence in normal text inputs.

Looking at the docs for the library, it appears to be a heuristic algorithm for repairing broken Unicode: https://ftfy.readthedocs.io/en/latest/

ftfy fixes Unicode that’s broken in various ways.

The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code.

This is different from taking in non-Unicode and outputting Unicode, which is not a goal of ftfy. It also isn’t designed to protect you from having to write Unicode-aware code. ftfy helps those who help themselves.

Of course you’re better off if your input is decoded properly and has no glitches. But you often don’t have any control over your input; it’s someone else’s mistake, but it’s your problem now. ftfy will do everything it can to fix the problem.

ftfy is a heuristic that was designed (not machine-learned) by Robyn Speer, at Luminoso.

The project has a GitHub page here: https://github.com/rspeer/python-ftfy

I'm not sure if this library function is helping or hurting the training process, but it appears to be the reason why there's a difference.

I'm not sure how the input preprocessing differs from TorchText, but this is what OpenAI uses:

from typing import List, Union

import torch
import html
import ftfy
import regex as re

@torch.jit.ignore
def openai_clip_clean(x: Union[str, List[str]]) -> Union[str, List[str]]:
    """
    Preprocess text strings as per OpenAI's standard CLIP preprocessing / cleaning.

    See here for more information:
    https://ftfy.readthedocs.io/en/latest/
    https://docs.python.org/3/library/html.html#html.unescape
    https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py

    Args:
        x (str or list of str): One or more strings to be cleaned.

    Returns:
        x (str or list of str): The cleaned string, or a list of cleaned
            strings if a list was given.
    """
    is_str_input = isinstance(x, str)
    # Copy list inputs so the caller's list is not modified in place
    x = [x] if is_str_input else list(x)
    assert all(isinstance(s, str) for s in x)
    for i in range(len(x)):
        # Heuristic unicode fixing (ex: mojibake)
        x[i] = ftfy.fix_text(x[i])

        # Convert named & numeric character references in HTML to unicode
        x[i] = html.unescape(html.unescape(x[i]))

        # Remove duplicate whitespaces
        x[i] = re.sub(r"\s+", " ", x[i].strip()).strip()

        # Only use lowercase characters
        x[i] = x[i].lower()
    x = x[0] if is_str_input else x
    return x


text = openai_clip_clean(text)
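
To find which inputs would be tokenized differently because of this cleaning step, one option is to flag every string that the cleaner changes. A minimal sketch: `changed_by_cleaning` and `stdlib_clean` are hypothetical names, and `stdlib_clean` deliberately leaves out `ftfy.fix_text` so that the example runs with the standard library alone. Passing `openai_clip_clean` from above as `clean` would include the ftfy step.

```python
import html
import re
from typing import Callable, Iterable, List, Tuple


def stdlib_clean(text: str) -> str:
    """Stand-in cleaner: HTML unescape, whitespace collapse, lowercase."""
    text = html.unescape(html.unescape(text))
    text = re.sub(r"\s+", " ", text.strip())
    return text.lower()


def changed_by_cleaning(
    texts: Iterable[str], clean: Callable[[str], str] = stdlib_clean
) -> List[Tuple[str, str]]:
    """Return (original, cleaned) pairs for every text the cleaner modifies."""
    pairs = ((t, clean(t)) for t in texts)
    return [(orig, cleaned) for orig, cleaned in pairs if orig != cleaned]


samples = ["a photo of a cat", "A  photo &amp;amp; more"]
for original, cleaned in changed_by_cleaning(samples):
    print(repr(original), "->", repr(cleaned))
```

Strings that come back unchanged should not be affected by the cleaning step, so any remaining mismatch for them would point at the BPE logic instead.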

May 06 '22 20:05 ProGamerGov

@abhinavarora would you mind taking a look at this?

May 19 '22 17:05 Nayef211
