
[POC] GPT2Tokenizer using cudf


This pull request is a proof of concept for a GPT2Tokenizer, in the file python/cudf/cudf/core/gpt2_tokenizer.py. The GPT2Tokenizer class tokenizes a cuDF strings column with a CUDA GPT-2 subword (BPE) tokenizer and encodes words to token ids using a pretrained tokenizer's vocabulary.

The following code shows how to use it:

import cudf
from transformers import GPT2Tokenizer as HFGPT2Tokenizer
from cudf.core.gpt2_tokenizer import GPT2Tokenizer

#!wget https://huggingface.co/gpt2/raw/main/merges.txt
merge_pairs = cudf.read_text("merges.txt", delimiter="\n", strip_delimiters=True)

# Load the HuggingFace tokenizer, primarily for its vocabulary (in the
# future this should be self-contained). Sorting by token id means each
# key's position in the resulting Series equals its token id.
hf_tokenizer = HFGPT2Tokenizer.from_pretrained("gpt2")
hf_tokenizer_vocab = dict(
    sorted(hf_tokenizer.encoder.items(), key=lambda item: item[1])
)

input_data = cudf.Series(
    [
        "this is a sentence",
        " this is a sentence",
        "2.5 million data points",
        "they've succeeded, now we'll collaborate. "
        "Let's try if this works in rapids/cudf!",
      # "Words like Zoë's or café don't work"
    ]
)

# Instantiate GPT2Tokenizer
gpt2_tokenizer = GPT2Tokenizer(cudf.Series(list(hf_tokenizer_vocab.keys())), merge_pairs)
out = gpt2_tokenizer(input_data)

# Compare with the HuggingFace output
import pandas as pd

pd.testing.assert_series_equal(
    out.to_pandas(),
    pd.Series(hf_tokenizer.batch_encode_plus(input_data.to_pandas())["input_ids"]),
)
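
For readers new to BPE, here is a minimal pure-Python sketch of the merge loop a GPT-2 tokenizer has to reproduce on the GPU; merge_ranks is a hypothetical dict mapping a symbol pair to its line number in merges.txt (lower rank merges first), not an object from this PR.

def bpe(word, merge_ranks):
    # word is a tuple of symbols, e.g. ("h", "e", "l", "l", "o").
    while len(word) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merge rule left
        # Merge every non-overlapping occurrence of that pair.
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                merged.append(word[i] + word[i + 1])
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return word

# Toy example with three made-up merge rules:
ranks = {("h", "e"): 0, ("l", "l"): 1, ("he", "ll"): 2}
print(bpe(tuple("hello"), ranks))  # ('hell', 'o')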


Blocker TODOs

  1. The regex (i.e. self.pat) doesn't match the GPT-2 regex, because:

    • Negative lookahead is not supported (see https://github.com/rapidsai/cudf/issues/3100)
    • \p{L} and \p{N} have been substituted with \w and \d because, to my knowledge, the regex engine doesn't support Unicode property classes (see the pattern sketch after this list)
  2. Strings must be explicitly encoded as UTF-8 (currently the code fails on any non-ASCII character).

    • BPE operates on UTF-8 bytes rather than Unicode code points.
    • While the underlying string column may already be stored as UTF-8, we also want to operate at the byte level, so that when we call .str.translate(..) our mapping is byte -> byte rather than code point -> byte / code point.
    • In Python this is done as "".join(byte_encoder[b] for b in word.encode("utf-8")), where byte_encoder is GPT-2's byte-to-character table (see the sketch after this list).
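
For concreteness, here is the canonical GPT-2 pre-tokenization pattern next to the kind of approximation those two limitations force. The fallback pattern below is my illustration of the substitution, not necessarily the exact string in this PR.

# Canonical GPT-2 pattern: needs Unicode property classes (\p{L}, \p{N})
# and a negative lookahead ((?!\S)), neither of which the engine supports.
GPT2_PAT = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

# Approximate fallback (hypothetical): \w / \d replace \p{L} / \p{N}, and
# the whitespace alternative drops its lookahead. Note \w also matches
# digits and "_", so this already diverges from GPT-2 on some inputs.
FALLBACK_PAT = r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+"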
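
For the byte-level step, HuggingFace's GPT2Tokenizer uses a reversible table mapping every byte to a printable Unicode character (bytes_to_unicode, from openai/gpt-2) before running BPE; a minimal sketch follows. A GPU port via .str.translate(..) would need the equivalent byte -> byte table rather than this byte -> character one.

def bytes_to_unicode():
    # Printable bytes map to themselves; the remaining bytes are shifted
    # into the 256+ code-point range so every byte gets a visible stand-in.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
# é (U+00E9) becomes two bytes, 0xC3 0xA9, each mapped to one stand-in
# character, so BPE only ever sees per-byte symbols.
print("".join(byte_encoder[b] for b in "café".encode("utf-8")))  # cafÃ©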

praateekmahajan · Mar 23 '24