
[POC] GPT2Tokenizer using cudf

praateekmahajan opened this pull request • 2 comments

This pull request is a proof of concept for a GPT2Tokenizer in the file python/cudf/cudf/core/gpt2_tokenizer.py. The GPT2Tokenizer class is designed to tokenize a cuDF strings column using a CUDA GPT-2 subword tokenizer and to encode words to token ids using a pretrained tokenizer's vocabulary.

The following code is an example of how to use it:

import cudf
from transformers import GPT2Tokenizer as HFGPT2Tokenizer
from cudf.core.gpt2_tokenizer import GPT2Tokenizer

#!wget https://huggingface.co/gpt2/raw/main/merges.txt
merge_pairs = cudf.read_text("merges.txt", delimiter="\n", strip_delimiters=True)

# Load the HuggingFace tokenizer, primarily for the vocabulary (in the future this should be self-contained)
hf_tokenizer = HFGPT2Tokenizer.from_pretrained("gpt2")
hf_tokenizer_vocab = dict(
    sorted(hf_tokenizer.encoder.items(), key=lambda item: item[1])
)

input_data = cudf.Series(
    [
        "this is a sentence",
        " this is a sentence",
        "2.5 million data points",
        "they've succeeded, now we'll collaborate. "
        "Let's try if this works in rapids/cudf!",
      # "Words like Zoë's or café don't work"
    ]
)



# Instantiate GPT2Tokenizer
gpt2_tokenizer = GPT2Tokenizer(cudf.Series(list(hf_tokenizer_vocab.keys())), merge_pairs)
out = gpt2_tokenizer(input_data)



# Now compare with huggingface output
import pandas as pd

pd.testing.assert_series_equal(
    gpt2_tokenizer(input_data).to_pandas(),
    pd.Series(hf_tokenizer.batch_encode_plus(input_data.to_pandas())["input_ids"]),
)


Blocker TODOs

  1. The regex (i.e. self.pat) doesn't match the GPT-2 regex because:

    • Negative lookahead is not supported, see https://github.com/rapidsai/cudf/issues/3100
    • \p{L} and \p{N} have been substituted with \w and \d because, to my knowledge, the regex engine doesn't support Unicode property classes (a rough approximation is sketched after this list)
  2. Explicitly encode strings as UTF-8 (currently our code fails on any non-ASCII character).

    • BPE operates on UTF-8 bytes instead of Unicode code points.
    • While the underlying string column is already stored as UTF-8, we also want to operate byte-by-byte, so that when we call .str.translate(..) our mapping is a byte -> byte mapping instead of code point -> byte / code point.
    • In Python this would be done as "".join(chr(b) for b in word.encode("utf-8")) (iterating over a bytes object yields ints, hence the chr); see the byte-mapping sketch after this list.
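
For context, here is a minimal CPU-side sketch of what the two blockers refer to. GPT2_PAT and the bytes_to_unicode table follow the GPT-2 reference tokenizer (as shipped in HuggingFace transformers); CUDF_APPROX_PAT and the helper function names are hypothetical illustrations only, not part of this PR, and the approximation is deliberately not equivalent to the real pattern.

import regex as re  # third-party `regex` package; supports \p{..} classes and lookahead

# GPT-2's pre-tokenization pattern (from the reference implementation).
GPT2_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# A hypothetical cudf-friendly approximation: \p{L} -> \w, \p{N} -> \d, and the
# negative-lookahead alternative dropped. NOT equivalent to GPT2_PAT.
CUDF_APPROX_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+"""


def bytes_to_unicode():
    """GPT-2's byte -> printable-character table: every byte value 0-255 is
    assigned a unique printable Unicode character that BPE can merge over."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


byte_encoder = bytes_to_unicode()


def pretokenize_and_byte_encode(text):
    # Split with the true GPT-2 pattern, then map each token's UTF-8 bytes to
    # the characters that BPE actually merges over (a leading space becomes 'Ġ').
    return [
        "".join(byte_encoder[b] for b in tok.encode("utf-8"))
        for tok in re.findall(GPT2_PAT, text)
    ]


print(pretokenize_and_byte_encode("they've succeeded, now we'll collaborate."))
# ['they', "'ve", 'Ġsucceeded', ',', 'Ġnow', 'Ġwe', "'ll", 'Ġcollaborate', '.']

Both gaps are the reason the non-ASCII example string in the snippet above is commented out.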

praateekmahajan · Mar 23 '24

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] · Mar 23 '24


Closing for now. We can reopen when we have the bandwidth to work on this tokenizer again.

vyasr · May 19 '25