[POC] GPT2Tokenizer using cudf
This pull request is a Proof of Concept for GPT2Tokenizer in the file python/cudf/cudf/core/gpt2_tokenizer.py. The GPT2Tokenizer class is designed to tokenize a cuDF strings column using a CUDA GPT-2 subword tokenizer and to encode words to token IDs using a pretrained tokenizer's vocabulary.
The following code is an example of how to use it:
```python
import cudf
import pandas as pd
from transformers import GPT2Tokenizer as HFGPT2Tokenizer

from cudf.core.gpt2_tokenizer import GPT2Tokenizer

# !wget https://huggingface.co/gpt2/raw/main/merges.txt
merge_pairs = cudf.read_text("merges.txt", delimiter="\n", strip_delimiters=True)

# Load the HuggingFace tokenizer, primarily for the vocabulary
# (in the future this should be self-contained)
hf_tokenizer = HFGPT2Tokenizer.from_pretrained("gpt2")
hf_tokenizer_vocab = dict(
    sorted(hf_tokenizer.encoder.items(), key=lambda item: item[1])
)

input_data = cudf.Series(
    [
        "this is a sentence",
        " this is a sentence",
        "2.5 million data points",
        "they've succeeded, now we'll collaborate. "
        "Let's try if this works in rapids/cudf!",
        # "Words like Zoë's or café don't work"
    ]
)

# Instantiate GPT2Tokenizer with the vocabulary and the merge pairs
gpt2_tokenizer = GPT2Tokenizer(cudf.Series(hf_tokenizer_vocab.keys()), merge_pairs)
out = gpt2_tokenizer(input_data)

# Now compare with the HuggingFace output
pd.testing.assert_series_equal(
    gpt2_tokenizer(input_data).to_pandas(),
    pd.Series(hf_tokenizer.batch_encode_plus(input_data.to_pandas())["input_ids"]),
)
```
Blocker TODOs
- Regex (i.e. `self.pat`) doesn't match the GPT-2 regex because:
  - Negative lookahead is not supported, see https://github.com/rapidsai/cudf/issues/3100
  - `\p{L}` and `\p{N}` have been substituted with `\w` and `\d` because, to my knowledge, the regex engine doesn't support Unicode property classes (see the sketch after this list)
- Explicitly encode `str` as `utf-8` (this means our code fails on any non-ASCII character):
  - The BPE works on `utf-8` bytes instead of unicode code points.
  - While the underlying string column might be represented in `utf-8`, we also want to operate on a `utf-8` byte representation, such that when we call `.str.translate(..)` our mapping is a `byte -> byte` mapping instead of a unicode code point to `byte / unicode code point` mapping.
  - In Python this would be done as `"".join(chr(b) for b in word.encode("utf-8"))` (see the sketch after this list).
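Likewise, a minimal sketch of the intended byte-level behaviour; `to_byte_chars` is a hypothetical helper for illustration, not part of the PR's API:

```python
def to_byte_chars(word: str) -> str:
    # Encode to UTF-8 and re-wrap each byte as a single character (code point 0-255),
    # so downstream steps such as .str.translate(..) can use a byte -> byte mapping
    # instead of a unicode-code-point mapping.
    return "".join(chr(b) for b in word.encode("utf-8"))


print(to_byte_chars("café"))                    # 'cafÃ©' -- 'é' becomes the two bytes 0xC3 0xA9
print(len("café"), len(to_byte_chars("café")))  # 4 5
```

HuggingFace's GPT-2 tokenizer takes the same byte-level approach, but additionally remaps the 256 byte values to printable characters via its `bytes_to_unicode()` helper.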
Closing for now. We can reopen when we have the bandwidth to work on this tokenizer again.