[POC] GPT2Tokenizer using cudf
This pull request is a proof of concept for a GPT2Tokenizer in the file python/cudf/cudf/core/gpt2_tokenizer.py. The GPT2Tokenizer class tokenizes a cuDF strings column with a CUDA GPT-2 subword tokenizer and encodes words to token IDs using a pretrained tokenizer's vocabulary.
The following code is an example of how to use it:
import cudf
from transformers import GPT2Tokenizer as HFGPT2Tokenizer
from cudf.core.gpt2_tokenizer import GPT2Tokenizer
#!wget https://huggingface.co/gpt2/raw/main/merges.txt
merge_pairs = cudf.read_text("merges.txt", delimiter="\n", strip_delimiters=True)
# Load the HuggingFace tokenizer primarily for the vocabulary (in future it should be self-contained)
hf_tokenizer = HFGPT2Tokenizer.from_pretrained("gpt2")
hf_tokenizer_vocab = dict(
    sorted(hf_tokenizer.encoder.items(), key=lambda item: item[1])
)
input_data = cudf.Series(
    [
        "this is a sentence",
        " this is a sentence",
        "2.5 million data points",
        "they've succeeded, now we'll collaborate. "
        "Let's try if this works in rapids/cudf!",
        # "Words like Zoë's or café don't work"
    ]
)
# Instantiate GPT2Tokenizer
gpt2_tokenizer = GPT2Tokenizer(cudf.Series(hf_tokenizer_vocab.keys()), merge_pairs)
out = gpt2_tokenizer(input_data)
# Now compare with huggingface output
import pandas as pd
pd.testing.assert_series_equal(
    gpt2_tokenizer(input_data).to_pandas(),
    pd.Series(hf_tokenizer.batch_encode_plus(input_data.to_pandas())["input_ids"]),
)
Blocker TODOs
- The regex (i.e. self.pat) doesn't match the GPT-2 regex because:
  - Negative lookahead is not supported, i.e. https://github.com/rapidsai/cudf/issues/3100
  - \p{L} and \p{N} have been substituted with \w and \d because, to my knowledge, the regex engine doesn't support Unicode property classes (see the regex sketch after this list).
- Explicitly encode str as utf-8 (this means our code fails on any non-ASCII character).
  - The BPE works on utf-8 bytes instead of Unicode code points.
  - While the underlying string column might be represented in utf-8, we also want to operate on the utf-8 bytes, so that when we call .str.translate(..) our mapping contains a byte -> byte mapping instead of a Unicode code point to byte / Unicode code point mapping.
  - In Python this would be done as "".join(chr(b) for b in word.encode("utf-8")) (see the byte-mapping sketch after this list).
- The
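For reference, here is a small sketch contrasting the pattern GPT-2 itself uses (via the third-party regex module, which supports \p{...} classes and negative lookahead) with an approximation in the spirit of the substitution described above. The approximated pattern is illustrative only and is not necessarily the exact self.pat in this PR.

import regex  # third-party "regex" module; supports \p{...} classes and lookahead

# Pattern used by OpenAI's GPT-2 encoder and HuggingFace's GPT2Tokenizer
gpt2_pat = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Illustrative approximation without \p{L}/\p{N} or the (?!\S) lookahead.
# Note: Python's \w is Unicode-aware, so the gap looks small here; per the TODO
# above, the substitution is lossier in cudf's regex engine.
approx_pat = regex.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+""")

text = "flash   attention"
print(gpt2_pat.findall(text))    # ['flash', '  ', ' attention'] - last space sticks to the word
print(approx_pat.findall(text))  # ['flash', '   ', 'attention'] - whole run consumed by \s+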
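Likewise, a minimal pure-Python sketch of the byte-level handling described above: a word is mapped to a string whose code points are its utf-8 byte values, so that a later translate(..) table can stay byte -> byte (GPT-2's own byte encoder does something similar, e.g. remapping the space byte to 'Ġ'). The helper name to_byte_string is made up for illustration and is not the PR's actual implementation.

def to_byte_string(word: str) -> str:
    # Map each utf-8 byte to the one-character string with that code point,
    # so subsequent translate tables can operate byte -> byte.
    return "".join(chr(b) for b in word.encode("utf-8"))

byte_str = to_byte_string("café")
print([ord(c) for c in byte_str])  # [99, 97, 102, 195, 169] - 5 bytes, not 4 code points

# A byte -> byte style table for str.translate: GPT-2's byte encoder maps the
# space byte 0x20 to U+0120 ('Ġ'), which is why GPT-2 tokens show a leading 'Ġ'.
space_table = {0x20: 0x120}
print(" hello".translate(space_table))  # 'Ġhello'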