
Cannot inject custom PreTokenizer into Tokenizer

Open Old-Shatterhand opened this issue 1 year ago • 6 comments

Hey,

I want to train a Tokenizer that operates on a custom PreTokenizer. I tried a mix of this documentation post and this example. My resulting code looks like this:

from antlr4 import CommonTokenStream, InputStream
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# GlyLESLexer and GlyLESParser are the ANTLR-generated lexer/parser for the
# GlyLES grammar (their import path is omitted here).


class GlyLESPreTokenizer:
    def __init__(self, *args, **kwargs):
        pass

    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)
    
    def glyles_split(self, iupac: str):
        iupac = iupac.strip().replace(" ", "")
        token = CommonTokenStream(GlyLESLexer(InputStream(data="{" + iupac + "}")))
        GlyLESParser(token).start()
        idx = 0
        output = []
        for i in range(1, len(token.tokens) - 2):
            txt = str(token.tokens[i].text)
            output.append((txt, (idx, idx + len(txt))))
            idx += len(txt)
        return output
        
    def pre_tokenize_str(self, input_: str):
        return self.glyles_split(input_)


iupac = "QuiNAlaAc(b1-4)GalNAcA(a1-4)GalOAc(a1-2)QuiNAlaAc"

# This returns a list of 33 tokens.
GlyLESPreTokenizer().pre_tokenize_str(iupac)

# This, however, only returns a list with a single token: the entire input string.
pre_tokenizers.PreTokenizer.custom(GlyLESPreTokenizer()).pre_tokenize_str(iupac)
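
If I read the linked example correctly, PreTokenizer.custom expects the wrapped object to implement a pre_tokenize(self, pretok) method that splits a PreTokenizedString in place via pretok.split, rather than pre_tokenize_str. A sketch of what I think that interface would look like for my splitter (GlyLESCustomPreTokenizer and its helper are only illustrative, not something I have working):

from typing import List

from tokenizers import NormalizedString, PreTokenizedString


class GlyLESCustomPreTokenizer:
    def __init__(self):
        self._splitter = GlyLESPreTokenizer()

    def _split(self, i: int, normalized: NormalizedString) -> List[NormalizedString]:
        # Re-use the offsets computed by the grammar-based splitter and slice the
        # NormalizedString at those offsets, as PreTokenizedString.split expects.
        return [
            normalized[start:stop]
            for _, (start, stop) in self._splitter.glyles_split(str(normalized))
        ]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self._split)


# tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(GlyLESCustomPreTokenizer())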

The final idea is to use it in a setting like this:

tokenizer = Tokenizer(models.Model())  # placeholder for a concrete model such as models.BPE()
tokenizer.normalizer = normalizers.Strip()
tokenizer.pre_tokenizer = pre_tokenizers.PreTokenizer.custom(GlyLESPreTokenizer())

Can someone help me understand how to use the pre_tokenizers.PreTokenizer.custom method to inject a custom, Python-written PreTokenizer into a Tokenizer? Unfortunately, it is far beyond the scope of the project to convert the logic from GlyLES to Rust, so it has to be a Python PreTokenizer class that is somehow injected into the Tokenizer.

Thank you in advance for any help, comments, or feedback. Roman

Old-Shatterhand avatar Sep 23 '24 19:09 Old-Shatterhand

Hello, I don't think there is a neat solution for this particular issue, but you can work around it by using pre_tokenized=True in the call to encode and just pre-tokenizing beforehand.

tokenizer = Tokenizer(models.Model())
tokenizer.normalizer = normalizers.Strip()
pretokenizer = GlyLESPreTokenizer()

iupac = "QuiNAlaAc(b1-4)GalNAcA(a1-4)GalOAc(a1-2)QuiNAlaAc"
tokens = tokenizer.encode(pretokenizer.pre_tokenize_str(iupac), pre_tokenized=True)
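
Note that pre_tokenize_str returns (text, (start, end)) pairs, so you'd probably want to keep only the token texts before calling encode, roughly:

# pre_tokenize_str yields (text, (start, end)) pairs; with pre-tokenized input,
# encode expects a sequence of plain strings, so keep only the texts.
pre_tokens = [text for text, _ in pretokenizer.pre_tokenize_str(iupac)]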

I would wrap the whole thing in a custom class and then use that.

class MyTokenizer:

    # boilerplate (e.g. __init__ storing self._tokenizer and self._pretokenizer) goes here.

    def encode(self, string: str, *args, **kwargs) -> Encoding:
        return self._tokenizer.encode(self._pretokenizer.pre_tokenize_str(string), pre_tokenized=True)

I hope this helps!

stephantul avatar Sep 24 '24 07:09 stephantul

Thank you, that works. There is only a small correction: it's is_pretokenized=True in the tokenizer.encode(...) call.
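
For reference, a simplified sketch of the wrapper I ended up with (the class name and delegation details are illustrative):

from tokenizers import Tokenizer


class GlyLESTokenizer:
    def __init__(self, tokenizer: Tokenizer):
        self._tokenizer = tokenizer
        self._pretokenizer = GlyLESPreTokenizer()

    def encode(self, string: str, *args, **kwargs):
        # Pre-tokenize in Python, keep only the token texts, and tell the inner
        # tokenizer that the input is already pre-tokenized.
        pre_tokens = [text for text, _ in self._pretokenizer.pre_tokenize_str(string)]
        return self._tokenizer.encode(pre_tokens, is_pretokenized=True)

    def __getattr__(self, name):
        # Everything else is piped through to the wrapped tokenizer.
        return getattr(self._tokenizer, name)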

Old-Shatterhand avatar Sep 24 '24 09:09 Old-Shatterhand

This solves the problem of using a custom PreTokenizer with the tokenizers interface, but I still cannot train BPE (or other models) with it. Isn't there a way to get the PreTokenizer.custom(...) method to work?

Old-Shatterhand avatar Sep 25 '24 11:09 Old-Shatterhand

Could you share a training snippet and a log of what went wrong? 🤗

ArthurZucker avatar Sep 26 '24 16:09 ArthurZucker

Hey

Following up on my initial post: it's not a single point that breaks. It's more conceptual, and I don't see if or how it is possible in the current implementation. As described in the initial post, I couldn't inject my plain PreTokenizer with the PreTokenizer.custom method, for various reasons.

The idea posted above is the only way I found to use my pre-tokenizer with the tokenizers package; everything else didn't work. So I currently have a class holding my pre-tokenizer and a plain tokenizers.Tokenizer, and all calls to that class (except for encode and encode_batch) are piped through to the inner tokenizer.

But if I want to train a BPE model with the above implementation, I'd have to train the wrapped tokenizer. Because I call train_from_iterator on the inner tokenizer, my pre-tokenizer is ignored, and I cannot pass a pre-tokenized dataset to it either; it would be treated as a list of words, not a list of tokens. What I'm currently doing is:

from antlr4 import CommonTokenStream, InputStream
from datasets import load_dataset
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# GIFFLARLexer and GIFFLARParser are the ANTLR-generated lexer/parser for the
# grammar (their import path is omitted here).


def tokenize_function(iupac) -> list:
    iupac = iupac.strip().replace(" ", "")
    token = CommonTokenStream(GIFFLARLexer(InputStream(data="{" + iupac + "}")))
    GIFFLARParser(token).start()
    return [t.text for t in token.tokens[1:-2]]

dataset = load_dataset("text", data_files={"train": "glycans_1000.txt"})
t = Tokenizer(BPE())
t.normalizer = normalizers.Strip()
t.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = BpeTrainer(special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"], vocab_size=300)
t.train_from_iterator([tokenize_function(d["text"]) for d in dataset["train"]], trainer)

But the tokenizer learns from characters and not from my pre-computed tokens.

I'm thankful for any help.

Old-Shatterhand avatar Sep 27 '24 00:09 Old-Shatterhand

In the final line, you pass list[list[str]], but I think the iterator expects list[str]. I think you can achieve what you want by letting tokenize_function return a string: " ".join(t.text for t in token.tokens[1:-2]). That way you pre-tokenize first, then join on whitespace.

The Whitespace pre-tokenizer in the tokenizers module splits on whitespace (and at punctuation boundaries), so you basically achieve what you want, right? One issue I could see is that this would break if your tokens also contain spaces, but I'm not sure whether that is the case.
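
Concretely, that would look something like this (untested sketch, reusing the variables from your snippet):

def tokenize_function(iupac) -> str:
    iupac = iupac.strip().replace(" ", "")
    token = CommonTokenStream(GIFFLARLexer(InputStream(data="{" + iupac + "}")))
    GIFFLARParser(token).start()
    # Join the grammar tokens with spaces; the Whitespace pre-tokenizer splits them
    # apart again during training. (If your tokens contain characters that Whitespace
    # would split further, pre_tokenizers.WhitespaceSplit(), which splits on
    # whitespace only, might be the safer choice.)
    return " ".join(t.text for t in token.tokens[1:-2])


# train_from_iterator now receives one string per example instead of a list of lists.
t.train_from_iterator((tokenize_function(d["text"]) for d in dataset["train"]), trainer)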

stephantul avatar Sep 27 '24 07:09 stephantul