Why doesn't this library share the same tokenizer api as the transformers library?
It would be easier to swap out the default tokenizer from the transformers library with this implementation if their apis were the same.
Which API specifically are you referring to? They are designed to be the same.
I am not sure which API @sabetAI refers to, but I do find that the output of encode is different, and this library doesn't implement the encode_plus and batch_encode_plus functions.
If you compare CharBPETokenizer from tokenizers with RobertaTokenizer from the transformers library, you can see that the output of encode differs: CharBPETokenizer returns an Encoding instance, while RobertaTokenizer returns a list of token ids.
from tokenizers import CharBPETokenizer
from tokenizers.processors import RobertaProcessing
from transformers import RobertaTokenizer

vocab_size = 10000  # assuming the same vocab size as the hard-coded paths below

# tokenizers-side tokenizer
char_bpe_tokenizer = CharBPETokenizer(
    f"./Hant-small-{vocab_size}/vocab.json",
    f"./Hant-small-{vocab_size}/merges.txt",
)
# add <s> ... </s> the way RoBERTa expects
char_bpe_tokenizer._tokenizer.post_processor = RobertaProcessing(
    ("</s>", char_bpe_tokenizer.token_to_id("</s>")),
    ("<s>", char_bpe_tokenizer.token_to_id("<s>")),
)

# transformers-side tokenizer built from the same files
roberta_tokenizer = RobertaTokenizer(
    "./Hant-small-10000/vocab.json",
    "./Hant-small-10000/merges.txt",
)

sent = "Hello world"
print(char_bpe_tokenizer.encode(sent))
print(roberta_tokenizer.encode(sent))
print(char_bpe_tokenizer.encode_plus(sent))  # raises AttributeError
output:
>> Encoding(num_tokens=4, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
>> [0, 3, 3157, 2023, 3, 2725, 5467, 2]
>> AttributeError: 'CharBPETokenizer' object has no attribute 'encode_plus'
Both tokenizers and transformers are built from source from GitHub.
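For reference, the Encoding object does expose the ids and related fields as attributes, so one way to compare the two outputs directly, assuming the tokenizers set up above, is roughly:

enc = char_bpe_tokenizer.encode(sent)
print(enc.ids)             # list of token ids, comparable to roberta_tokenizer.encode(sent)
print(enc.tokens)          # the corresponding token strings
print(enc.attention_mask)  # attention mask for this single sequence

But there is still no encode_plus / batch_encode_plus on the tokenizers side.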
@sabetAI A possible solution is to simply wrap the tokenizer inside a PreTrainedTokenizerFast from transformers, and you should be able to use it as if it were a PreTrainedTokenizer:
from tokenizers import CharBPETokenizer
from transformers.tokenization_utils import PreTrainedTokenizerFast


class GPT2TokenizerFast(PreTrainedTokenizerFast):
    # Wraps a CharBPETokenizer so it exposes the standard transformers
    # tokenizer interface (encode_plus, batch_encode_plus, padding, ...).
    def __init__(
        self,
        vocab_file,
        merges_file,
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
        **kwargs
    ):
        super().__init__(
            CharBPETokenizer(
                vocab_file=vocab_file,
                merges_file=merges_file,
            ),
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            mask_token=mask_token,
            **kwargs,
        )
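For example (a quick sketch, assuming the same vocab/merges files as in the snippet above), the wrapped tokenizer should then accept the usual transformers calls:

tokenizer = GPT2TokenizerFast(
    "./Hant-small-10000/vocab.json",
    "./Hant-small-10000/merges.txt",
)
print(tokenizer.encode_plus("Hello world"))  # dict with input_ids, attention_mask, ...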
@theblackcat102 that's a straightforward and good solution.
Another incompatibility issue I've been facing is that SentencePieceBPETokenizer in tokenizers is serialized as a vocabulary and a merges file, while in transformers, the ReformerTokenizer, also based on SentencePiece, expects a binary file in the format of the Google implementation.
It would be good to have some conversion tool for these formats.
@erickrf Yeah, I don't think it's possible to convert from Google SentencePiece to SentencePieceBPETokenizer in tokenizers in a straightforward/elegant way. But Google SentencePiece does provide an export function, which might allow conversion to SentencePieceBPETokenizer.
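For what it's worth, the SentencePiece Python API lets you dump the learned pieces, which at least recovers the vocabulary side (a rough sketch; the model path is just a placeholder, and a merges.txt cannot be reconstructed this way, so this alone is not enough to build a SentencePieceBPETokenizer):

import json
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")  # placeholder path to the Google SentencePiece model file

# map each piece to its id, similar in spirit to a tokenizers vocab.json
vocab = {sp.IdToPiece(i): i for i in range(sp.GetPieceSize())}
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)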
Indeed, at the moment tokenizers does not support SentencePiece models that use the Unigram model, but only the BPE model. The support for Unigram is the next thing on the roadmap and is being tracked here: https://github.com/huggingface/tokenizers/issues/53. When this is done, there will be scripts, and everything needed to help with the conversion!
I'm also struggling with the batch_encode_plus function when training a custom tokenizer and using it with the transformers library. It would be cool to have such an API call.
Will try the workaround from @theblackcat102, roughly as sketched below.
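Something along these lines, assuming the GPT2TokenizerFast wrapper above and placeholder paths for my own vocab/merges files:

tokenizer = GPT2TokenizerFast("./my-vocab.json", "./my-merges.txt")
batch = tokenizer.batch_encode_plus(["Hello world", "Goodbye world"])
print(batch["input_ids"])  # one list of token ids per input sentence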
Hey @theblackcat102 ! Can you suggest any tricks to obtain "encoding object" from RobertaTokenizer instead of a list of token ids? I'm interested in getting all the attributes [ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing].
Actually, I cannot find the pad_token_id of a HuggingFace tokenizers tokenizer. Also, when doing max-length truncation, why make it an attribute of the tokenizer via tokenizer.enable_truncation rather than an argument like tokenizer.encode(max_length=...)? Different settings may need different max_length values.
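For now, the workaround I know of, assuming a CharBPETokenizer like the one above with a <pad> token in its vocabulary, looks roughly like this:

pad_id = char_bpe_tokenizer.token_to_id("<pad>")       # pad id has to be looked up manually

char_bpe_tokenizer.enable_truncation(max_length=128)   # setting for one use case
enc_short = char_bpe_tokenizer.encode("Hello world")

char_bpe_tokenizer.enable_truncation(max_length=512)   # switch settings between calls
enc_long = char_bpe_tokenizer.encode("Hello world")

char_bpe_tokenizer.no_truncation()                     # turn truncation off again

That is, truncation is toggled on the tokenizer object between calls rather than passed per encode call.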