Why doesn't this library share the same tokenizer api as the transformers library?

Open sabetAI opened this issue 5 years ago • 9 comments

It would be easier to swap out the default tokenizer from the transformers library with this implementation if their APIs were the same.

sabetAI avatar May 06 '20 02:05 sabetAI

Which API specifically are you referring to? They are designed to be the same.

julien-c avatar May 06 '20 14:05 julien-c

I am not sure which API @sabetAI refers to, but I do find that the output of encode is different, and this library doesn't implement the encode_plus and batch_encode_plus functions.

If you compare CharBPETokenizer from tokenizers and RobertaTokenizer from the transformers library, you can see that the output of encode differs: CharBPETokenizer returns an Encoding instance, while RobertaTokenizer returns a list of token ids.

from tokenizers import CharBPETokenizer
from tokenizers.processors import RobertaProcessing
from transformers import RobertaTokenizer

vocab_size = 10000  # matches the "Hant-small-10000" directory used below

char_bpe_tokenizer = CharBPETokenizer(
    f"./Hant-small-{vocab_size}/vocab.json",
    f"./Hant-small-{vocab_size}/merges.txt",
)
char_bpe_tokenizer._tokenizer.post_processor = RobertaProcessing(
    ("</s>", char_bpe_tokenizer.token_to_id("</s>")),
    ("<s>", char_bpe_tokenizer.token_to_id("<s>")),
)
roberta_tokenizer = RobertaTokenizer(
    "./Hant-small-10000/vocab.json",
    "./Hant-small-10000/merges.txt",
)
sent = "Hello world"
print(char_bpe_tokenizer.encode(sent))
print(roberta_tokenizer.encode(sent))
print(char_bpe_tokenizer.encode_plus(sent))

output:

>> Encoding(num_tokens=4, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
>> [0, 3, 3157, 2023, 3, 2725, 5467, 2]
>> AttributeError: 'CharBPETokenizer' object has no attribute 'encode_plus'

Both tokenizers and transformers are built from source from GitHub.

theblackcat102 avatar May 08 '20 03:05 theblackcat102
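
For reference, the Encoding object above already carries everything the plain id list does; a minimal sketch, continuing the snippet from this comment, of how to pull out the individual fields:

# The Encoding returned by tokenizers exposes the id list (and more) as
# plain attributes, so it can be compared directly with the transformers output.
encoding = char_bpe_tokenizer.encode(sent)
print(encoding.ids)              # list of token ids, comparable to roberta_tokenizer.encode(sent)
print(encoding.tokens)           # token strings
print(encoding.offsets)          # (start, end) character offsets per token
print(encoding.attention_mask)   # attention mask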

@sabetAI A possible solution is to simply wrap a tokenizers tokenizer inside a PreTrainedTokenizerFast from transformers; you should then be able to use it as if it were a PreTrainedTokenizer:

from tokenizers import CharBPETokenizer
from transformers import PreTrainedTokenizerFast

class GPT2TokenizerFast(PreTrainedTokenizerFast):

    def __init__(
        self,
        vocab_file,
        merges_file,
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
        **kwargs
    ):
        super().__init__(
            CharBPETokenizer(
                vocab_file=vocab_file,
                merges_file=merges_file,
            ),
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            mask_token=mask_token,
            **kwargs,
        )

theblackcat102 avatar May 08 '20 16:05 theblackcat102
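
A minimal usage sketch for the wrapper above (the exact padding/truncation keyword arguments vary across transformers versions, and it assumes "<pad>" exists in the vocabulary):

# Instantiate the wrapper with the same vocab/merges files as before,
# then use the familiar transformers-style methods.
tok = GPT2TokenizerFast(
    "./Hant-small-10000/vocab.json",
    "./Hant-small-10000/merges.txt",
)

enc = tok.encode_plus("Hello world")
print(enc["input_ids"], enc["attention_mask"])

batch = tok.batch_encode_plus(
    ["Hello world", "Another sentence"],
    max_length=16,
    truncation=True,
    padding=True,
)
print(batch["input_ids"])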

@theblackcat102 that's a straightforward and good solution.

Another incompatibility issue I've been facing is that SentencePieceBPETokenizer in tokenizers is serialized as a vocabulary and a merges file, while in transformers, the ReformerTokenizer, also based on SentencePiece, expects a binary file in the format of the Google implementation. It would be good to have some conversion tool for these formats.

erickrf avatar May 23 '20 21:05 erickrf

@erickrf Yeah, I don't think it's possible to convert from Google SentencePiece to SentencePieceBPETokenizer in tokenizers in a straightforward/elegant way. But Google SentencePiece does provide an export function which might allow conversion to SentencePieceBPETokenizer.

theblackcat102 avatar May 26 '20 05:05 theblackcat102
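
A rough sketch of the export idea mentioned above, assuming a local BPE-trained SentencePiece model file named spm.model; it only recovers the pieces and their scores, not the BPE merge order, so it is a starting point rather than a full conversion tool:

import json
import sentencepiece as spm

# Load the Google SentencePiece model and dump its vocabulary to a vocab.json.
sp = spm.SentencePieceProcessor()
sp.load("spm.model")

vocab = {sp.id_to_piece(i): i for i in range(sp.get_piece_size())}
scores = {sp.id_to_piece(i): sp.get_score(i) for i in range(sp.get_piece_size())}

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)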

Indeed, at the moment tokenizers does not support SentencePiece models that use the Unigram model, but only the BPE model. The support for Unigram is the next thing on the roadmap and is being tracked here: https://github.com/huggingface/tokenizers/issues/53. When this is done, there will be scripts, and everything needed to help with the conversion!

n1t0 avatar May 27 '20 01:05 n1t0

I'm also struggling with the batch_encode_plus function when training a custom tokenizer and using it with the transformers library. It would be cool to have such an API call.

I'll try the workaround from @theblackcat102.

miketrimmel avatar Jun 01 '20 16:06 miketrimmel
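
In the meantime, tokenizers has a rough native near-equivalent of batch_encode_plus: enable padding/truncation on the tokenizer and call encode_batch. A sketch, assuming the same vocab/merges files as above and a "<pad>" token in the vocabulary:

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer(
    "./Hant-small-10000/vocab.json",
    "./Hant-small-10000/merges.txt",
)
# Configure truncation/padding once, then encode a whole batch at a time.
tokenizer.enable_truncation(max_length=16)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("<pad>"), pad_token="<pad>")

encodings = tokenizer.encode_batch(["Hello world", "Another sentence"])
print(encodings[0].ids, encodings[0].attention_mask)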

Hey @theblackcat102! Can you suggest any tricks to obtain an "encoding object" from RobertaTokenizer instead of a list of token ids? I'm interested in getting all the attributes: [ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing].

ankushjain2001 avatar Dec 01 '20 20:12 ankushjain2001
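
A sketch of one way to get at those fields, assuming the fast variant (RobertaTokenizerFast) and a reasonably recent transformers; the slow RobertaTokenizer cannot return offsets:

from transformers import RobertaTokenizerFast

tok = RobertaTokenizerFast.from_pretrained("roberta-base")
out = tok(
    "Hello world",
    return_offsets_mapping=True,       # offsets are only available on fast tokenizers
    return_special_tokens_mask=True,
    return_token_type_ids=True,
)
print(out["input_ids"], out["offset_mapping"], out["special_tokens_mask"])

# The underlying tokenizers.Encoding objects are also accessible:
print(out.encodings[0].tokens)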

Actually, I cannot find the pad_token_id of a HuggingFace tokenizers tokenizer. Also, for max-length truncation, why make it an attribute of the tokenizer via tokenizer.enable_truncation rather than tokenizer.encode(max_length=), since different settings may need a different max_length?

Hannibal046 avatar Jul 27 '21 14:07 Hannibal046
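
For the two points above, a minimal sketch: the pad id can be looked up with token_to_id, and truncation is stateful on the tokenizer but can be reconfigured (or disabled) between calls. Assumes a "<pad>" token in the vocabulary and the same vocab/merges files as earlier in the thread:

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer(
    "./Hant-small-10000/vocab.json",
    "./Hant-small-10000/merges.txt",
)

# Equivalent of pad_token_id: look the token up in the vocabulary.
pad_token_id = tokenizer.token_to_id("<pad>")

# Truncation is configured on the tokenizer, but can be changed per call site.
tokenizer.enable_truncation(max_length=128)
long_enc = tokenizer.encode("some long text " * 50)

tokenizer.enable_truncation(max_length=32)   # different setting for the next call
short_enc = tokenizer.encode("some long text " * 50)

tokenizer.no_truncation()                    # disable truncation again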

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 27 '24 01:05 github-actions[bot]