Why doesn't this library share the same tokenizer api as the transformers library?
It would be easier to swap out the default tokenizer from the transformers library with this implementation if their apis were the same.
Which API specifically are you referring to? They are designed to be the same.
I am not sure which API @sabetAI refers to, but I do find that the output of encode is different, and this library doesn't implement the encode_plus and batch_encode_plus functions.
If you compare CharBPETokenizer from tokenizers with RobertaTokenizer from the transformers library, you can see that the output of encode differs: CharBPETokenizer returns an Encoding instance, while RobertaTokenizer returns a list of token ids.
from tokenizers import CharBPETokenizer
from tokenizers.processors import RobertaProcessing
from transformers import RobertaTokenizer

vocab_size = 10000  # assuming the same vocab size as the hard-coded paths below

# tokenizers-side tokenizer
char_bpe_tokenizer = CharBPETokenizer(
    f"./Hant-small-{vocab_size}/vocab.json",
    f"./Hant-small-{vocab_size}/merges.txt",
)
# add <s> ... </s> the way RoBERTa expects
char_bpe_tokenizer._tokenizer.post_processor = RobertaProcessing(
    ("</s>", char_bpe_tokenizer.token_to_id("</s>")),
    ("<s>", char_bpe_tokenizer.token_to_id("<s>")),
)

# transformers-side tokenizer built from the same files
roberta_tokenizer = RobertaTokenizer(
    "./Hant-small-10000/vocab.json",
    "./Hant-small-10000/merges.txt",
)

sent = "Hello world"
print(char_bpe_tokenizer.encode(sent))
print(roberta_tokenizer.encode(sent))
print(char_bpe_tokenizer.encode_plus(sent))  # raises AttributeError
output:
>> Encoding(num_tokens=4, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
>> [0, 3, 3157, 2023, 3, 2725, 5467, 2]
>> AttributeError: 'CharBPETokenizer' object has no attribute 'encode_plus'
Both tokenizers and transformers are built from source from GitHub.
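For reference, the Encoding object does expose the ids and related fields as attributes, so one way to compare the two outputs directly, assuming the tokenizers set up above, is roughly:

enc = char_bpe_tokenizer.encode(sent)
print(enc.ids)             # list of token ids, comparable to roberta_tokenizer.encode(sent)
print(enc.tokens)          # the corresponding token strings
print(enc.attention_mask)  # attention mask for this single sequence

But there is still no encode_plus / batch_encode_plus on the tokenizers side.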
@sabetAI A possible solution is to simply wrap the tokenizer inside a PreTrainedTokenizerFast from transformers, and you should be able to use it as if it were a PreTrainedTokenizer:
from tokenizers import CharBPETokenizer
from transformers.tokenization_utils import PreTrainedTokenizerFast


class GPT2TokenizerFast(PreTrainedTokenizerFast):
    # Wraps a CharBPETokenizer so it exposes the standard transformers
    # tokenizer interface (encode_plus, batch_encode_plus, padding, ...).
    def __init__(
        self,
        vocab_file,
        merges_file,
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
        **kwargs
    ):
        super().__init__(
            CharBPETokenizer(
                vocab_file=vocab_file,
                merges_file=merges_file,
            ),
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            mask_token=mask_token,
            **kwargs,
        )
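For example (a quick sketch, assuming the same vocab/merges files as in the snippet above), the wrapped tokenizer should then accept the usual transformers calls:

tokenizer = GPT2TokenizerFast(
    "./Hant-small-10000/vocab.json",
    "./Hant-small-10000/merges.txt",
)
print(tokenizer.encode_plus("Hello world"))  # dict with input_ids, attention_mask, ...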
@theblackcat102 that's a straightforward and good solution.
Another incompatibility issue I've been facing is that SentencePieceBPETokenizer in tokenizers is serialized as a vocabulary and a merges file, while in transformers, the ReformerTokenizer, also based on SentencePiece, expects a binary file in the format of the Google implementation.
It would be good to have some conversion tool for these formats.
@erickrf Yeah, I don't think it's possible to convert from Google SentencePiece to SentencePieceBPETokenizer in tokenizers in a straightforward/elegant way. But Google SentencePiece does provide an export function, which might allow conversion to SentencePieceBPETokenizer.
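For what it's worth, the SentencePiece Python API lets you dump the learned pieces, which at least recovers the vocabulary side (a rough sketch; the model path is just a placeholder, and a merges.txt cannot be reconstructed this way, so this alone is not enough to build a SentencePieceBPETokenizer):

import json
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("spm.model")  # placeholder path to the Google SentencePiece model file

# map each piece to its id, similar in spirit to a tokenizers vocab.json
vocab = {sp.IdToPiece(i): i for i in range(sp.GetPieceSize())}
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)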
Indeed, at the moment tokenizers does not support SentencePiece models that use the Unigram model, but only the BPE model. The support for Unigram is the next thing on the roadmap and is being tracked here: https://github.com/huggingface/tokenizers/issues/53. When this is done, there will be scripts, and everything needed to help with the conversion!
I'm also struggling with the batch_encode_plus function when training a custom tokenizer and using it with the transformers library. It would be cool to have such an API call.
Will try the workaround from @theblackcat102, roughly as sketched below.
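Something along these lines, assuming the GPT2TokenizerFast wrapper above and placeholder paths for my own vocab/merges files:

tokenizer = GPT2TokenizerFast("./my-vocab.json", "./my-merges.txt")
batch = tokenizer.batch_encode_plus(["Hello world", "Goodbye world"])
print(batch["input_ids"])  # one list of token ids per input sentence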
Hey @theblackcat102 ! Can you suggest any tricks to obtain "encoding object" from RobertaTokenizer instead of a list of token ids? I'm interested in getting all the attributes [ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing].
Actually, I cannot find the pad_token_id of a HuggingFace tokenizers tokenizer. Also, when doing max-length truncation, why make it an attribute of the tokenizer via tokenizer.enable_truncation rather than an argument like tokenizer.encode(max_length=...)? Different settings may need different max_length values.
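For now, the workaround I know of, assuming a CharBPETokenizer like the one above with a <pad> token in its vocabulary, looks roughly like this:

pad_id = char_bpe_tokenizer.token_to_id("<pad>")       # pad id has to be looked up manually

char_bpe_tokenizer.enable_truncation(max_length=128)   # setting for one use case
enc_short = char_bpe_tokenizer.encode("Hello world")

char_bpe_tokenizer.enable_truncation(max_length=512)   # switch settings between calls
enc_long = char_bpe_tokenizer.encode("Hello world")

char_bpe_tokenizer.no_truncation()                     # turn truncation off again

That is, truncation is toggled on the tokenizer object between calls rather than passed per encode call.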