Stephan Tulkens

28 comments by Stephan Tulkens

Here you go: https://huggingface.co/stephantulkens/large_tokenizer/tree/main

Nice! In the meantime we've just added the tokens as regular tokens, which is a lot faster and also kind of works (but requires manually editing the JSON 😆...
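
For what it's worth, the same thing can be done programmatically; a minimal sketch, assuming the `transformers` wrapper (`gpt2` and `<my_token>` are placeholder names):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
# Register new strings as regular (non-special) added tokens,
# rather than hand-editing tokenizer.json.
tok.add_tokens(["<my_token>"])
print(tok.tokenize("hi <my_token>"))  # '<my_token>' surfaces as a single piece
```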

Hey! Tokenizers generally differentiate between tokens occurring at the start of a string and in the middle of a string. In your case, the token `筹` matches only at the...
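
A quick way to see the distinction, sketched with `gpt2`'s byte-level BPE (an illustrative choice, not the tokenizer from the issue):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("hello"))      # ['hello']          start-of-string form
print(tok.tokenize("say hello"))  # ['say', 'Ġhello']  mid-string form, with the 'Ġ' space marker
```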

Hey, I ran into this issue, and wrote a blog post about it: https://stephantul.github.io/python/tokenizers/2023/03/16/bpe/
You can't directly take the byte representation of a token from the vocabulary. Basically, you have...
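
To make the pitfall concrete, a minimal illustration; `Ġhello` stands in for a typical byte-level (GPT-2 style) vocabulary entry for " hello":

```python
# 'Ġhello' is how a byte-level BPE vocabulary stores " hello":
token = "Ġhello"

# Naively encoding the vocabulary string does NOT recover the original bytes.
print(token.encode("utf-8"))  # b'\xc4\xa0hello'
print(b" hello")              # the bytes the token actually represents
```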

Hello @joein, we're still interested in contributing this to the library. Let me know if you'd like a PR to add them. FYI: I think we can just add...

Unless I misunderstand, I think this is supported. You can split by a Regex or a string by using a split pre-tokenizer:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Split on runs of digits; "isolated" keeps the matches as separate pieces.
# (The pattern is illustrative; any Regex or plain string works.)
pre_tokenizer = Split(Regex(r"\d+"), behavior="isolated")
```
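
A quick check of what that pre-tokenizer produces (the sample string is made up):

```python
print(pre_tokenizer.pre_tokenize_str("abc123def"))
# [('abc', (0, 3)), ('123', (3, 6)), ('def', (6, 9))]
```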

Hello, yes there is! You need a mapping from characters to bytes, and to skip any pre-tokenization and normalization steps.

```python
from tokenizers import Tokenizer


def bytes_to_unicode() -> dict[int, str]:
    """Converts bytes to printable unicode characters (GPT-2's byte-level mapping)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        # Remap control and whitespace bytes to unused printable code points.
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))
```
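
And a sketch of how the table is used (the sample text is arbitrary): every byte of the UTF-8 input gets replaced by a printable stand-in before tokenization.

```python
table = bytes_to_unicode()
text = "héllo"
mapped = "".join(table[b] for b in text.encode("utf-8"))
print(mapped)  # 'hÃ©llo' -- every raw byte now has a printable stand-in
```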

Hello, I'm not one of the maintainers, but I can't seem to reproduce this.

```python
from tokenizers import Tokenizer
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-17m")
print(tok.encode_plus("hello 123").tokens())
# ['[CLS]', ...
```