Stephan Tulkens

28 comments by Stephan Tulkens

Here you go: https://huggingface.co/stephantulkens/large_tokenizer/tree/main

Nice! In the meantime we've just added the tokens as regular tokens, which is a lot faster and also kind of works (but requires manually editing the JSON 😆...
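
For what it's worth, the same thing can be done programmatically; a minimal sketch, assuming the `transformers` wrapper (`gpt2` and `<my_token>` are placeholder names):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
# Register new strings as regular (non-special) added tokens,
# rather than hand-editing tokenizer.json.
tok.add_tokens(["<my_token>"])
print(tok.tokenize("hi <my_token>"))  # '<my_token>' surfaces as a single piece
```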

Hey! Tokenizers generally differentiate between tokens occurring at the start of a string and in the middle of a string. In your case, the token `筹` matches only at the...
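
A quick way to see the distinction, sketched with `gpt2`'s byte-level BPE (an illustrative choice, not the tokenizer from the issue):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("hello"))      # ['hello']          start-of-string form
print(tok.tokenize("say hello"))  # ['say', 'Ġhello']  mid-string form, with the 'Ġ' space marker
```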

Hey, I ran into this issue, and wrote a blog post about it: https://stephantul.github.io/python/tokenizers/2023/03/16/bpe/
You can't directly take the byte representation of a token from the vocabulary. Basically, you have...
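
To make the pitfall concrete, a minimal illustration; `Ġhello` stands in for a typical byte-level (GPT-2 style) vocabulary entry for " hello":

```python
# 'Ġhello' is how a byte-level BPE vocabulary stores " hello":
token = "Ġhello"

# Naively encoding the vocabulary string does NOT recover the original bytes.
print(token.encode("utf-8"))  # b'\xc4\xa0hello'
print(b" hello")              # the bytes the token actually represents
```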

Hello @joein, we're still interested in contributing this to the library. Let me know if you'd like a PR to add them. FYI: I think we can just add...

Unless I misunderstand, I think this is supported. You can split by a Regex or a string by using a split pre-tokenizer:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Split on runs of digits; "isolated" keeps the matches as separate pieces.
# (The pattern is illustrative; any Regex or plain string works.)
pre_tokenizer = Split(Regex(r"\d+"), behavior="isolated")
```
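
A quick check of what that pre-tokenizer produces (the sample string is made up):

```python
print(pre_tokenizer.pre_tokenize_str("abc123def"))
# [('abc', (0, 3)), ('123', (3, 6)), ('def', (6, 9))]
```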

Hello, yes there is! You need a mapping from characters to bytes, and to skip any pre-tokenization and normalization steps.

```python
from tokenizers import Tokenizer


def bytes_to_unicode() -> dict[int, str]:
    """Converts bytes to printable unicode characters (GPT-2's byte-level mapping)."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        # Remap control and whitespace bytes to unused printable code points.
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))
```
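
And a sketch of how the table is used (the sample text is arbitrary): every byte of the UTF-8 input gets replaced by a printable stand-in before tokenization.

```python
table = bytes_to_unicode()
text = "héllo"
mapped = "".join(table[b] for b in text.encode("utf-8"))
print(mapped)  # 'hÃ©llo' -- every raw byte now has a printable stand-in
```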

Hello, I'm not one of the maintainers, but I can't seem to reproduce this.

```python
from tokenizers import Tokenizer
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("jhu-clsp/ettin-encoder-17m")
print(tok.encode_plus("hello 123").tokens())
# ['[CLS]', ...
```