Arthur
# What does this PR do?

```python
from transformers import LlamaTokenizerFast, AddedToken

tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=False, from_slow=True)
tokenizer.add_tokens([AddedToken("", rstrip=True, lstrip=True)], special_tokens=False)
tokenizer.tokenize("inform. Hey. .")
['', 'in', 'form', '', '.', '▁Hey',...
```
# What does this PR do?

Draft to win more test time
# What does this PR do?
# What does this PR do?

Draft for now
# What does this PR do?

Update gemma
# What does this PR do?

A small improvement, but overall it allows us to add special and non-special tokens at the same time for fast tokenizers.
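A minimal sketch of what this enables, assuming the `special` flag on `AddedToken` is honored per token so that one call can mix both kinds (the checkpoint and token strings below are placeholders):

```python
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

# One call adding a special token and a regular token together
tokenizer.add_tokens([
    AddedToken("<my_special>", special=True),   # hypothetical special token
    AddedToken("my_regular", special=False),    # hypothetical regular token
])
print(tokenizer.tokenize("<my_special> my_regular"))
```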
This reverts the previous breaking change. It also adds a new `ByteLevel` normalizer, which replaces the `ByteLevel` pre_tokenizer. Checked that we can add Chinese / Cyrillic tokens, which are properly encoded...
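A minimal sketch of the intended usage, assuming the new normalizer is exposed as `normalizers.ByteLevel()` in the Python bindings (the model and tokens below are placeholders):

```python
from tokenizers import Tokenizer, models, normalizers

tok = Tokenizer(models.BPE())
# Assumption: the new ByteLevel normalizer replaces the ByteLevel pre_tokenizer here
tok.normalizer = normalizers.ByteLevel()

# Chinese / Cyrillic tokens should now be stored and matched correctly
tok.add_tokens(["你好", "привет"])
print(tok.encode("你好привет").tokens)
```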
Try to make our code faster :)

From an initial bench for GPT2:
- 20% of the time is spent in the pre_tokenizer when doing batch encoding
- 8% for no...
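A rough micro-benchmark of the kind that can surface these numbers; a hypothetical sketch rather than the actual bench (corpus and sizes are made up):

```python
import time
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")
texts = ["The quick brown fox jumps over the lazy dog. " * 20] * 2000

start = time.perf_counter()
tok.encode_batch(texts)
total = time.perf_counter() - start

# Time the pre_tokenizer in isolation (single-threaded, so only a rough proxy)
start = time.perf_counter()
for text in texts:
    tok.pre_tokenizer.pre_tokenize_str(text)
pre_tok = time.perf_counter() - start

print(f"batch encode: {total:.3f}s, pre_tokenizer alone: {pre_tok:.3f}s")
```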
```python
>>> from tokenizers import Tokenizer
>>> Tokenizer.from_pretrained("ArthurZ/new-t5-base")
Tokenizer(normalizer=normalizers.Sequence([normalizers.Precompiled(), normalizers.Strip(strip_left=false, strip_right=true), normalizers.Replace(pattern=Regex(" {2,}"), content="▁", regex=SysRegex { regex: Regex { raw: 0x1069ca350 } })]), pre_tokenizer=PreTokenizer(pretok=Metaspace(replacement='▁', prepend_scheme="first", split=true)), model=Unigram(vocab={'': 0, '': 0,...
```
# What does this PR do?

EDIT: just a refactor for now.

Enables us to run transformers models with Ragged Tensors. One of the goals is also to make it easy...
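A hypothetical sketch of what the end goal could look like, assuming a TF model can eventually consume `tf.RaggedTensor` inputs directly (checkpoint and inputs are placeholders; this is not guaranteed to work on the current refactor):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
model = TFAutoModel.from_pretrained("bert-base-uncased")

# Variable-length sequences kept unpadded as a RaggedTensor
encodings = [tokenizer(t)["input_ids"] for t in ["a short text", "a somewhat longer piece of text"]]
input_ids = tf.ragged.constant(encodings)

# Assumption: the model accepts ragged inputs without padding once the work lands
outputs = model(input_ids=input_ids)
```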