Arthur


# What does this PR do?

```python
from transformers import LlamaTokenizerFast, AddedToken

tokenizer = LlamaTokenizerFast.from_pretrained("huggyllama/llama-7b", legacy=False, from_slow=True)
tokenizer.add_tokens([AddedToken("", rstrip=True, lstrip=True)], special_tokens=False)
tokenizer.tokenize("inform. Hey. .")
# ['', 'in', 'form', '', '.', '▁Hey', ...
```

# What does this PR do? Draft to buy more test time

# What does this PR do?

# What does this PR do? Draft for now

# What does this PR do? Update gemma

# What does this PR do? A small improvement, but overall it allows us to add special and non-special tokens at the same time for fast tokenizers.
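
A minimal sketch of such a mixed call, assuming a transformers version where `AddedToken` exposes a `special` flag (the checkpoint and token strings here are placeholders, not the PR's test case):

```python
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder checkpoint

# One special and one regular token, added in a single add_tokens call
tokenizer.add_tokens([
    AddedToken("<ctrl>", special=True),    # handled as a special token
    AddedToken("newword", special=False),  # plain added token
])
```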

This reverts the previous breaking change. It also adds a new `ByteLevel` normalizer, which replaces the `ByteLevel` pre_tokenizer. Checked that we can add Chinese / Cyrillic tokens, which are properly encoded...
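
A quick sketch of the new normalizer in isolation, assuming a tokenizers release that ships `normalizers.ByteLevel`:

```python
from tokenizers import normalizers

# Byte-level mapping applied during normalization instead of pre-tokenization
norm = normalizers.ByteLevel()

# Each UTF-8 byte of the Cyrillic input is mapped to a printable symbol
norm.normalize_str("кот")
```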

Try to make our code faster :) From an initial bench for GPT-2:
- 20% of the time is spent in the pre_tokenizer when doing batch encoding
- 8% for no...
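
For context, a rough way to time the batch-encoding path (a sketch; the corpus and sizes are made up, not the actual bench):

```python
import time
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")
texts = ["The quick brown fox jumps over the lazy dog. " * 10] * 2000

start = time.perf_counter()
tok.encode_batch(texts)
print(f"encode_batch: {time.perf_counter() - start:.3f}s")
```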

```python
>>> from tokenizers import Tokenizer
>>> Tokenizer.from_pretrained("ArthurZ/new-t5-base")
Tokenizer(normalizer=normalizers.Sequence([normalizers.Precompiled(), normalizers.Strip(strip_left=false, strip_right=true), normalizers.Replace(pattern=Regex(" {2,}"), content="▁", regex=SysRegex { regex: Regex { raw: 0x1069ca350 } })]), pre_tokenizer=PreTokenizer(pretok=Metaspace(replacement='▁', prepend_scheme="first", split=true)), model=Unigram(vocab={'': 0, '': 0, ...
```
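
The `prepend_scheme="first"` in that dump is the notable part; here is a small sketch of the pre-tokenizer in isolation, assuming a tokenizers version whose `Metaspace` takes `prepend_scheme` and `split`:

```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.Metaspace(replacement="▁", prepend_scheme="first", split=True)

# "first" prepends '▁' only to the first word rather than to every split
pre.pre_tokenize_str("Hey friend")
```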

# What does this PR do? EDIT: just a refactor for now. Enables us to run transformers models with ragged tensors. One of the goals is also to make it easy...
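
As a rough illustration of the ragged-tensor idea (a TensorFlow sketch; none of this is the PR's actual API):

```python
import tensorflow as tf

# Variable-length token-id sequences kept ragged instead of pre-padded
ids = tf.ragged.constant([[101, 2023, 102], [101, 102]])

# Densify only where a rectangular tensor is actually required
dense = ids.to_tensor(default_value=0)      # zero-padded [batch, max_len]
mask = tf.sequence_mask(ids.row_lengths())  # True where real tokens are
```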

run-slow