Arthur comments

Results 795 comments of Arthur

Rust documentation

Let's close it 😉

thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:428:22: AddedVocabulary bad split

Hey all! Sorry, I'll have a look!

thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:428:22: AddedVocabulary bad split

I am pretty sure that `32106: AddedToken(" ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),` is an issue:
```python
tokenizer.encode("hey .")
```
will produce this issue
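To check whether a tokenizer carries other entries like the one quoted above, a small scan for whitespace-only added tokens can help. This is only a sketch: the helper name `find_whitespace_added_tokens` is hypothetical, it assumes the added tokens are available as a plain `dict` mapping ids to content strings, and the sample ids and contents other than `32106` are made up for illustration.

```python
from typing import Dict, List


def find_whitespace_added_tokens(added_tokens: Dict[int, str]) -> List[int]:
    """Return the ids of added tokens whose content is empty or whitespace-only."""
    # str.strip() drops leading/trailing whitespace, so an empty result means
    # the token contains no non-whitespace characters at all.
    return sorted(
        token_id
        for token_id, content in added_tokens.items()
        if content.strip() == ""
    )


# Hypothetical sample data: id 32106 mirrors the suspicious entry above,
# the other two entries are invented for this illustration.
sample = {32000: "<tok_a>", 32106: " ", 32107: "<tok_b>"}
suspicious = find_whitespace_added_tokens(sample)
print(suspicious)
```

With a `transformers` tokenizer, the mapping could presumably be built from `tokenizer.added_tokens_decoder` by taking each entry's `content`; whether a flagged token is actually responsible for the panic still has to be confirmed by encoding a test string.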

thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:428:22: AddedVocabulary bad split

If I do `AutoTokenizer.from_pretrained("path-to-model", added_tokens_decoder=None)` then this is no longer the case

added_tokens with bytemap characters in ByteLevel could not be decoded correctly

Re-opening as the merge on main will be reverted for a better fix soon

Is there any guidance on adapting tokenizers to a C++ project?

Hey! It does not seem to be requested that much, unfortunately, and it would be a lot of effort on our side. You do have unofficial C bindings out there I...

[Potential Bug] Mistral Tokenizer Inconsistencies

This was fixed in `transformers`; you need to set `legacy=False` 🤗

Added ability to inspect a 'Sequence' decoder and the `AddedVocabulary`.

Thanks for your contribution 🤗

Ability to re-train a Tokenizer with relevant parameters

This issue is more of a feature request than a `problem`. You are doing something wrong, as the error indicates: pretty sure the special tokens are missing in the `tokenizer` while...

Support for openai Whisper

Hey! Pretty sure it is available in `peft`; see this [notebook](https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb) and this [discussion](https://github.com/openai/whisper/discussions/988)