Arthur
Pretty sure this was fixed
You need Rust to compile this from source!
Sorry for the late reply here, I think pyo3 is required for this TBH. We have an example of this: we use wrappers and match types to switch between `model` or...
Hey, sorry for the delay, I'll try to review it next week!
Hey! This is most probably unrelated to `tokenizers`. Here is a good explanation: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
This is probably just a Python version that was not supported!
It's high on my priority list to run benchmarks and improve our code if needed!
You are using `GPT2Tokenizer`, which is the slow one. Use `GPT2TokenizerFast` 😅
We actually dug in a bit: 1. Rayon parallelism is kinda broken; 2. we have contention on the cache for GPT2; 3. we have memory allocations that are also slowing things down...
One thing though: tiktoken forces the split of very long sequences. If you split them into batches you are already going to get quite a bit better performance.
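A minimal sketch of the idea (the `chunk_text` helper and the chunk size are illustrative, not part of `tokenizers`): pre-split one very long string into pieces, then encode them as a batch so the Rust side can parallelize over chunks:

```python
# Hypothetical helper: split one very long string into fixed-size chunks
# so the tokenizer sees many short sequences instead of one huge one.
def chunk_text(text: str, max_chars: int = 1_000) -> list[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

long_text = "some token " * 100_000  # ~1.1M characters
chunks = chunk_text(long_text)
print(len(chunks))  # → 1100

# With the `tokenizers` library you would then encode the whole batch at
# once instead of one giant sequence:
#   encodings = tokenizer.encode_batch(chunks)
```

Note that a fixed character cut can split a word at a chunk boundary, so token counts near boundaries may differ slightly from encoding the full text in one go.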