Nicolas Patry

978 comments by Nicolas Patry

Indeed flash isn't supported on V100, and sharding requires flash for llama.
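As a rule of thumb this comes down to GPU compute capability: V100 is SM 7.0, below what the FlashAttention kernels target. A minimal check along those lines (the exact cutoff depends on the flash-attn version, so treat the 8.0 threshold as an assumption):

```python
def supports_flash_attention(major: int, minor: int) -> bool:
    """Return True if a GPU of the given compute capability can run
    FlashAttention kernels, assuming the Ampere (SM 8.0) cutoff used
    by flash-attn v2; earlier versions differ slightly."""
    return (major, minor) >= (8, 0)

# V100 is SM 7.0, A100 is SM 8.0
print(supports_flash_attention(7, 0))  # False -> no flash, so no sharded llama
print(supports_flash_attention(8, 0))  # True
```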

> older models? What do you mean? All TheBloke models have this quantization configuration, no?
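For reference, TheBloke-style GPTQ repos ship a quantization config along these lines (field names vary across versions, so treat this as an illustrative sketch rather than a canonical schema):

```json
{
  "bits": 4,
  "group_size": 128,
  "desc_act": false,
  "quant_method": "gptq"
}
```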

Did you install `tokenizers` from source? With `pip install -e .`? Currently installing that way will work, but the Rust is compiled in *debug* mode and not in *release* mode....

Hi @sobayed, thanks for the example, that was helpful! As @sebpuetz mentioned, you are actually comparing two **very** different algorithms. The `sklearn` example seems to be doing roughly whitespace...

Well, you can get some speedup if you use `encode_batch` instead of `encode`, since it can parallelize across inputs:

```python
tokenizer.encode_batch([text, text, ....])
```

But it depends on...
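The shape of that API can be sketched with a toy stand-in (not the real `tokenizers` implementation, whose parallelism lives in Rust): `encode` handles one text, while `encode_batch` fans a list out over a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor

def encode(text: str) -> list[str]:
    # toy per-text tokenization: plain whitespace split
    return text.split()

def encode_batch(texts: list[str], workers: int = 4) -> list[list[str]]:
    # the batch entry point parallelizes across inputs; the real
    # Rust-backed tokenizer releases the GIL, so threads scale there
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode, texts))

print(encode_batch(["hello world", "foo bar baz"]))
```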

Hey, sorry for keeping you in the dark. We've discussed this internally and I forgot to keep you up to date. Several things came out: - This is a very large proposal,...

> What's the issue with using `@overload`? It makes code reading much harder, since you no longer know where a function starts and where it finishes. Also, functions should only...
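To make the complaint concrete: with `typing.overload`, each overload is a body-less stub ending in `...`, and only the final, broadly-typed definition carries the real implementation, so a reader has to scan past the stubs to find where the function actually starts. A minimal example:

```python
from typing import Union, overload

@overload
def parse(value: int) -> str: ...
@overload
def parse(value: str) -> int: ...

def parse(value: Union[int, str]) -> Union[str, int]:
    # the single real implementation; the stubs above exist
    # only for the type checker and are never executed
    if isinstance(value, int):
        return str(value)
    return int(value)

print(parse(3))    # "3"
print(parse("4"))  # 4
```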

@Smu-Tan you're more than welcome to contribute it if you want. In general this library doesn't really follow the `spm` architecture, where normalization and pre-tokenization are separate steps from the...
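For contrast, the `tokenizers` pipeline keeps normalization and pre-tokenization as explicit stages that run in sequence before the model. A toy sketch of that layering (not the real API, just the control flow):

```python
import unicodedata

def normalize(text: str) -> str:
    # what a Normalizer component might do: NFKC + lowercasing
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text: str) -> list[str]:
    # split into word-level pieces before the model sees them
    return text.split()

def tokenize(text: str) -> list[str]:
    # stages run in order: Normalizer -> PreTokenizer -> (Model)
    return pre_tokenize(normalize(text))

print(tokenize("Hello World"))  # ['hello', 'world']
```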

You would need to change the source code to use a network socket for NCCL. However, why not deploy on 4xA10G instead? Latency is likely to be much better. We...
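As an aside, NCCL's transport can often be nudged through standard environment variables without source changes; whether that is sufficient here depends on the deployment, so treat this as a sketch:

```shell
# disable peer-to-peer (NVLink/PCIe) transport so NCCL falls back to sockets
export NCCL_P2P_DISABLE=1
# pin the network interface NCCL should use for socket traffic
export NCCL_SOCKET_IFNAME=eth0
```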