goerch comments

Results: 62 comments of goerch

llama.cpp BPE tokenization of wiki.test does not match the HF tokenization

That is a nice test. I made some modifications to get more detailed test output and see differences like:

1. Problem with `endoftext` ![image](https://github.com/ggerganov/llama.cpp/assets/3709434/291e9c93-4be7-40a5-a181-9d53c1d17cbb)
2. Non-greediness ![image](https://github.com/ggerganov/llama.cpp/assets/3709434/99e749fe-14fd-466a-9908-01d5c9eb756d)...
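For readers who want to reproduce this kind of comparison, here is a minimal sketch of locating the first position where two token ID sequences diverge. The function name `first_mismatch` and the sample IDs are hypothetical and are not taken from the llama.cpp tests.

```python
from typing import Optional, Sequence


def first_mismatch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where a and b differ, or None if they are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Hypothetical token IDs, for illustration only.
reference_ids = [10, 20, 30, 40]
candidate_ids = [10, 20, 31, 40]
idx = first_mismatch(reference_ids, candidate_ids)
if idx is not None:
    print(f"first difference at position {idx}")
```

Printing a small window of tokens around the reported index usually makes it easier to see whether the difference comes from a special token or from a different merge choice.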

llama.cpp BPE tokenization of wiki.test does not match the HF tokenization

Intermediate results of debugging: `bpe_gpt2_preprocess` seems to do the right thing, `llm_tokenizer_bpe::tokenize` seems to be subtly broken, although it looks very similar to `examples/gptneox-wip`. Paging @cmp-nct, in need of help,...

llama.cpp BPE tokenization of wiki.test does not match the HF tokenization

> `llm_tokenizer_bpe::tokenize` seems to be subtly broken

I implemented an independent port of the [gpt2-tokenizer](https://github.com/openai/gpt-2/blob/master/src/encoder.py#L55-L101) (will share the code if someone is interested) and it shows the same behavior as the...
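As background, a GPT-2-style BPE encoder repeatedly merges the adjacent symbol pair with the lowest merge rank until no ranked pair remains. The following is a minimal sketch of that loop, not the port mentioned in the comment. The function name `bpe_merge` and the toy merge table are hypothetical, and byte-level preprocessing and caching are left out.

```python
from typing import Dict, List, Tuple


def bpe_merge(token: str, ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Greedily apply the lowest-ranked merge until no known pair is left."""
    word = list(token)
    while len(word) > 1:
        pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
        # Pick the adjacent pair with the lowest rank; unknown pairs rank as infinity.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break
        merged = []
        i = 0
        while i < len(word):
            # Merge every non-overlapping occurrence of the chosen pair, left to right.
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                merged.append(word[i] + word[i + 1])
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = merged
    return word


# Toy merge table, for illustration only (not a real vocabulary).
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_merge("lower", toy_ranks))
```

With this toy table, `lower` is split into `low` and `er`. A tokenizer that applies merges in a different order than rank order can produce a different split for the same input, which is one way two implementations can disagree.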


llama.cpp BPE tokenization of wiki.test does not match the HF tokenization

llama.cpp BPE tokenization of wiki.test does not match the HF tokenization

llama.cpp BPE tokenization of wiki.test does not match the HF tokenization

llama.cpp BPE tokenization of wiki.test does not match the HF tokenization

Surprising runtime behaviour with objective types

Tanh is not implemented

Crashes when using @threads with intersection functions

Crashes when using @threads with intersection functions

Mismatch in Vocabulary Size: Investigating Inconsistencies between Token-to-ID and ID-to-Token Dictionaries

Fix for windows.