Diego Devesa
Same result using the current master and reconverting the model. More interestingly, the llama tokenizer seems to produce different results for single tokens than for groups of tokens. For example:...
It looks like SentencePiece [has an option](https://github.com/google/sentencepiece/blob/master/doc/options.md) `--add_dummy_prefix` which adds a dummy whitespace at the beginning of text, so that may well explain it.
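If that is the cause, a minimal sketch of how it could be accounted for on the llama.cpp side is to prepend a space to the raw prompt before tokenizing, so the first word is encoded the same way as a mid-sentence word (the helper name below is just for illustration, not something in the current code):

```c++
#include <string>

// Illustration only: emulate SentencePiece's --add_dummy_prefix by prepending
// a single space to the raw text before it is passed to the tokenizer.
static std::string add_dummy_prefix(const std::string & text) {
    return " " + text;
}
```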
Extracted these options from the tokenizer model protobuf:
```
trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.99995
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor:...
```
The recently merged #242 still isn't accurate, for example:
```
llama.cpp:
1 -> ''
29871 -> ' '
7346 -> '########'
13383 -> '################'
13 -> '
'
llama:
1...
```
Fixed in #252
According to [this](https://github.com/facebookresearch/llama/issues/16), LLaMA has a context window of 2048.
This currently breaks quantize.cpp; the tokenizer part there needs to be updated to handle the added score.
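A rough sketch of what the updated vocab handling in quantize.cpp might look like, assuming each vocab entry becomes a length-prefixed word followed by a float score (the `fin`/`fout`/`n_vocab` names mirror the existing code and are assumptions here):

```c++
#include <cstdint>
#include <fstream>
#include <vector>

// Sketch only: copy n_vocab vocabulary entries from the f16 model stream to the
// quantized output, forwarding the new per-token score along with each word.
static void copy_vocab(std::ifstream & fin, std::ofstream & fout, int n_vocab) {
    std::vector<char> word;
    for (int i = 0; i < n_vocab; i++) {
        uint32_t len;
        fin.read ((char *) &len, sizeof(len));
        fout.write((char *) &len, sizeof(len));

        word.resize(len);
        fin.read (word.data(), len);
        fout.write(word.data(), len);

        float score; // new field introduced alongside each token
        fin.read ((char *) &score, sizeof(score));
        fout.write((char *) &score, sizeof(score));
    }
}
```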
The tokenization looks great, I couldn't find any differences from the original llama tokenizer.
The model was (presumably) trained to ignore everything before the eos token. Token 13 is \n, so you are replacing the end-of-text token with a newline, so...
To find the token id dynamically you could do something like this in main, after the call to llama_model_load and before the main loop:
```c++
const auto newline_token_id = vocab.token_to_id["\n"];...
```
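And, as a hedged sketch of how that id could then be used in the sampling loop, substituting the newline token whenever the model emits the end-of-text token (the eos id of 2 is an assumption for the LLaMA vocabulary, not taken from the code above):

```c++
// Sketch only: swap the end-of-text token for a newline so generation
// continues on a fresh line instead of stopping.
constexpr int EOS_TOKEN_ID = 2; // assumed LLaMA eos id

int remap_eos_to_newline(int id, int newline_token_id) {
    return id == EOS_TOKEN_ID ? newline_token_id : id;
}
```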