Diego Devesa
Same result using the current master and reconverting the model. More interestingly, the llama tokenizer seems to produce different results for single tokens than for groups of tokens. For example:...
It looks like SentencePiece [has an option](https://github.com/google/sentencepiece/blob/master/doc/options.md) `--add_dummy_prefix` which adds a dummy whitespace at the beginning of text, so that may well explain it.
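If that is the cause, a minimal sketch of how it could be accounted for on the llama.cpp side is to prepend a space to the raw prompt before tokenizing, so the first word is encoded the same way as a mid-sentence word (the helper name below is just for illustration, not something in the current code):

```c++
#include <string>

// Illustration only: emulate SentencePiece's --add_dummy_prefix by prepending
// a single space to the raw text before it is passed to the tokenizer.
static std::string add_dummy_prefix(const std::string & text) {
    return " " + text;
}
```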
Extracted these options from the tokenizer model protobuf:
```
trainer_spec {
  input: "/large_experiments/theorem/datasets/MERGED/all.test1.merged"
  model_prefix: "spm_model_32k_200M_charcov099995_allowWSO__v2"
  model_type: BPE
  vocab_size: 32000
  self_test_sample_size: 0
  input_format: "text"
  character_coverage: 0.99995
  input_sentence_size: 200000000
  seed_sentencepiece_size: 1000000
  shrinking_factor:...
```
The recently merged #242 still isn't accurate, for example:
```
llama.cpp:
1 -> ''
29871 -> ' '
7346 -> '########'
13383 -> '################'
13 -> '
'
llama:
1...
```
Fixed in #252
According to [this](https://github.com/facebookresearch/llama/issues/16), LLaMA has a context window of 2048.
This currently breaks quantize.cpp; the tokenizer part there needs to be updated to handle the added score.
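A rough sketch of what the updated vocab handling in quantize.cpp might look like, assuming each vocab entry becomes a length-prefixed word followed by a float score (the `fin`/`fout`/`n_vocab` names mirror the existing code and are assumptions here):

```c++
#include <cstdint>
#include <fstream>
#include <vector>

// Sketch only: copy n_vocab vocabulary entries from the f16 model stream to the
// quantized output, forwarding the new per-token score along with each word.
static void copy_vocab(std::ifstream & fin, std::ofstream & fout, int n_vocab) {
    std::vector<char> word;
    for (int i = 0; i < n_vocab; i++) {
        uint32_t len;
        fin.read ((char *) &len, sizeof(len));
        fout.write((char *) &len, sizeof(len));

        word.resize(len);
        fin.read (word.data(), len);
        fout.write(word.data(), len);

        float score; // new field introduced alongside each token
        fin.read ((char *) &score, sizeof(score));
        fout.write((char *) &score, sizeof(score));
    }
}
```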
The tokenization looks great, I couldn't find any differences from the original llama tokenizer.
The model was (presumably) trained to ignore everything before the eos token. Token 13 is \n, so you are replacing the end-of-text token with a newline, so...
To find the token id dynamically you could do something like this in main, after the call to llama_model_load and before the main loop:
```c++
const auto newline_token_id = vocab.token_to_id["\n"];...
```
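And, as a hedged sketch of how that id could then be used in the sampling loop, substituting the newline token whenever the model emits the end-of-text token (the eos id of 2 is an assumption for the LLaMA vocabulary, not taken from the code above):

```c++
// Sketch only: swap the end-of-text token for a newline so generation
// continues on a fresh line instead of stopping.
constexpr int EOS_TOKEN_ID = 2; // assumed LLaMA eos id

int remap_eos_to_newline(int id, int newline_token_id) {
    return id == EOS_TOKEN_ID ? newline_token_id : id;
}
```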