
Use `tokenizer.vocab_size()` instead of hardcoding 32000 when converting

Open · Ronsor opened this issue on Mar 14, 2023 · 0 comments

When converting the model and tokenizer, use the vocabulary size reported by the tokenizer rather than assuming it is 32000.

Special tokens or other new tokens can be added to the tokenizer, so it's probably best not to assume the vocabulary contains exactly 32000 tokens. A sketch of the idea is below.
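
As a minimal sketch (not the actual conversion script), assuming the tokenizer is a SentencePiece model loaded via the `sentencepiece` Python package, the vocabulary size can be read from the tokenizer itself:

```python
from sentencepiece import SentencePieceProcessor

# Path is a hypothetical placeholder for the model's tokenizer file.
tokenizer = SentencePieceProcessor("tokenizer.model")

# Derive the vocabulary size from the tokenizer rather than hardcoding it;
# any special or newly added tokens are then counted automatically.
n_vocab = tokenizer.vocab_size()  # instead of: n_vocab = 32000

for token_id in range(n_vocab):
    piece = tokenizer.id_to_piece(token_id)
    score = tokenizer.get_score(token_id)
    # ... write each piece and its score into the converted model file ...
```

This way the converter stays correct for fine-tuned or extended tokenizers whose vocabulary differs from the base model's 32000 entries.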

Ronsor — Mar 14 '23 20:03