Mack Straight
doh, thanks for pointing that out, I've only been using fp16 =) will fix.
The handling of UTF-8 here is exactly the same as SentencePiece's: multi-byte characters that don't form tokens are output one byte at a time.
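For illustration, a minimal sketch of that byte fallback, assuming a hypothetical `token_id_for(piece)` vocab lookup (SentencePiece encodes raw bytes as pieces like `<0x41>`):

```cpp
#include <cstdio>
#include <string>
#include <vector>

int token_id_for(const std::string & piece); // assumed vocab lookup, returns -1 on a miss

std::vector<int> tokenize_char(const std::string & utf8_char) {
    // if the whole character is in the vocab, use it directly
    if (int id = token_id_for(utf8_char); id >= 0) {
        return { id };
    }
    // byte fallback: emit one single-byte token per byte of the character
    std::vector<int> out;
    for (unsigned char b : utf8_char) {
        char piece[8];
        std::snprintf(piece, sizeof(piece), "<0x%02X>", b);
        out.push_back(token_id_for(piece));
    }
    return out;
}
```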
"why not both?" - changed file magic so existing unversioned files don't misparse (ggml -> ggmf "gg model file") - now a version number in the header
the token vector should probably be a struct now that also includes the score (see https://github.com/ggerganov/llama.cpp/commit/074bea2eb1f1349a0118239c4152914aecaa1be4)
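Something like this shape, for illustration (names here are made up, not necessarily the ones in the linked commit):

```cpp
#include <string>
#include <vector>

struct token_entry {
    std::string text;   // the token's piece
    float       score;  // its sentencepiece score, used when picking merges
};

std::vector<token_entry> id_to_token; // indexed by token id, replaces the plain vector<string>
```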
this is just an out-of-bounds write to memory_k/memory_v when n_past goes past the end, ya? if you add this assert to ggml_view_1d: `GGML_ASSERT((ne0 * GGML_TYPE_SIZE[a->type])/GGML_BLCK_SIZE[a->type] + offset <= ggml_nbytes(a));` it will catch it.
> This looks very reasonable. The question is why we don't see a problem with llama but do with alpaca...

nah, it's reproducible with any model. the key difference is...
Can you try this convert script? https://gist.github.com/eiz/828bddec6162a023114ce19146cb2b82 (it outputs .tmp files; you can uncomment the os.rename to do it in place if you want, but I didn't want to overwrite...)
If you don't have access to the original LLaMA files, I think someone uploaded it here: https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/tokenizer.model
the tokenizer.model contains scores for each token, most of which are just the negation of the token index (since they're output by the BPE trainer in descending order), so I...
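Purely to illustrate that relationship (the real convert script reads the actual scores out of tokenizer.model), a fallback that reconstructs scores from the index would look like:

```cpp
#include <cstddef>
#include <vector>

// approximation only: valid because the BPE trainer emits tokens in
// descending score order, so score[i] is (mostly) just -i
void fill_default_scores(std::vector<float> & scores) {
    for (std::size_t i = 0; i < scores.size(); i++) {
        scores[i] = -static_cast<float>(i);
    }
}
```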