jarcen
I mentioned it in another issue: llama was trained with tokenizer augmentations, meaning the tokenizer occasionally did sub-optimal word partitioning at training time: https://github.com/google/sentencepiece#subword-regularization-and-bpe-dropout It is claimed this improves generalization....
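For a rough idea of what that augmentation looks like in practice, here is a small sketch using the SentencePiece C++ API directly (the `tokenizer.model` path is just a placeholder): with sampling enabled the same string gets split differently from run to run, whereas the greedy encoding used at inference time always returns one fixed segmentation.
```cpp
#include <cstdio>
#include <string>
#include <vector>

#include <sentencepiece_processor.h>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    if (!sp.Load("tokenizer.model").ok()) return 1;   // placeholder model path

    const std::string text = "New York";
    for (int i = 0; i < 3; ++i) {
        std::vector<std::string> pieces;
        // nbest_size = -1 samples from all hypotheses, alpha controls smoothing;
        // each call may return a different (sub-optimal) segmentation
        if (!sp.SampleEncode(text, /*nbest_size=*/-1, /*alpha=*/0.1f, &pieces).ok()) return 1;
        for (const auto & p : pieces) printf("[%s] ", p.c_str());
        printf("\n");
    }
    return 0;
}
```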
Throwing out some ideas about the actual reasons behind the bug. I think it's the classic integer division gotcha: https://github.com/ggerganov/llama.cpp/blob/8cf9f34eddc124d4ab28f4d2fe8e99d574510bde/main.cpp#L757-L758 If the batch size 'N' > 1 then there will be loss of...
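To make the gotcha concrete, a tiny standalone sketch (the figures are made up, not taken from an actual run): dividing the measured usage by the batch size with integer division silently rounds the per-token estimate down, so any later prediction of the form `mem_per_token * n_tokens` can undershoot the real requirement.
```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // made-up figures, only to show the truncation itself
    const size_t used_mem      = 59375123;          // bytes used by one batched eval
    const size_t N             = 4;                 // batch size
    const size_t mem_per_token = used_mem / N;      // == 14843780, remainder 3 is dropped

    // rounding up instead of down avoids underestimating later allocations
    const size_t mem_per_token_ceil = (used_mem + N - 1) / N;

    printf("floor estimate: %zu, ceil estimate: %zu\n", mem_per_token, mem_per_token_ceil);
    return 0;
}
```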
Looking further, it also slowly creeps up as the prompt is being read (batch size = 4):
```
Used mem: 59375120, predicted 57474576
Used mem: 59440656, predicted 57474576
Used mem: 59506192, predicted 57474576
...
```
Right, batch processing at least must construct `ggml_diag_mask_inf` for masked attention so that each token in the batch can attend not only to past memory but also to its neighbors in the...
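For illustration, a plain-C++ sketch (not the ggml API) of the additive mask that `ggml_diag_mask_inf` conceptually applies; `n_past` and `N` are arbitrary example values. Each of the N new tokens may attend to all cached positions and to batch neighbors at or before its own position; strictly later positions get -inf added to their scores before the softmax.
```cpp
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

int main() {
    const int n_past = 3;   // tokens already in the KV memory (example value)
    const int N      = 4;   // tokens in the current batch (example value)
    const float NEG_INF = -std::numeric_limits<float>::infinity();

    // mask[i][j]: row i is the i-th token of the batch, column j an attended position
    std::vector<std::vector<float>> mask(N, std::vector<float>(n_past + N, 0.0f));
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < n_past + N; ++j)
            if (j > n_past + i)        // strictly in this token's future
                mask[i][j] = NEG_INF;

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < n_past + N; ++j)
            printf("%5s", std::isinf(mask[i][j]) ? " -inf" : "    0");
        printf("\n");
    }
    return 0;
}
```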
I started messing with this project two hours ago and had exactly the same issue: completely mangled output. It turned out the problem for me was that I compiled it with Cygwin....
My observations: token 4013 = 'This', token 910 = ' This', token 10994 = 'Hello', token 15043 = ' Hello'. Notice the whitespace: they're different tokens. I don't know why the Python library doesn't...
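A quick way to see the difference is to run the strings through SentencePiece directly. This sketch uses the SentencePiece C++ API rather than llama.cpp's own tokenizer code, and `tokenizer.model` is a placeholder path for the LLaMA vocabulary file.
```cpp
#include <cstdio>
#include <vector>

#include <sentencepiece_processor.h>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    if (!sp.Load("tokenizer.model").ok()) return 1;   // placeholder model path

    for (const char * text : {"Hello", " Hello"}) {
        std::vector<int> ids;
        if (!sp.Encode(text, &ids).ok()) return 1;
        printf("'%s' ->", text);
        for (int id : ids) printf(" %d", id);   // the leading space changes the ids
        printf("\n");
    }
    return 0;
}
```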
I don't expect it will. I read the sentencepiece front-page documentation and it says it uses regularization at training time. Basically, it randomly creates suboptimal tokenized strings to improve robustness. It...
@bitRAKE Yes, those are the transformer's hidden states; preserving them is sufficient. Now the question is how to edit them properly. I'm also interested in removing the first n elements to deal...
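For the sake of discussion, a rough sketch of what dropping the first n cached tokens could look like, assuming a contiguous per-layer [n_ctx][n_embd] key/value buffer (the struct and names here are assumptions, not llama.cpp's actual layout). Note this alone is not a complete answer: the cached keys already carry positional information, so naively sliding them changes what the positions refer to.
```cpp
#include <cstdio>
#include <cstring>
#include <vector>

// assumed layout, not llama.cpp's actual structs
struct KvLayer {
    std::vector<float> k;   // n_ctx * n_embd floats, one row per cached token
    std::vector<float> v;
};

// drop the first n_drop cached tokens by sliding the remaining rows forward
void drop_first_tokens(std::vector<KvLayer> & layers, int & n_past,
                       int n_drop, int n_embd) {
    if (n_drop <= 0 || n_drop > n_past) return;
    const size_t keep   = (size_t)(n_past - n_drop) * n_embd;
    const size_t offset = (size_t)n_drop * n_embd;
    for (auto & l : layers) {
        std::memmove(l.k.data(), l.k.data() + offset, keep * sizeof(float));
        std::memmove(l.v.data(), l.v.data() + offset, keep * sizeof(float));
    }
    n_past -= n_drop;   // new tokens are then appended after the kept rows
}

int main() {
    const int n_ctx = 8, n_embd = 4;
    std::vector<KvLayer> layers(2, {std::vector<float>(n_ctx * n_embd),
                                    std::vector<float>(n_ctx * n_embd)});
    int n_past = 6;
    drop_first_tokens(layers, n_past, 2, n_embd);
    printf("n_past after drop: %d\n", n_past);   // 4
    return 0;
}
```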
That's incorrect, and it shouldn't sacrifice anything. It should also be faster on CPU: all the PyTorch transformers I've had to run on CPU were significantly faster at reading prompts than...
They are not computed at the same time. The computations in one layer are separated into the three steps I listed above. Step 2 operates on Query-Key-Value matrices which were already...
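To illustrate the separation with a toy single-head example (this is the standard layer decomposition, not a verbatim reproduction of the three steps from the earlier comment): step 1 projects every token of the batch into Q/K/V independently, step 2 is attention, which only reads those already-computed matrices plus whatever sits in the cache, and step 3 is the per-token feed-forward.
```cpp
#include <cmath>
#include <cstdio>
#include <vector>

const int D = 2;   // toy embedding size

// stand-in for a weight matrix multiplication
std::vector<float> proj(const std::vector<float> & x, float w) {
    std::vector<float> y(D);
    for (int i = 0; i < D; ++i) y[i] = w * x[i];
    return y;
}

int main() {
    std::vector<std::vector<float>> tokens = {{1, 0}, {0, 1}, {1, 1}};   // a 3-token batch
    std::vector<std::vector<float>> Qs, Ks, Vs;

    // step 1: Q/K/V projections for every token in the batch (independent, parallelizable)
    for (const auto & x : tokens) {
        Qs.push_back(proj(x, 0.5f));
        Ks.push_back(proj(x, 1.0f));
        Vs.push_back(proj(x, 2.0f));
    }

    // step 2: attention; token t only uses Q[t] and the K/V of positions <= t,
    // all of which were produced in step 1 (or live in the KV cache from earlier calls)
    for (size_t t = 0; t < tokens.size(); ++t) {
        std::vector<float> scores(t + 1), out(D, 0.0f);
        float denom = 0.0f;
        for (size_t j = 0; j <= t; ++j) {
            float s = 0.0f;
            for (int i = 0; i < D; ++i) s += Qs[t][i] * Ks[j][i];
            scores[j] = std::exp(s / std::sqrt((float)D));
            denom += scores[j];
        }
        for (size_t j = 0; j <= t; ++j)
            for (int i = 0; i < D; ++i) out[i] += scores[j] / denom * Vs[j][i];

        // step 3: per-token feed-forward would go here (omitted, identity stand-in)
        printf("token %zu attn out: %.3f %.3f\n", t, out[0], out[1]);
    }
    return 0;
}
```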