John
I have my doubts about the error report. I'd like to see a freshly compiled CPU version that "works fine" with your model. There have been changes in 4_1 as...
> So there is no way at the moment to speed up this process? openBLAS (or preferably Intel's One library) was said to speed that up. I've not tested it...
I just spent a couple of hours benchmarking and remembered this issue. There are two major factors at play currently: 1) The entire prompt needs to be processed,...
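To make the two factors measurable separately, here is a hypothetical benchmarking harness (not the code used for the numbers above): it times prompt processing apart from per-token generation, with `process_prompt` and `generate_token` standing in for whatever the real model does.

```python
import time

def benchmark(process_prompt, generate_token, prompt_tokens, n_generate):
    """Time prompt processing and token generation separately.

    process_prompt: callable taking the prompt tokens, returning some state.
    generate_token: callable taking the state, returning the next state.
    Returns tokens/second for each phase.
    """
    t0 = time.perf_counter()
    state = process_prompt(prompt_tokens)
    t_prompt = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(n_generate):
        state = generate_token(state)
    t_gen = time.perf_counter() - t0

    return {
        "prompt_tokens_per_s": len(prompt_tokens) / t_prompt,
        "gen_tokens_per_s": n_generate / t_gen,
    }
```

With a harness like this you can tell whether a slow run is dominated by prompt processing (the BLAS-sensitive part) or by per-token generation.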
You should link the exact model you used and make sure it's what you have locally, not something else. Make sure you have ALL files from the HF dump in...
Okay, that looks bad at first glance.
```
ENDOFTEXT = ""
IMSTART = ""
IMEND = ""
# as the default behavior is changed to allow special...
```
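Those empty strings are suspicious: my guess is the page's HTML rendering ate the `<|...|>` markup, and these are meant to be the ChatML-style special tokens Qwen uses. A minimal sketch under that assumption, showing what the constants would look like and how they frame a chat turn:

```python
# Assumption: the blank strings above were originally the ChatML-style
# special tokens (their "<|...|>" markup stripped by HTML rendering).
ENDOFTEXT = "<|endoftext|>"
IMSTART = "<|im_start|>"
IMEND = "<|im_end|>"

def wrap_chat_turn(role: str, content: str) -> str:
    """Format one ChatML-style chat turn with the special markers."""
    return f"{IMSTART}{role}\n{content}{IMEND}\n"

print(wrap_chat_turn("user", "hello"))
```

If the tokenizer config really ships empty strings for these, the special tokens can never match during tokenization, which would explain the bad behavior.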
https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee Try that; it might convert their tokenizer format to HF, if that hasn't already been done. I checked the pull requests a bit and it looks like QWEN is likely...
Just guessing: after the prompt is processed there can be a noticeable delay until the completion starts. Also, there are interactive modes that wait for return/enter.
I implemented that, but I don't have a clean patch at the moment. It works well, but you don't save that much performance. For example, on my 3090 I get 15-18 iterations/second...
Thanks for the explanation; I think the help output needs a change. It describes "-c" as context for the prompt (which didn't make much sense to me), not as context for...
> Let me try to explain this. Thanks for the response. Here is one more view that's quite interesting. We can see that the utilization of the cores is not spread evenly,...