John
I have my doubts about the error report. I'd like to see a freshly compiled CPU version that "works fine" with your model. There have been changes in 4_1 as...
> So there is no way at the moment to speed up this process? openBLAS (or preferably Intel's One library) was said to speed that up. I've not tested it...
I just spent a couple of hours benchmarking and remembered this issue. There are two major factors at play currently: 1) The entire prompt needs to be processed,...
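To make the two factors measurable separately, here is a hypothetical benchmarking harness (not the code used for the numbers above): it times prompt processing apart from per-token generation, with `process_prompt` and `generate_token` standing in for whatever the real model does.

```python
import time

def benchmark(process_prompt, generate_token, prompt_tokens, n_generate):
    """Time prompt processing and token generation separately.

    process_prompt: callable taking the prompt tokens, returning some state.
    generate_token: callable taking the state, returning the next state.
    Returns tokens/second for each phase.
    """
    t0 = time.perf_counter()
    state = process_prompt(prompt_tokens)
    t_prompt = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(n_generate):
        state = generate_token(state)
    t_gen = time.perf_counter() - t0

    return {
        "prompt_tokens_per_s": len(prompt_tokens) / t_prompt,
        "gen_tokens_per_s": n_generate / t_gen,
    }
```

With a harness like this you can tell whether a slow run is dominated by prompt processing (the BLAS-sensitive part) or by per-token generation.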
You should link the exact model you used and make sure it's what you have locally, not something else. Make sure you have ALL files from the HF dump in...
Okay, that looks bad at first glance.
```
ENDOFTEXT = ""
IMSTART = ""
IMEND = ""
# as the default behavior is changed to allow special...
```
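Those empty strings are suspicious: my guess is the page's HTML rendering ate the `<|...|>` markup, and these are meant to be the ChatML-style special tokens Qwen uses. A minimal sketch under that assumption, showing what the constants would look like and how they frame a chat turn:

```python
# Assumption: the blank strings above were originally the ChatML-style
# special tokens (their "<|...|>" markup stripped by HTML rendering).
ENDOFTEXT = "<|endoftext|>"
IMSTART = "<|im_start|>"
IMEND = "<|im_end|>"

def wrap_chat_turn(role: str, content: str) -> str:
    """Format one ChatML-style chat turn with the special markers."""
    return f"{IMSTART}{role}\n{content}{IMEND}\n"

print(wrap_chat_turn("user", "hello"))
```

If the tokenizer config really ships empty strings for these, the special tokens can never match during tokenization, which would explain the bad behavior.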
https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee Try that; it might convert their tokenizer format to HF, if that hasn't already been done. I checked the pull requests a bit and it looks like QWEN is likely...
Just guessing: after the prompt is processed there can be a noticeable delay until the completion starts. Also, there are interactive modes that wait for return/enter.
I implemented that, but I don't have a clean patch at the moment. It works well, but you don't save that much performance. For example, on my 3090 I get 15-18 iterations/second...
Thanks for the explanation; I think the help output needs a change. It describes "-c" as context for the prompt (which didn't make much sense to me), not as context for...
> Let me try to explain this. Thanks for the response. Here is one more view that's quite interesting. We can see that the utilization of the cores is not spread evenly,...