dongqi shen
OK, thanks. I think this plugin is great; the only shortcoming is that it doesn't support input. Thanks
> So does the attention head number get included?

Yes, it does. Actually, for each head, the attention layer projects the input (which is [768]) to a smaller size (which is...
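To make the per-head projection concrete, here is a minimal PyTorch sketch; the 768 hidden size and 12 heads are example values (e.g. GPT-2 small), not taken from the thread:

```python
import torch
import torch.nn as nn

hidden_size = 768                      # example model width
num_heads = 12                         # example head count
head_dim = hidden_size // num_heads    # 64: the "small size" each head works in

# One linear projection, then reshaped so every head gets its own 64-dim slice.
q_proj = nn.Linear(hidden_size, hidden_size)

x = torch.randn(1, 10, hidden_size)        # (batch, seq_len, hidden)
q = q_proj(x)                              # (1, 10, 768)
q = q.view(1, 10, num_heads, head_dim)     # (1, 10, 12, 64)
q = q.transpose(1, 2)                      # (1, 12, 10, 64): per-head queries
print(q.shape)                             # torch.Size([1, 12, 10, 64])
```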
I've got a naive question. If I want to implement a task from this issue or another open issue, how do I know whether somebody is already doing the same work as...
When the kernel receives a PyTorch tensor as an argument, the function `get_torch_callbacks(v, ...)` checks it with `v.is_contiguous()`. However, the function `.from_torch()` simply calls `.contiguous()`, as you described in #4258. I...
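For context, a PyTorch-only sketch of the difference between the two behaviors; the Taichi-side helpers are only paraphrased here, this is not their actual code:

```python
import torch

x = torch.randn(4, 4).t()   # a transposed view is NOT contiguous
print(x.is_contiguous())    # False -> a check like the kernel's would reject/flag it

y = x.contiguous()          # .contiguous() instead silently makes a contiguous copy
print(y.is_contiguous())    # True
print(x.data_ptr() == y.data_ptr())  # False: the data was copied to new storage
```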
Fantastic! That works for me!
I think one possible reason is the tokenizer. The vocab size of LLaMA is 32,000, but that of ChatGLM-6B is about 150,000.
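A quick way to check the vocab sizes yourself (the model IDs below are assumptions; ChatGLM-6B needs `trust_remote_code=True` and the LLaMA weights may be gated):

```python
from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
glm_tok = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

print(len(llama_tok))   # ~32,000
print(len(glm_tok))     # much larger, with far more Chinese tokens
```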
@jinfagang The member's explanation is [here](https://github.com/THUDM/ChatGLM-6B/issues/127#issuecomment-1473366712). I don't know much about tokenizers, sorry about that. However, from my tests, ChatGLM-6B performs much better than LLaMA-7B in Chinese.
I have tested it with Qwen-1.8B on an RTX 2080, and the inference speed is about twice that of the original (50 tok/s vs ~100 tok/s), which is fascinating....
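For anyone who wants to reproduce the numbers, the timing scaffolding looks roughly like this; `generate` stands in for whatever generation entry point the repo exposes (a hypothetical name, not the repo's actual API):

```python
import time
import torch

def tokens_per_second(generate, prompt_ids, max_new_tokens=200):
    """Time a single generation call and report decode throughput."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = generate(prompt_ids, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - prompt_ids.shape[-1]
    print(f"{new_tokens / elapsed:.1f} tok/s")
```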
@dashi6174 https://github.com/DongqiShen/qwen-fast