turboderp
>Maybe it was a silly try, but self.weight = tensors[key].half() did not work.

That would cast the packed q4 weight data to half types without dequantizing it first, so that definitely wouldn't...
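To illustrate the distinction (a generic sketch, not exllama's actual tensor layout; the scale and zero point here are hypothetical), q4 weights are packed integer words plus separate quantization parameters, so casting the packed tensor just converts the raw words to floats:

```python
import torch

# Generic illustration, not exllama's actual storage format: q4 weights
# live as packed integer words. .half() numerically casts those words to
# float16 -- it does not unpack or dequantize the 4-bit values inside.
packed = torch.tensor([0x76543210, 0x0FEDCBA9], dtype=torch.int32)

wrong = packed.half()   # float16 versions of the packed words: garbage

# A real dequantization has to unpack the nibbles and apply scale/zero:
nibbles = torch.stack([(packed >> (4 * i)) & 0xF for i in range(8)], dim=-1)
scale, zero = 0.01, 8   # hypothetical per-group quantization parameters
weights = (nibbles.float() - zero) * scale
```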
That's a new one. An internal error in SentencePiece would suggest either a corrupted tokenizer.model or perhaps the wrong version of SentencePiece installed? I'm using 0.1.97, if that...
I can't think of anything else at the moment, really. Failing that, try a different model, or try downloading the tokenizer.model file again.
>I'm unclear of how both CPU and GPU could be saturated at the same time.

PyTorch waits in a busy loop whenever it synchronizes a CUDA stream, as far as...
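A minimal sketch of the pattern: kernel launches return to Python immediately, and the explicit synchronize spins the host thread until the stream drains, so a CPU core looks fully busy even though it's only waiting:

```python
import torch

# Minimal sketch: GPU kernels are launched asynchronously, and the
# explicit synchronize below blocks the host in what is effectively a
# busy-wait until the stream drains. The GPU is saturated by the matmuls
# while a CPU core sits at ~100% just spinning inside the sync call.
x = torch.randn(8192, 8192, device="cuda")
for _ in range(100):
    x = x @ x                 # async launch; returns to Python immediately
torch.cuda.synchronize()      # host spins here until the GPU finishes
```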
Having read up on it a bit, good performance on P40 might be a ways off, unfortunately. Apparently its FP16 performance is 1/64 of its FP32 performance. I guess it's...
Yep, it converts everything to FP32 on the fly. It's hard to get to 160 tokens/second that way, and hard to run a 30B model at full context length when...
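As a rough sketch of what that fallback looks like (an assumption for illustration, not the actual code): upcast the FP16 operands, do the math in full precision, downcast the result. Every matmul pays for two extra casts and runs at FP32 rate:

```python
import torch

# Rough sketch of an FP32 fallback path (an assumption for illustration,
# not the actual implementation): upcast FP16 operands, multiply in full
# precision, downcast the result. The extra casts plus FP32-rate math are
# what make a target like 160 tokens/second hard to reach this way.
def matmul_fp32_fallback(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return (a.float() @ b.float()).half()
```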
I think I'd need to know for sure exactly when half2 support is provided by CUDA and when it isn't. Because there's still a half2 path that needs to compile,...
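One way to probe it at runtime (a sketch using PyTorch's device-capability query, not a substitute for the compile-time check): the native half/half2 arithmetic intrinsics require compute capability 5.3 or higher, though compiling is not the same as being fast:

```python
import torch

# Sketch of a runtime gate for a half2 fast path: native half/half2
# arithmetic intrinsics target compute capability 5.3 and up. Note that
# compiling is not the same as being fast -- Pascal's sm_61 (P40) runs
# FP16 at a small fraction of its FP32 rate even though it compiles.
def device_supports_half2(device: int = 0) -> bool:
    major, minor = torch.cuda.get_device_capability(device)
    return (major, minor) >= (5, 3)
```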
The FP16 problem remains, but INT8 would present problems of its own. It's an integer type, after all, not a drop-in replacement for floats.
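A toy sketch of what INT8 drags in (symmetric per-tensor quantization, for illustration only): values have to be scaled into the integer range and dequantized afterwards, losing precision to rounding and clipping along the way:

```python
import torch

# Toy symmetric per-tensor INT8 quantization, for illustration only:
# floats must be mapped into [-128, 127] with a scale factor, and mapped
# back out after the integer math. Rounding and clipping lose precision,
# which is why INT8 is not a drop-in substitute for FP16/FP32.
def quantize_int8(x: torch.Tensor):
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

x = torch.randn(1024)
q, s = quantize_int8(x)
err = (dequantize_int8(q, s) - x).abs().max()   # nonzero quantization error
```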
Could you elaborate? There are various more-or-less hacky ways to force shorter or longer replies from a language model, but no standard way of doing it. Is there a particular...
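For instance, one common hack (a generic sketch, not a feature of any particular library) is to bias the end-of-sequence token's logit before sampling: pushing it up cuts replies short, pushing it down stretches them out:

```python
import torch

# Generic sketch of one hacky length control, not any library's API:
# nudge the EOS logit before sampling. A positive bias makes the model
# end sooner (shorter replies); a negative bias delays the ending.
def bias_eos_logit(logits: torch.Tensor, eos_token_id: int,
                   bias: float) -> torch.Tensor:
    logits = logits.clone()
    logits[eos_token_id] += bias
    return logits
```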
Here are the last pieces in the SentencePiece model:
```
92530 [UNUSED_TOKEN_133]
92531 [UNUSED_TOKEN_134]
92532 [UNUSED_TOKEN_135]
92533 [UNUSED_TOKEN_136]
92534 [UNUSED_TOKEN_137]
92535 [UNUSED_TOKEN_138]
92536 [UNUSED_TOKEN_139]
92537 [UNUSED_TOKEN_140]
92538 [UNUSED_TOKEN_141]
92539 [UNUSED_TOKEN_142]
...
```