turboderp
Yeah, it's still `[UNUSED_TOKEN_145]` as far as SentencePiece is concerned. It should recognize the added tokens when encoding with `encode_special_tokens = True`. Using `decode_special_tokens = True` when decoding should also...
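Roughly what those two flags look like in use — a minimal sketch assuming exllamav2's `ExLlamaV2Config` / `ExLlamaV2Tokenizer` classes; only `encode_special_tokens` and `decode_special_tokens` come from the comment above, the model path is a placeholder:

```python
# Sketch only: assumes the exllamav2 tokenizer classes and a local model dir.
from exllamav2 import ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"   # placeholder path
config.prepare()

tokenizer = ExLlamaV2Tokenizer(config)

text = "Hello [UNUSED_TOKEN_145]"

# Without the flag the added token gets split into ordinary SentencePiece
# pieces; with encode_special_tokens = True it should map to its single
# added-token id.
ids = tokenizer.encode(text, encode_special_tokens = True)

# Likewise, decode_special_tokens = True should keep the token visible in the
# output text rather than treating it as a control token.
print(tokenizer.decode(ids, decode_special_tokens = True))
```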
8-bit wouldn't be faster than 4-bit. And the perplexity wouldn't be much better, either, at least for large models. You'd be better off with one of the sparse methods.
4 bits is about the sweet spot for Llama: you get decent enough perplexity while leaving room for larger models, and model size has a much greater impact than quantization precision....
I've tested that particular model, and it *should* work. I run it with `-gs 17.2,24` though. The error might be because it gives up on loading the entire model with...
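If you're loading in code rather than through the example scripts, this is roughly the equivalent of `-gs 17.2,24` — a sketch assuming the exllama v1 repo layout and that `-gs` feeds into `ExLlamaConfig.set_auto_map` (my reading of the example scripts); paths are placeholders:

```python
# Sketch only: assumes you are running from the exllama v1 repo root.
from model import ExLlama, ExLlamaConfig

config = ExLlamaConfig("/path/to/model/config.json")       # placeholder
config.model_path = "/path/to/model/model.safetensors"     # placeholder

# Equivalent of -gs 17.2,24: put up to ~17.2 GB of weights on GPU 0 and up to
# 24 GB on GPU 1, leaving headroom on the first card for activations/cache.
config.set_auto_map("17.2,24")

model = ExLlama(config)
```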
Wait a minute, I know what this is, because I had the same issue, come to think of it. The `config.json` file on HF is wrong for that model. They...
@Ph0rk0z I'm not sure what quantization bnb uses, but if it's just RTN then yeah, there's going to be a big difference between 4-bit and 8-bit. GPTQ is a bit...
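Just to make "plain RTN" concrete — this is an illustration of naive round-to-nearest quantization, not a claim about what bnb actually does — a toy numpy sketch showing why the 4-bit vs 8-bit gap is much larger without any error correction:

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Plain round-to-nearest: one min/max range per row, no error compensation."""
    qmax = 2 ** bits - 1
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q * scale + lo          # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)   # stand-in for a weight matrix

for bits in (8, 4):
    err = np.abs(rtn_quantize(w, bits) - w).mean()
    print(f"{bits}-bit RTN mean abs error: {err:.5f}")
```

GPTQ additionally compensates each column's rounding error using calibration data, which is why it holds up much better at 4 bits than plain RTN does.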
Groupsize has a negligible impact on performance, and the extra file size doesn't prevent 33B models from using full context on 24 GB. Act-order has a small impact on speed...
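For a rough sense of the file-size overhead being referred to — back-of-the-envelope only, assuming a 33B model, group size 128, and one FP16 scale plus one packed 4-bit zero point per group (the real format carries a little more metadata):

```python
# Rough estimate of the group-wise metadata overhead for a 4-bit 33B model.
params      = 33e9              # 33B parameters
bits_weight = 4
bits_extra  = (16 + 4) / 128    # per-weight overhead: FP16 scale + 4-bit zero per 128 weights

base_gb  = params * bits_weight / 8 / 1e9
extra_gb = params * bits_extra  / 8 / 1e9
print(f"weights: {base_gb:.1f} GB, group metadata: ~{extra_gb:.2f} GB")
# -> roughly 16.5 GB of weights plus well under 1 GB of scales/zeros
```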
If you mean tensor parallelism as in `torch.distributed.tensor.parallel` then no, but I don't see any other GPTQ implementations using that specific API either. But in principle reordering the rows doesn't...
But how do you avoid gathering in any case? ~~Isn't the fundamental problem still the same, that if you split A in rows you need to split B in columns,...
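For reference, the two ways of splitting a single matmul and the communication each one implies — a toy numpy example with generic names, not tied to any particular implementation:

```python
# Toy illustration of splitting C = A @ B across two "devices" (numpy shards).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 16))    # activations
B = rng.normal(size=(16, 32))   # weights
C = A @ B

# (1) Split B by columns, replicate A: each device produces a disjoint slice
#     of C, so the only communication is a gather/concatenate (or none, if the
#     next op can consume the slices directly).
C_cols = np.concatenate([A @ B[:, :16], A @ B[:, 16:]], axis=1)

# (2) Split along the shared inner dimension (A by columns, B by rows): each
#     device produces a partial sum over the whole of C, so the results must
#     be added together, i.e. an all-reduce.
C_rows = A[:, :8] @ B[:8, :] + A[:, 8:] @ B[8:, :]

assert np.allclose(C, C_cols) and np.allclose(C, C_rows)
```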
I see. I will have to read that a little more closely, but going by the MLP example, for instance, they aren't splitting the state (X here); they're cloning it...
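A toy version of that MLP example — a sketch of the Megatron-style scheme being described, with numpy shards standing in for two GPUs; shapes and names are made up:

```python
# X is replicated ("cloned") on both devices, W1 is split by columns and W2 by
# rows; the only communication is one sum (all-reduce) over the partial outputs.
import numpy as np

def gelu(x):
    # tanh approximation of GeLU, elementwise
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X  = rng.normal(size=(4, 64))      # replicated input state
W1 = rng.normal(size=(64, 256))    # split by columns across devices
W2 = rng.normal(size=(256, 64))    # split by rows across devices

# Single-device reference
Z_ref = gelu(X @ W1) @ W2

# Device i keeps W1[:, cols_i] and W2[rows_i, :]. The nonlinearity can be
# applied locally because each device holds complete columns of the hidden state.
Z0 = gelu(X @ W1[:, :128]) @ W2[:128, :]
Z1 = gelu(X @ W1[:, 128:]) @ W2[128:, :]

assert np.allclose(Z_ref, Z0 + Z1)   # one all-reduce recovers the full output
```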