turboderp

180 comments of turboderp

Well, like I said, you're not compressing the _content_ of the context. It's not like it has a fuzzier recollection of tokens when their positional embeddings are closer together. It's...
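
(For reference, a minimal sketch of what that position compression amounts to — the names and the factor here are illustrative, not ExLlama's actual code. Only the rotary angles are scaled; the token content is left alone:)

```python
import torch

def rope_angles(positions, head_dim, base=10000.0, compress=4.0):
    # Linear position interpolation: divide positions by the compression factor
    # before building the rotary angles. The token embeddings themselves are
    # untouched -- only the positional phase gets "squeezed".
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = positions.float() / compress        # e.g. position 8000 behaves like 2000
    return torch.outer(pos, inv_freq)         # (seq_len, head_dim / 2) rotation angles
```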

> There is a flag for gptq/torch called `use_cuda_fp16 = False` that gives a massive speed boost -- is it possible to do something similar in exllama?

Well, it would give...

I finished some more thorough tests now, and it's actually kind of promising. Perhaps @kaiokendev would be interested as well:

![superhot_test](https://github.com/turboderp/exllama/assets/11859846/0b08e754-0f01-4a33-85f8-876c16bee68a)

This is running a perplexity test on a number...
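
(The test itself is nothing exotic — roughly the sketch below, assuming a model callable that returns logits over fixed-length chunks; this is illustrative, not the actual test script:)

```python
import math
import torch

def perplexity(model, tokens, seq_len=8192, device="cuda"):
    # tokens: 1D LongTensor of token ids for the test set (hypothetical input)
    total_nll, total_count = 0.0, 0
    for i in range(0, tokens.numel() - 1, seq_len):
        chunk = tokens[i : i + seq_len + 1].unsqueeze(0).to(device)
        if chunk.shape[1] < 2:
            break
        with torch.no_grad():
            logits = model(chunk[:, :-1])                      # (1, n, vocab)
        logprobs = torch.log_softmax(logits.float(), dim=-1)
        targets = chunk[:, 1:]
        nll = -logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
        total_nll += nll.sum().item()
        total_count += targets.numel()
    return math.exp(total_nll / total_count)                   # lower is better
```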

I'm curious how you're configuring the model in this case? If you're running with `max_seq_len = 8192` in all cases, then the model is correctly allocating the full cache in...
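
(Roughly how I'd expect it to be set up — a sketch against the current ExLlama scripts; paths are made up and exact attribute names may differ between versions:)

```python
from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("models/superhot-13b/config.json")    # hypothetical paths
config.model_path = "models/superhot-13b/model.safetensors"
config.max_seq_len = 8192         # cache is sized for this, regardless of prompt length
config.compress_pos_emb = 4       # 4x position compression to match the 8k fine-tune

model = ExLlama(config)
cache = ExLlamaCache(model)       # allocates the full 8192-token K/V cache up front
```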

Yep. It's necessary to avoid memory fragmentation, but it also makes more sense to me to allocate up front what you can predict you're eventually going to need anyway. But...
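
(In other words, something like this rather than growing the cache with torch.cat every token — a generic sketch with made-up shapes, not ExLlama's internals:)

```python
import torch

max_seq_len, n_heads, head_dim = 8192, 40, 128

# One up-front allocation at max_seq_len, reused for the whole session.
k_cache = torch.zeros(1, max_seq_len, n_heads, head_dim,
                      dtype=torch.float16, device="cuda")
v_cache = torch.zeros_like(k_cache)

def append_kv(step, k_new, v_new):
    # Each new token's keys/values are written into the preallocated slot instead
    # of concatenating onto a growing tensor, which is what fragments VRAM.
    k_cache[:, step] = k_new
    v_cache[:, step] = v_new
```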

Nah, it's fine. It explains it well enough, since it looks like they're a little behind with their packaging of ExLlama as a library.

@alkeryn I think it's a little premature to start demanding that the model understand multiple scales before there's anything to suggest it needs more than one scale. @kaiokendev I noticed...

@QM60 I'm not really having trouble running 8k contexts for 13B. But for 33B, yes, it's going to be trickier. I do have a second 24 GB GPU, luckily. So...
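
(The back-of-the-envelope math, assuming standard Llama dims and an fp16 cache:)

```python
# bytes ≈ 2 (K and V) * n_layers * seq_len * hidden_dim * 2 bytes per fp16 element
def kv_cache_gib(n_layers, hidden_dim, seq_len=8192):
    return 2 * n_layers * seq_len * hidden_dim * 2 / 1024**3

print(kv_cache_gib(40, 5120))   # 13B: ~6.3 GiB on top of the 4-bit weights
print(kv_cache_gib(60, 6656))   # 33B: ~12.2 GiB, next to ~17 GB of weights -> over 24 GB
```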

@Jeduh You're still teaching the model two different behaviors that have to coexist. Much harder than just modifying one existing behavior. And you need some kind of rationale anyway. What...