Results 180 comments of turboderp

It's not expected, no. I have no explanation for it. Are you able to generate anything while the GPU is in this state?

It must be a ROCm issue of some sort, because there's nothing running in the background, no threads or anything. There's the asynchronous device queue, but the host code synchronizes...
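For reference, a minimal sketch of what that host-side synchronization looks like from the Torch side (tensor sizes here are arbitrary):

```
import torch

# Kernel launches are queued on the device asynchronously and return to the host
# immediately; an explicit synchronize drains the queue, so nothing of ExLlama's
# should still be running on the GPU after the host-side call returns.
x = torch.randn(4096, 4096, device="cuda")
y = x @ x                    # enqueued, returns before the kernel finishes
torch.cuda.synchronize()     # host blocks here until the device queue is empty
```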

Well, it's a Torch kernel (`elementwise_kernel`) which unfortunately is called all the time for any sort of element-wise operation, so it's anyone's guess what it's doing. But it's definitely a...
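To see why the kernel name alone isn't informative, here is a rough profiling sketch (ops and sizes chosen arbitrarily); unrelated element-wise operations typically all show up under the same generic `elementwise_kernel` entries:

```
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4096, 4096, device="cuda")
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    _ = x + 1            # a plain addition...
    _ = torch.relu(x)    # ...and an activation both dispatch to elementwise kernels
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="cuda_time_total"))
```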

What is the sequence length set to in the model config? Maybe something weird is happening if you haven't changed it from the default (2048), and it tries to generate...

Is there more of this error message?

```
  File "/codebase/research/exllama/model.py", line 556, in forward
    cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds
```

It looks...
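For context, a minimal sketch of how that call fails, with assumed shapes: the cache is pre-allocated along its sequence dimension, and `narrow()` cannot return a view that runs past the end of it.

```
import torch

max_seq_len = 4096                                 # assumed pre-allocated length
key_states = torch.zeros(1, 32, max_seq_len, 128)  # (bsz, heads, seq, head_dim), assumed

past_len, q_len = 3313, 808                        # 3313 + 808 = 4121 > 4096
try:
    view = key_states.narrow(2, past_len, q_len)   # dim 2 is the sequence dimension here
except RuntimeError as e:
    print(e)                                       # same "start + length exceeds ..." error
```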

@w013nad You wouldn't need to hard-code new values into the config class. You can just override the values after creating the config. Also, it looks like that config file is...
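As a rough sketch of overriding values after creating the config rather than editing the class (attribute names follow the repo's example scripts, so treat them as assumptions for your version, and the paths are hypothetical):

```
from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("/path/to/config.json")
config.model_path = "/path/to/model.safetensors"
config.max_seq_len = 4096           # override after creating the config
config.compress_pos_emb = 2.0       # e.g. if the base model was trained on 2048

model = ExLlama(config)
cache = ExLlamaCache(model)
```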

This is a limitation of the safetensors library. It insists on memory-mapping the input tensor file, which means that even though it isn't actually reading more than a little bit...
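For illustration, a minimal sketch of the usual lazy-loading pattern (the file name is hypothetical): tensors are fetched one at a time, but safetensors still memory-maps the whole file, which is what inflates the reported memory usage.

```
from safetensors import safe_open

# Tensors are read one by one, but the library still memory-maps the entire
# file, so the process's reported memory can look much larger than what is read.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)   # pages in only this tensor's data
```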

I don't know the situation around running CUDA on Macs, if that's even possible, but yes, if you're trying to run it on Metal you definitely won't get very far....

ExLlama pre-allocates the whole context, so it uses the same amount of VRAM (roughly) no matter how long your context is. Setting the max sequence length to something really short...
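As a back-of-the-envelope sketch of why that is, with assumed numbers for a Llama-7B-style model and an fp16 cache (the real config may differ):

```
layers, heads, head_dim = 32, 32, 128     # assumed model shape
bytes_per_elem = 2                        # fp16 cache

def kv_cache_bytes(max_seq_len):
    # K and V are each pre-allocated for the full max_seq_len up front
    return 2 * layers * max_seq_len * heads * head_dim * bytes_per_elem

print(kv_cache_bytes(2048) / 1024**3)     # ~1.0 GiB reserved regardless of prompt length
print(kv_cache_bytes(256) / 1024**3)      # ~0.125 GiB with a much shorter max_seq_len
```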

That is pretty big. You're already bordering on 40 GB for the model + LoRA. Add a gigabyte or two for Torch, and even with GQA there isn't much left...
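Rough arithmetic behind that estimate, with assumed numbers (about 4.15 effective bits per weight for 4-bit GPTQ once group-size overhead is counted; the LoRA and KV cache come on top):

```
params = 70e9
weights_gb = params * 4.15 / 8 / 1e9   # ~36 GB of quantized weights
torch_gb = 1.5                         # rough allowance for the CUDA context etc.
print(weights_gb + torch_gb)           # ~38 GB before the LoRA and the (GQA) cache
```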