Results 180 comments of turboderp

It's not expected, no. I have no explanation for it. Are you able to generate anything while the GPU is in this state?

It must be a ROCm issue of some sort, because there's nothing running in the background, no threads or anything. There's the asynchronous device queue, but the host code synchronizes...
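For reference, a minimal sketch of what that host-side synchronization looks like from the Torch side (tensor sizes here are arbitrary):

```
import torch

# Kernel launches are queued on the device asynchronously and return to the host
# immediately; an explicit synchronize drains the queue, so nothing of ExLlama's
# should still be running on the GPU after the host-side call returns.
x = torch.randn(4096, 4096, device="cuda")
y = x @ x                    # enqueued, returns before the kernel finishes
torch.cuda.synchronize()     # host blocks here until the device queue is empty
```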

Well, it's a Torch kernel (`elementwise_kernel`) which unfortunately is called all the time for any sort of element-wise operation, so it's anyone's guess what it's doing. But it's definitely a...
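To see why the kernel name alone isn't informative, here is a rough profiling sketch (ops and sizes chosen arbitrarily); unrelated element-wise operations typically all show up under the same generic `elementwise_kernel` entries:

```
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4096, 4096, device="cuda")
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    _ = x + 1            # a plain addition...
    _ = torch.relu(x)    # ...and an activation both dispatch to elementwise kernels
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="cuda_time_total"))
```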

What is the sequence length set to in the model config? Maybe something weird is happening if you haven't changed it from the default (2048), and it tries to generate...

Is there more of this error message?

```
  File "/codebase/research/exllama/model.py", line 556, in forward
    cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (3313) + length (808) exceeds
```

It looks...
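For context, a minimal sketch of how that call fails, with assumed shapes: the cache is pre-allocated along its sequence dimension, and `narrow()` cannot return a view that runs past the end of it.

```
import torch

max_seq_len = 4096                                 # assumed pre-allocated length
key_states = torch.zeros(1, 32, max_seq_len, 128)  # (bsz, heads, seq, head_dim), assumed

past_len, q_len = 3313, 808                        # 3313 + 808 = 4121 > 4096
try:
    view = key_states.narrow(2, past_len, q_len)   # dim 2 is the sequence dimension here
except RuntimeError as e:
    print(e)                                       # same "start + length exceeds ..." error
```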

@w013nad You wouldn't need to hard-code new values into the config class. You can just override the values after creating the config. Also, it looks like that config file is...
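As a rough sketch of overriding values after creating the config rather than editing the class (attribute names follow the repo's example scripts, so treat them as assumptions for your version, and the paths are hypothetical):

```
from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("/path/to/config.json")
config.model_path = "/path/to/model.safetensors"
config.max_seq_len = 4096           # override after creating the config
config.compress_pos_emb = 2.0       # e.g. if the base model was trained on 2048

model = ExLlama(config)
cache = ExLlamaCache(model)
```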

This is a limitation of the safetensors library. It insists on memory-mapping the input tensor file, which means that even though it isn't actually reading more than a little bit...
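For illustration, a minimal sketch of the usual lazy-loading pattern (the file name is hypothetical): tensors are fetched one at a time, but safetensors still memory-maps the whole file, which is what inflates the reported memory usage.

```
from safetensors import safe_open

# Tensors are read one by one, but the library still memory-maps the entire
# file, so the process's reported memory can look much larger than what is read.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)   # pages in only this tensor's data
```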

I don't know the situation around running CUDA on Macs, if that's even possible, but yes, if you're trying to run it on Metal you definitely won't get very far....

ExLlama pre-allocates the whole context, so it uses the same amount of VRAM (roughly) no matter how long your context is. Setting the max sequence length to something really short...
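As a back-of-the-envelope sketch of why that is, with assumed numbers for a Llama-7B-style model and an fp16 cache (the real config may differ):

```
layers, heads, head_dim = 32, 32, 128     # assumed model shape
bytes_per_elem = 2                        # fp16 cache

def kv_cache_bytes(max_seq_len):
    # K and V are each pre-allocated for the full max_seq_len up front
    return 2 * layers * max_seq_len * heads * head_dim * bytes_per_elem

print(kv_cache_bytes(2048) / 1024**3)     # ~1.0 GiB reserved regardless of prompt length
print(kv_cache_bytes(256) / 1024**3)      # ~0.125 GiB with a much shorter max_seq_len
```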

That is pretty big. You're already bordering on 40 GB for the model + LoRA. Add a gigabyte or two for Torch, and even with GQA there isn't much left...
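Rough arithmetic behind that estimate, with assumed numbers (about 4.15 effective bits per weight for 4-bit GPTQ once group-size overhead is counted; the LoRA and KV cache come on top):

```
params = 70e9
weights_gb = params * 4.15 / 8 / 1e9   # ~36 GB of quantized weights
torch_gb = 1.5                         # rough allowance for the CUDA context etc.
print(weights_gb + torch_gb)           # ~38 GB before the LoRA and the (GQA) cache
```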