turboderp
It's not really the format that matters for supporting a LoRA, just which layers the adapters target and what datatype they're stored in. But I guess Ooba does have...
>"llama": ["q_proj", "v_proj"], Okay, so Q and V, that's what I was counting on. It should be simple enough. >Models do not change often/fast enough that dynanic loading of LoRA...
Well, LoRA support in ExLlama is still kind of experimental. It needs more testing and validation before I'd trust it. But it does *seem* to be working. And loading a...
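For anyone who wants to try it, this is roughly how the example scripts wire a LoRA in; the class names, constructor arguments and the `generator.lora` attribute are from my reading of the repo at the time, so treat them as assumptions rather than a stable API, and the paths are placeholders:

```python
# Rough sketch of attaching a LoRA to a quantized model in ExLlama.
# Imports assume you're running from a checkout of the repo
# (model.py, tokenizer.py, generator.py, lora.py at top level).
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
from lora import ExLlamaLora

config = ExLlamaConfig("models/llama-7b-4bit/config.json")
config.model_path = "models/llama-7b-4bit/model.safetensors"   # GPTQ weights

model = ExLlama(config)
tokenizer = ExLlamaTokenizer("models/llama-7b-4bit/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

# The adapter is read straight from the PEFT files (config + weights).
lora = ExLlamaLora(model, "loras/my_lora/adapter_config.json",
                   "loras/my_lora/adapter_model.bin")
generator.lora = lora

print(generator.generate_simple("Hello,", max_new_tokens=20))
```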
@fraferra I'm going to look into it, but I'm a little cautious because there's a bit of a performance hit even for a single LoRA.
There are some people already working on APIs. But it is [on my list](https://github.com/turboderp/exllama/blob/master/TODO.md). I just need to do a little more research to figure out what the best, minimal...
I'm already working on optimizing the implementation to work better at longer contexts. One of the changes is to automatically prevent the attention operations from scaling too wildly, by doing...
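One generic way to keep those intermediate attention buffers from blowing up with sequence length is to process the queries in chunks; here's a toy single-head PyTorch sketch of that idea, purely as an illustration and not the actual implementation, which lives in the CUDA extension:

```python
# Toy illustration of bounding attention memory at long context by chunking
# the queries: the score matrix is at most (chunk_size x seq_len) instead of
# (seq_len x seq_len). Single head, causal, no batching, for clarity only.
import torch
import torch.nn.functional as F

def chunked_causal_attention(q, k, v, chunk_size=512):
    # q, k, v: (seq_len, head_dim)
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5
    out = torch.empty_like(q)
    k_idx = torch.arange(seq_len, device=q.device)
    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        scores = (q[start:end] @ k.transpose(0, 1)) * scale   # (chunk, seq_len)
        # Causal mask: query position i attends only to keys <= i.
        q_idx = torch.arange(start, end, device=q.device).unsqueeze(1)
        scores = scores.masked_fill(k_idx.unsqueeze(0) > q_idx, float("-inf"))
        out[start:end] = F.softmax(scores, dim=-1) @ v
    return out
```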
Since this happens during loading, I suspect you're running out of memory. You'll sometimes just get CUDA illegal memory exceptions when that happens. But what is the model you're loading...
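A quick, generic way to see how much headroom each visible card actually has right before loading (plain PyTorch, nothing ExLlama-specific):

```python
# Print free/total VRAM per visible GPU; a card sitting near zero free
# memory is a good candidate for errors like that during loading.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)   # bytes
    print(f"cuda:{i}: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")
```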
No, it shouldn't need the second GPU if the model fits on the first. I guess it might be a bug. You could try with `export CUDA_VISIBLE_DEVICES=0` and without any...
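If it's easier, the same restriction can be applied from inside a script, as long as it happens before torch initializes CUDA (this is generic Python, not an ExLlama option):

```python
# Hide the second GPU so nothing can spill onto it; must be set before
# the first import of torch (or at least before CUDA is initialized).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())   # should now report only one device
```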
What's your GPU split in this case? And if you run `nvidia-smi` while the model is working, what's the output?
I can read it. The important thing is whether one card was right on the cusp of running out of memory, since that can sometimes give CUDA errors like that...