turboderp

180 comments by turboderp

So, I didn't read up on CFG yet, but it looks like you're essentially doing two generations in parallel and mixing the logits..? If that's the case, you would need...

Okay, I wrote up an example in `example_logit_mixing.py` of a way to do it using batching. I didn't call it a CFG example because I'm not sure if there are...
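For reference, the core of that batched approach is just running the two prompts as one batch and blending the two rows of logits before sampling. A minimal sketch of the mixing step (the values and the `scale` parameter here are illustrative, not ExLlama's actual API):

```python
import numpy as np

def mix_logits(cond_logits: np.ndarray, uncond_logits: np.ndarray,
               scale: float) -> np.ndarray:
    """CFG-style mix: push the conditional distribution away from
    the unconditional one by a guidance factor `scale`."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

# Run both prompts as a batch of 2, then mix row 0 (conditional)
# with row 1 (unconditional) before sampling the next token.
batch_logits = np.array([[2.0, 0.5, -1.0],   # conditional prompt
                         [1.0, 1.0,  0.0]])  # unconditional prompt
mixed = mix_logits(batch_logits[0], batch_logits[1], scale=1.5)
# mixed == [2.5, 0.25, -1.5]
```

With `scale = 1.0` this degenerates to the plain conditional logits; larger values amplify whatever the conditional prompt adds relative to the unconditional one.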

Wouldn't it be hard to do anything useful with the remaining VRAM, though? Fitting the model weights on the GPU is one thing, but to run inference you need quite...

In elements: 2 * num_layers * batch_size * num_attn_heads * key_value_dim * seq_len = 2 * num_layers * batch_size * hidden_dim * seq_len (since num_attn_heads * key_value_dim = hidden_dim). So for half precision, multiply the whole...
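Plugging in some Llama-7B-like shapes (hypothetical config values; substitute your own model's) makes the size concrete:

```python
# Hypothetical Llama-7B-like shapes; read these from your model config.
num_layers, hidden_dim = 32, 4096
batch_size, seq_len = 1, 2048
bytes_per_element = 2  # half precision (fp16)

# 2x for keys and values; num_attn_heads * key_value_dim == hidden_dim
cache_elements = 2 * num_layers * batch_size * hidden_dim * seq_len
cache_bytes = cache_elements * bytes_per_element
print(cache_bytes / 2**20, "MiB")  # 1024.0 MiB for this configuration
```

So a full 2048-token context at batch size 1 already costs about 1 GiB of VRAM on top of the weights, and it scales linearly with both batch size and sequence length.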

Silly me, I was thinking about swapping state in and out of VRAM. Of course you meant streaming just _weights_, which would be read only. I can't see why that...

I'll have a look later. It may be there's some delay there I just never noticed because prompt processing is literally a hundred times faster on the 4090, apparently. But...

Okay, that is quite a delay there. I had a look and there's no processing happening between when it prints the prompt speed and when it creates the frame to...

I'm going to be looking at LoRAs soon, probably over the weekend. Are there any particular adapters on HF you're interested in, just so I have some reference points?

If you merge the LoRA with the original model, convert that to GPTQ and load it in ExLlama, it should load correctly. As for loading the LoRA separately, support...
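The merge step itself is just folding the low-rank product back into the base weight, W' = W + (alpha/r) * B @ A, after which W' quantizes like any dense matrix. A sketch with made-up shapes (the function and dimensions are illustrative, not part of any particular LoRA repo):

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, r: int) -> np.ndarray:
    """Fold a standard LoRA adapter into the base weight:
    W' = W + (alpha / r) * B @ A. The merged W' can then be
    converted to GPTQ like any other dense weight matrix."""
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))   # low-rank down-projection
B = np.zeros((d_out, r))             # B is zero-initialized in training
W_merged = merge_lora(W, A, B, alpha=16, r=r)
assert np.allclose(W_merged, W)      # with B still zero, merging is a no-op
```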

I don't need to know about the dataset, but there are a bunch of different approaches to training LoRAs, lots of repos that use slightly different methods, adapting different layers...