dblacknc
Confirmed it now runs with --load-in-8bit
The bug report is that response N is printed in the verbose log only after prompt N+1 is entered. Sorry for any confusion from my also mentioning the entire history is...
If I run "netstat -an | fgrep :80" and watch for the TIME_WAIT connections to go away (no output), it'll then start. It has been a very long time since...
Looks like with --gpu-memory 7100MiB it starts pushing some layers to the CPU. I'm thinking part of the challenge is that, with the llava extension active, the GPU already has 1.8-2.0 GB...
The dividing line is between 7100, which pushes a few layers to the CPU, and 7200, which doesn't. However, with 7200 (and above) it overruns the 12 GB of VRAM with many prompts.
OK - confirmed, --pre_layer is allowing CPU offload to work with GPTQ. I found a couple of other related things: --auto-devices seems to be unconditionally enabled. I can omit it and...
OK - thanks for the explanation and the pointer to the LLaVA README for more info. Closing, as it looks like it's not a bug, and I'll assume for now the...
this model: wojtab_llava-13b-v0-4bit-128g
I reported the same in issue #1632
I'm just trying RWKV and it's working well for me. Not running in a container though. I'm using an Ubuntu 22.04 KVM VM with 64 GB RAM and passing through...