MillionthOdin16

Results 85 comments of MillionthOdin16

Okay, we need some mmap people in here then, because something definitely changed with it, and users aren't getting a clear indication of what's going on other than horrible...

Are you using mlock? I think what's happening is the mmap is allowing you to load a larger model than you'd normally be able to load because you don't have...
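
The comment above hints at why mmap can make a model seem to load even when it exceeds free RAM. A minimal sketch (this is not llama.cpp's own code, just an illustration of the mechanism): memory-mapped pages are faulted in lazily from disk, so the mapping succeeds regardless of file size, whereas mlock forces pages resident and will fail or thrash when the file is too large.

```python
import mmap
import os
import tempfile

def mmap_first_bytes(path, n=4):
    """Map a file and touch only its first n bytes.

    With mmap, only the pages actually accessed are paged into RAM,
    which is why a model larger than free memory can still 'load'.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[:n]

# Usage: map a small stand-in "model file" and peek at its header bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ggml-model-bytes")
    path = tmp.name
print(mmap_first_bytes(path))
os.unlink(path)
```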

I haven't seen any case where setting your thread count high significantly improves people's performance. If you're on Intel you want to set your thread count to the number...

@abetlen Here's something that seemed interesting from vicuna that I just saw. I can definitely see the challenge trying to adapt to all these different input formats. This seemed like...

Change `model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits)` to `model = load_quant(str(path_to_model), str(pt_path), shared.args.gptq_bits, -1)`. Per the args documentation, -1 sets the default group size.
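
A toy stand-in illustrating the shape of that fix (the names and defaults here are illustrative, not the real GPTQ-for-LLaMa API): the loader gained a trailing group-size parameter, and passing -1 selects the default behavior.

```python
def load_quant(model, checkpoint, wbits, groupsize=-1):
    """Toy stand-in for a GPTQ-style loader, not the real implementation.

    The real function grew a group-size argument; callers that omitted it
    broke, and passing -1 restores the default (no grouping).
    """
    if groupsize == -1:
        groupsize = None  # default: no grouping
    return {"model": model, "checkpoint": checkpoint,
            "wbits": wbits, "groupsize": groupsize}

# Usage mirroring the fixed call site (paths are placeholders):
cfg = load_quant("path_to_model", "model-4bit.pt", 4, -1)
print(cfg["groupsize"])
```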

Yea, it looks like there are more issues with the GPTQ changes today than just syntax. I rolled back the GPTQ repo to yesterday's version, without any of his changes from today...

I actually don't know anymore... It seems like it might be more broken than I thought. I'm using the pre-quantized models from HF, so you might be right about versions...

> If anyone needs a known good hash to roll back to, you can reset here (make sure to run this in the GPTQ-for-LLaMa repo, of course) > > ```...

I wonder if they are actually testing on a quantized model, or a non-quantized one. I don't know where to go from here haha