Judd

57 comments of Judd

@MoonRide303 The bug related to Qwen2.5 1.5B is fixed now. It was caused not by the tied embeddings but by buffer allocation (a buffer was not properly aligned). Use `-ngl 100,prolog,epilog` to run the whole...
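
For reference, a full-offload invocation might look like the sketch below (the binary path, the model file name, and the `-m` flag are assumptions; only `-ngl 100,prolog,epilog` comes from the comment above):

```sh
# Offload all layers plus the prolog/epilog stages to the GPU.
# Binary path and model file name are placeholders for your build.
./build/bin/main -m qwen2.5-1.5b-q8.bin -ngl 100,prolog,epilog
```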

This is strange. You can check your CPU utilization, and try `-n 96`.

I have tested this data, using "hello" as the question. With Q8 quantization, it took less than 2 seconds on an 8-core 7735. https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/docs/source/guides/create_endpoint.mdx Let me assume that the model file is...

You can find some quantized models (BGE-Reranker included) here: https://modelscope.cn/models/judd2024/chatllm_quantized_models/files I have tested both Q8 and Q4_1. This model is very small, and throughput should be much higher.

On a 96-core machine, data shows that throughput saturates at just 48 threads, so RAM throughput is now the bottleneck. A simple calculation: assuming the model file size is 2GB,...
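
As a sketch of that kind of estimate (the bandwidth figure is an assumed round number, not a measurement): every generated token must stream essentially the whole weight file through RAM, so

```math
\text{tokens/s} \lesssim \frac{\text{RAM bandwidth}}{\text{model size}} \approx \frac{100\ \text{GB/s (assumed)}}{2\ \text{GB}} = 50\ \text{tokens/s}
```

and adding threads beyond the point where this bound is reached gains nothing.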

Oh, such a calculation applies to token generation, but not to batch prompt evaluation.

GPU acceleration looks ok for bge-reranker-m3. https://github.com/foldl/chatllm.cpp/blob/master/docs/gpu.md

chatllm.cpp is not a downstream app of llama.cpp; it is an app based on ggml, just like llama.cpp. It supports some models that are not supported by llama.cpp. I won't wait for...

@lexasub Not ready yet. You can convert models with `convert.py` in one pass.
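
A sketch of what that one-pass conversion could look like (the `-i`/`-o`/`-t` flag names and the `q8_0` type string are assumptions modeled on common ggml-converter conventions, not verified against `convert.py`):

```sh
# Convert a HF checkpoint straight to a quantized model file in one pass.
# -i/-o/-t are assumed flag names; q8_0 is an assumed quantization type.
python convert.py -i path/to/hf-model -o model-q8.bin -t q8_0
```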

We are not going to support GGUF. See ggmm.md