
bge-reranker is extremely slow

RobinQu opened this issue 1 year ago • 10 comments

With 686 tokens, a single run takes more than 6 seconds on a 96-core machine.

Here is the profiling data for the compute graph: bge-reranker-dump.txt

Any advice for better performance?

RobinQu avatar Jun 20 '24 08:06 RobinQu

This is strange. You can check your CPU utilization, and try -n 96.

foldl avatar Jun 20 '24 13:06 foldl

> This is strange. You can check your CPU utilization, and try -n 96.

I wrote a simple test to reproduce the issue: https://github.com/foldl/chatllm.cpp/pull/25

PS: I accidentally created a PR, which has been closed. Please just ignore the related notifications.

RobinQu avatar Jun 20 '24 13:06 RobinQu

I tested with num_thread=1 and num_thread=96. The single-thread setup is slower than the 96-thread setup. Within a loop of 100 iterations, all cores are fully utilized, so I believe the threads are correctly scheduled.
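
A minimal sketch of this kind of measurement (`qa_rank_once` is a hypothetical placeholder for the real chatllm.cpp reranker call; the actual test is in the PR linked above):

```cpp
#include <chrono>
#include <cstdio>
#include <initializer_list>

// Hypothetical stand-in for one rerank pass with the given thread count;
// see the linked test for the actual chatllm.cpp invocation.
static void qa_rank_once(int num_threads) {
    (void)num_threads;
}

int main() {
    for (int num_threads : {1, 96}) {
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100; ++i)  // 100 iterations, as described above
            qa_rank_once(num_threads);
        const auto end = std::chrono::steady_clock::now();
        const long long ms =
            std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::printf("num_threads=%d, elapsed=%lld ms\n", num_threads, ms);
    }
    return 0;
}
```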

RobinQu avatar Jun 20 '24 13:06 RobinQu

I have tested with this data, using "hello" as the question. With Q8 quantization, it took less than 2 seconds on an 8-core 7735.

https://raw.githubusercontent.com/huggingface/hf-endpointsdocumentation/main/docs/source/guides/create_endpoint.mdx

Let me assume that the model file is saved on an SSD. The result (>6 sec) looks impossible.

foldl avatar Jun 21 '24 04:06 foldl

> I have tested with this data, using "hello" as the question. With Q8 quantization, it took less than 2 seconds on an 8-core 7735.
>
> https://raw.githubusercontent.com/huggingface/hf-endpointsdocumentation/main/docs/source/guides/create_endpoint.mdx
>
> Let me assume that the model file is saved on an SSD. The result (>6 sec) looks impossible.

Could you share the SHA256 of the checkpoint you are using, so that I can check whether my conversion is valid?

The bin file is converted using your convert.py script, and its SHA256 digest is b3e05dbe06c0aa52fd974d9c9dedbc51292b81f2f285d56113c060a0931a7f0f.

RobinQu avatar Jun 21 '24 08:06 RobinQu

You can find some quantized models (BGE-Reranker included) here:

https://modelscope.cn/models/judd2024/chatllm_quantized_models/files

I have tested both Q8 and Q4_1. This model is very small, and the throughput should be much higher.

foldl avatar Jun 21 '24 11:06 foldl

> You can find some quantized models (BGE-Reranker included) here:
>
> https://modelscope.cn/models/judd2024/chatllm_quantized_models/files
>
> I have tested both Q8 and Q4_1. This model is very small, and the throughput should be much higher.

Well, I downloaded the models you mentioned and re-ran the tests.

With the Q4 variants of the BGE models, latency is indeed around 2 s. I was previously testing with Q8 models.

https://github.com/RobinQu/chatllm.cpp/blob/perf/test.cpp

qa_rank: num_threads=192, elapsed=1785
qa_rank: num_threads=96, elapsed=758
qa_rank: num_threads=48, elapsed=762
qa_rank: num_threads=24, elapsed=1013
qa_rank: num_threads=12, elapsed=1545
qa_rank: num_threads=6, elapsed=2439
qa_rank: num_threads=3, elapsed=4390
qa_rank: num_threads=1, elapsed=11932
text_embedding: num_threads=192, elapsed=3444
text_embedding: num_threads=96, elapsed=745
text_embedding: num_threads=48, elapsed=754
text_embedding: num_threads=24, elapsed=1017
text_embedding: num_threads=12, elapsed=1546
text_embedding: num_threads=6, elapsed=2432
text_embedding: num_threads=3, elapsed=4374
text_embedding: num_threads=1, elapsed=11878

BTW, it seems that using more than 12 threads doesn't help much in terms of latency. I would try a multi-instance setup for higher throughput in production.

Any other advice?

RobinQu avatar Jun 23 '24 05:06 RobinQu

On a 96-core machine, the data shows that throughput saturates at just 48 threads, so RAM throughput is the bottleneck now. A simple calculation: assuming the model file size is 2 GB, 400 tokens per second requires RAM throughput > 800 GB/s.
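
To spell out that estimate (the 2 GB model size and 400 tokens/s target are the assumptions stated above, not measurements), a minimal sketch of the arithmetic:

```cpp
#include <cstdio>

int main() {
    // Assumptions from the comment above (not measured values):
    const double model_gb = 2.0;        // assumed model file size in GB
    const double tokens_per_sec = 400;  // assumed target generation throughput
    // If every generated token streams the full set of weights from RAM once:
    std::printf("required RAM bandwidth ~ %.0f GB/s\n", model_gb * tokens_per_sec);  // 800 GB/s
    return 0;
}
```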

foldl avatar Jun 23 '24 06:06 foldl

Oh, such a calculation applies to token generation, but not to batched prompt evaluation, where each weight read from RAM is reused across all tokens in the batch.

foldl avatar Jun 24 '24 01:06 foldl

The EPYC 9004 series claims 460 GB/s of memory bandwidth in a single-socket configuration. But the benchmarks show that inference doesn't benefit much from more than 48 threads, or from multiple instances. So I think you are right about the RAM-throughput bottleneck.
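
Turning the same (assumed) numbers around, the quoted single-socket bandwidth gives a rough upper bound on generation throughput; a minimal sketch:

```cpp
#include <cstdio>

int main() {
    // Assumptions: the 460 GB/s single-socket figure quoted above, and the 2 GB
    // model size assumed earlier in the thread; generation streams all weights per token.
    const double bandwidth_gb_per_sec = 460.0;
    const double model_gb = 2.0;
    std::printf("generation upper bound ~ %.0f tokens/s\n",
                bandwidth_gb_per_sec / model_gb);  // ~230 tokens/s
    return 0;
}
```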

Maybe optimizations like FlashAttention should be considered, but I am not sure whether it performs well on CPU.

RobinQu avatar Jun 24 '24 03:06 RobinQu

GPU acceleration looks ok for bge-reranker-m3.

https://github.com/foldl/chatllm.cpp/blob/master/docs/gpu.md

foldl avatar Feb 10 '25 15:02 foldl