RobinQu

46 comments of RobinQu

> Thanks for the quick response. Glad to know that you already have a good idea on how to implement callbacks. > > Also, I understand the problems you mentioned....
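For context, the callback hooks discussed there usually reduce to something like the sketch below: a per-token function the decode loop invokes, with the return value controlling early cancellation. The `StreamCallback` name, the `generate` function, and the signature are illustrative assumptions, not the project's actual API.

```cpp
#include <functional>
#include <iostream>
#include <string>

// Hypothetical streaming-callback shape, for illustration only:
// the generator invokes the callback once per decoded token and
// stops early if the callback returns false.
using StreamCallback = std::function<bool(const std::string& token)>;

void generate(const std::string& prompt, const StreamCallback& on_token) {
    // Placeholder loop standing in for the real decode loop.
    for (std::string tok : {"Hello", ",", " world"}) {
        if (!on_token(tok)) {
            break;  // caller asked to cancel generation
        }
    }
}

int main() {
    generate("hi", [](const std::string& tok) {
        std::cout << tok << std::flush;
        return true;  // keep generating
    });
    std::cout << "\n";
}
```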

ggml with CUDA: see llama.cpp's server-cuda Dockerfile at https://github.com/ggerganov/llama.cpp/blob/a27152b602b369e76f85b7cb7b872a321b7218f7/.devops/llama-server-cuda.Dockerfile#L12

It's low priority, but since it touches so many things, it's worth noting.

Blocked by https://github.com/conan-io/conan/issues/16574

# General and tool-use

* Not yet implemented
  * Stream API
  * Context compression
* Properties like `tool_resources` and `temperature` work only with `Assistant`. Some may not work on `Thread`... (the fallback behavior is sketched below)
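A sketch of the property resolution implied by the last bullet: thread-level values fall back to assistant-level defaults when unset. The types and field names here are hypothetical stand-ins, not the project's real schema.

```cpp
#include <cstdio>
#include <optional>
#include <string>

// Hypothetical property carriers; the field names mirror the
// OpenAI-style options mentioned above, not the project's actual types.
struct AssistantOptions {
    double temperature = 1.0;
    std::string tool_resources;  // placeholder for the real structure
};

struct ThreadOptions {
    std::optional<double> temperature;  // unset -> fall back to Assistant
};

// Resolve the effective temperature: Thread overrides Assistant when set.
double effective_temperature(const AssistantOptions& a, const ThreadOptions& t) {
    return t.temperature.value_or(a.temperature);
}

int main() {
    AssistantOptions a;   // assistant default: temperature = 1.0
    ThreadOptions t;      // no thread-level override set
    std::printf("effective temperature: %.1f\n", effective_temperature(a, t));
    t.temperature = 0.2;  // thread-level override takes precedence
    std::printf("with override: %.1f\n", effective_temperature(a, t));
}
```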

> This is strange. You can check your CPU utilization, and try `-n 96`. I wrote a simple test to reproduce the issue: https://github.com/foldl/chatllm.cpp/pull/25 PS. I accidentally created a PR...

I tested with `num_thread=1` and `num_thread=96`. The single-thread setup is slower than the 96-thread setup. Within a loop of 100 iterations, all cores are fully saturated, so I believe the...
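For reference, the comparison boils down to a harness like this minimal sketch: split a fixed workload across N threads and compare wall time for 1 vs 96. The workload and iteration counts are placeholders; the actual test lives in the PR linked above.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Dummy compute kernel standing in for the real per-thread work.
static void burn(std::size_t iters) {
    volatile double x = 0.0;
    for (std::size_t i = 0; i < iters; ++i) x = x + 1e-9;
}

// Run a fixed total workload split evenly across n threads; return seconds.
static double run_with_threads(unsigned n, std::size_t total_iters) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(burn, total_iters / n);
    for (auto& t : pool) t.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
    return dt.count();
}

int main() {
    const std::size_t total = 400'000'000;  // arbitrary fixed workload
    for (unsigned n : {1u, 96u})
        std::printf("threads=%u  wall=%.2fs\n", n, run_with_threads(n, total));
}
```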

> I have tested this data, using "hello" as the question. With Q8 quantization, it took less than 2 sec on an 8-core 7735. > > https://raw.githubusercontent.com/huggingface/hf-endpoints-documentation/main/docs/source/guides/create_endpoint.mdx > > Let me assume...

> You can find some quantized models (BGE-Reranker included) here: > > https://modelscope.cn/models/judd2024/chatllm_quantized_models/files > > I have tested both Q8 and Q4_1. This model is very small, and throughput should...

The EPYC 9004 series claims 460 GB/s of memory bandwidth in a single-socket configuration. But the benchmarks show that inference won't benefit much from more than 48 threads, or from running multiple instances...
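A back-of-the-envelope check of why this is bandwidth-bound rather than core-bound: every generated token must stream the whole weight set from RAM, so tokens/s is capped by bandwidth divided by model size. The 7B parameter count below is an assumed example, not from the thread; 460 GB/s is the figure quoted above, and the bits-per-weight values follow ggml's Q8_0 (34 bytes per 32 weights) and Q4_1 (20 bytes per 32 weights) block layouts.

```cpp
#include <cstdio>

// Roofline-style estimate: tokens/s <= bandwidth / model_bytes,
// since each token pass reads all weights from memory.
int main() {
    const double bandwidth_gb_s = 460.0;         // single-socket EPYC 9004 claim
    const double params_b       = 7.0;           // billions of weights (assumed)
    const double q8_gb  = params_b * 1.0625;     // ggml Q8_0 ~ 8.5 bits/weight
    const double q4_gb  = params_b * 0.625;      // ggml Q4_1 ~ 5.0 bits/weight
    std::printf("Q8_0 upper bound: ~%.0f tok/s\n", bandwidth_gb_s / q8_gb);
    std::printf("Q4_1 upper bound: ~%.0f tok/s\n", bandwidth_gb_s / q4_gb);
}
```

Once enough threads (around 48 here) saturate that bandwidth, additional threads or additional instances only contend for the same bytes, which matches the benchmark results.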