Eric Curtin
Hmmm weird, I hope we can get to the bottom of this
Should we flip --keep-groups to be on by default?
There's a fairly portable QUIC implementation now that the likes of curl use: https://github.com/ngtcp2/ngtcp2
I think the code from https://github.com/ggml-org/llama.cpp/pull/17554 is the most usable CLI for "chat". It works well, and you get all the functionality of the llama-server monolith binary. We should replace:...
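For context on how thin such a chat front-end can stay: below is a minimal C++ sketch that drives a running llama-server through its OpenAI-compatible `/v1/chat/completions` endpoint using libcurl. The host/port (`localhost:8080`, the server default) and the one-shot prompt are assumptions for illustration; this is not the code from the PR.

```cpp
// Minimal sketch (illustrative, not the PR's code): a thin "chat" client
// that posts one user message to a running llama-server instance.
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: accumulate the response body into a std::string.
static size_t on_body(char * data, size_t size, size_t nmemb, void * userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL * curl = curl_easy_init();
    if (!curl) {
        return 1;
    }

    // No "model" field: a single-model llama-server answers with its loaded model (assumption).
    const std::string payload = R"({"messages":[{"role":"user","content":"Hello!"}]})";

    std::string response;
    curl_slist * headers = curl_slist_append(nullptr, "Content-Type: application/json");

    // Assumes llama-server is listening on its default host/port.
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/v1/chat/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, payload.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_body);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    const CURLcode rc = curl_easy_perform(curl);
    if (rc == CURLE_OK) {
        std::cout << response << std::endl;  // raw JSON; a real CLI would parse choices[0].message
    } else {
        std::cerr << "request failed: " << curl_easy_strerror(rc) << std::endl;
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```

This builds with a plain `g++ chat.cpp -lcurl`; the point is only that a dedicated chat CLI can stay small by reusing the server's existing functionality.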
> > and I would like to push forward with this PR regardless of what way this goes medium-term.
>
> If I understand correctly, by saying this, you mean...
> > I don't believe the barrier to entry is that high.
>
> Oh then you don't know how many users have given up on using llama.cpp just because...
> One more thought came to mind: how flexible are we on keeping the CLI C++? I think looking at Python or JavaScript might be worthwhile. For example, leverage the...
Not against llama.cpp and Docker Model Runner teaming up also...
> Just want to relay Georgi's comment [#16603 (comment)](https://github.com/ggml-org/llama.cpp/pull/16603#issuecomment-3588688625) here:
>
> > I am OK with reorganizing the `llama-cli` tool and related if you have specific ideas - feel...
Thanks for highlighting this, @kiview. Yes, we are actively working on this; it is not ready yet, but please feel free to collaborate with us on it: https://github.com/docker/inference-engine-vllm https://github.com/vllm-project/vllm/pull/26160