Eric Curtin


We should probably wrap "llama-run" in a "ramalama-runner" python3 script while we are at it; that will prove useful.
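A wrapper like that might look something like the following minimal sketch. Only the "llama-run" and "ramalama-runner" names come from the comment above; the structure and any injected defaults are assumptions.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a "ramalama-runner" wrapper around llama-run.
# The layout here is illustrative; it is not the real script.
import os
import shutil
import sys


def main() -> int:
    llama_run = shutil.which("llama-run")
    if llama_run is None:
        print("llama-run not found in PATH", file=sys.stderr)
        return 1
    # Forward all arguments unchanged; a real wrapper could inject
    # defaults (model path, context size, GPU flags) before exec'ing.
    os.execv(llama_run, [llama_run, *sys.argv[1:]])


if __name__ == "__main__":
    sys.exit(main())
```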

@rhatdan @slemeur "ramalama serve"/"ramalama-server" is the endpoint Podman AI Lab would talk to if the projects were to converge. I would suggest Podman AI Lab ignore commands like "ramalama run"...

@sallyom @codefromthecrypt @vpavlin I figured it out: `ramalama serve some_model` and `podman run -it --rm --network slirp4netns:allow_host_loopback=true -e OPENAI_API_BASE_URL=http://host.containers.internal:8080 -p 3000:8080 -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main` get Open WebUI working...
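A quick way to sanity-check the serving side before wiring Open WebUI to it is to hit the endpoint directly from the host. This sketch assumes the port 8080 implied by the command above and a llama.cpp-style OpenAI-compatible `/v1` API; adjust to your setup.

```python
# Check that the ramalama serve endpoint is answering.
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: port used in the example above

with urllib.request.urlopen(f"{BASE_URL}/v1/models") as resp:
    models = json.load(resp)

print(json.dumps(models, indent=2))
```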

Still going to implement the multiple-models-under-the-same-endpoint feature eventually, though.

@sallyom https://github.com/containers/ramalama/pull/1009 fixes the first part of the issue you raised.

I need to stick this kind of code in the middle (juggling a few plates at the moment): https://github.com/ericcurtin/anythingproxy
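The general shape of that "code in the middle" (one endpoint in front, several model servers behind, routing by the requested model) could look something like the sketch below. It is not anythingproxy's actual implementation; the backend port mapping and model names are assumptions.

```python
# Minimal sketch of routing OpenAI-style requests to different backends
# based on the "model" field in the request body.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# assumption: one llama.cpp-style server per model, each on its own port
BACKENDS = {
    "my-chat-model": "http://localhost:8081",
    "my-embeddings-model": "http://localhost:8082",
}


class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        model = json.loads(body or b"{}").get("model", "")
        backend = BACKENDS.get(model)
        if backend is None:
            self.send_error(404, f"unknown model: {model}")
            return
        # Forward the request unchanged to the chosen backend.
        req = urllib.request.Request(
            backend + self.path, data=body,
            headers={"Content-Type": "application/json"}, method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(data)


if __name__ == "__main__":
    # Single front-door port the clients talk to.
    HTTPServer(("0.0.0.0", 8080), ProxyHandler).serve_forever()
```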

> So I have another [conference soon in NZ](https://www.cloudnativesummit.co/), so wondering if we're close on this.
>
> The use case is `ramalama serve my-chat-model my-embeddings-model` or just `serve` and...

A similar feature is being implemented in llama.cpp: https://github.com/ggml-org/llama.cpp/issues/13367. I think we should continue with our own version in RamaLama regardless.

Somewhat a chicken-and-egg scenario, but we could theoretically release first and push containers after (defaulting to :latest when the new container isn't available yet).
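The fallback could be as simple as the sketch below: try the versioned tag, and drop to :latest when it has not been pushed yet. This is not RamaLama's actual image-resolution code, and the repository and version strings are placeholders; it only assumes podman is on the PATH.

```python
# Hypothetical ":latest fallback" for a not-yet-pushed versioned container tag.
import subprocess


def resolve_image(repo: str, version: str) -> str:
    for tag in (version, "latest"):
        image = f"{repo}:{tag}"
        # "podman pull" returns non-zero if the tag does not exist.
        if subprocess.run(["podman", "pull", image]).returncode == 0:
            return image
    raise RuntimeError(f"no usable tag found for {repo}")


if __name__ == "__main__":
    print(resolve_image("quay.io/ramalama/ramalama", "0.7.0"))  # placeholder repo/version
```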