Is it possible to add an OpenAI-compatible embedding endpoint?
I was wondering whether there's any possibility of adding an OpenAI-compatible embeddings endpoint, or whether that wouldn't make sense with the underlying models.
My goal is to swap llamafile in as an alternative backend for our OpenAI-based RAG chat. It works great for /chat/completions, but I also need an embeddings endpoint to support vector searches.
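For context, here's roughly what our client code looks like (a minimal sketch using the openai Python package; the base URL, port, and model names are placeholders for our setup):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llamafile server; port is an assumption
    api_key="sk-no-key-required",         # llamafile doesn't check the key
)

# This part already works against llamafile today:
chat = client.chat.completions.create(
    model="LLaMA_CPP",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)

# This is the call we'd also like to point at llamafile:
emb = client.embeddings.create(
    model="text-embedding-ada-002",  # what we use with OpenAI today
    input="a query to embed for vector search",
)
print(len(emb.data[0].embedding))
```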
Thanks!
The server currently has its own /embedding API, but it'd be nice to have an OpenAI-compatible endpoint too. Anyone else who wants this, please leave a comment telling us so. This would also be a good thing to propose to the upstream project, since if they do it before us, it'll get merged here the next time we do a sync.
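For anyone who lands here in the meantime, the existing native endpoint can be called like this (a minimal sketch assuming the default port 8080; the request/response shape follows the llama.cpp server README):

```python
import requests

# POST to the server's native (non-OpenAI) embedding route.
resp = requests.post(
    "http://localhost:8080/embedding",
    json={"content": "Hello, world!"},
)
resp.raise_for_status()
vector = resp.json()["embedding"]  # a single list of floats
print(len(vector))
```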
I'd very much appreciate this as well
I think this would make so much sense to add and would be totally in line with Mozilla's "healthy internet" mission.
It would make it so much easier to adapt tools originally designed only for OpenAI to be usable in a local context too, helping grow an ecosystem around llamafile.
Totally agree. I have been playing with LangChain, and interestingly it works with llamafile. I got stuck when using OpenAIEmbeddings from LangChain. It would be great to have an OpenAI-compliant embeddings endpoint, so that many retrieval-based tasks can be performed using llamafile.
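Roughly what I tried, for reference (a sketch, not exact code; the base_url assumes llamafile's default port, and check_embedding_ctx_length is turned off to skip OpenAI-specific tiktoken token counting):

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    base_url="http://localhost:8080/v1",  # llamafile server, assumed port
    api_key="sk-no-key-required",         # llamafile ignores the key
    check_embedding_ctx_length=False,     # send raw strings, not token arrays
)

# Fails today because llamafile has no OpenAI-compatible /v1/embeddings yet.
vectors = embeddings.embed_documents(["first document", "second document"])
```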
Definitely would be a welcome addition, yes! :+1:
Edit: There is already an issue in llama.cpp about lots of features that should go into the server; I added a comment there about this issue.
The OpenAI-compatible embeddings endpoint is directly mentioned there, I now realize: https://github.com/ggerganov/llama.cpp/issues/4216#issuecomment-1858542650
llama.cpp has had this for three weeks now: https://github.com/ggerganov/llama.cpp/commit/c82d18e863fcde91b4b1109b1d0c73ea4470c405, so I guess it will be available the next time llama.cpp is synced. Note that you also have to pass --embedding to the server, or it will generate empty vectors (https://github.com/ggerganov/llama.cpp/blob/5207b3fbc500f89dfe528693e96540956dbaed96/examples/server/README.md?plain=1#L35). As of about two weeks ago it also supports BERT models: https://github.com/ggerganov/llama.cpp/pull/5423. I was able to convert https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 with llama.cpp's convert-hf-to-gguf.py script and use it just fine.
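To illustrate, once that lands in llamafile, something like this should work (a sketch assuming a server started with --embedding on port 8080, and the OpenAI-style response shape from the upstream commit):

```python
import requests

# OpenAI-compatible route added upstream in llama.cpp's server.
resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={"input": "Hello, world!"},
)
resp.raise_for_status()
data = resp.json()["data"]  # list of {"embedding": [...], "index": ...}
print(len(data[0]["embedding"]))
```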
Hi @jart @stlhood, I wanted to follow up to see when the next llama.cpp sync release of llamafile might be available, so it can pick up the upstream OpenAI-compatible embeddings endpoint:
https://github.com/ggerganov/llama.cpp/pull/5190