OpenAI API compatibility
Any chance you would consider mirroring OpenAI's API specs and output? e.g., /completions and /chat/completions. That way, it could be a drop-in replacement for the Python openai package just by changing the base URL.
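For illustration, the drop-in swap could look something like this (a purely hypothetical sketch: it assumes ollama exposed an OpenAI-style /v1 API on its default port 11434, and the model name is just an example):

```python
# Hypothetical sketch: assumes ollama served an OpenAI-compatible /v1 API on port 11434.
import openai  # the pre-1.0 openai package, which lets you override api_base

openai.api_base = "http://localhost:11434/v1"  # point at the local server instead of api.openai.com
openai.api_key = "unused"                      # a local server would likely ignore the key

response = openai.ChatCompletion.create(
    model="llama2",  # whatever model has been pulled locally (example name)
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response["choices"][0]["message"]["content"])
```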
That would be awesome and also embeddings!
yup I'll +1 on this too :-)
+1
+1
this would be a big win
prior work: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md and https://github.com/ggerganov/llama.cpp/blob/master/examples/server/api_like_OAI.py
yeah would be great!
Thanks for the issue and comments, all! Sorry for not replying sooner. Which clients/use cases are you looking to use that require the OpenAI API? Quite a few folks have mentioned LlamaIndex (also: see #278!) Would love to know!
Interoperability with OpenAI projects, like Auto-GPT. If you check https://github.com/go-skynet/LocalAI, you can see that their API works with pretty much every project that uses the OpenAI endpoint, in most cases you just need to point an Environment Variable to it.
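For most of these tools the redirect comes down to a couple of environment variables set before the tool starts, roughly like the sketch below (variable names vary by project; the port shown is LocalAI's default and is an assumption):

```python
# Typical environment-variable redirect for tools built on the openai package.
# Exact variable names and the port depend on the project/server (assumptions here).
import os

os.environ["OPENAI_API_BASE"] = "http://localhost:8080/v1"  # local OpenAI-compatible server
os.environ["OPENAI_API_KEY"] = "sk-ignored"                 # usually ignored by local servers
```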
www.galactus.ai also
I was looking to connect to it with both Continue.dev (which supports Ollama explicitly) and LocalAI, so interop was my hope as well.
I'd love to be able to do this. I'm specifically looking at running ToolBench, MetaGPT and ChatDev. I have MetaGPT ready to test with this if we get this working.
I'd like to throw in that Ironclad's Rivet application expects an OpenAI API endpoint as well: https://github.com/Ironclad/rivet
+1. I would like to use ollama as a target for LibreChat: https://github.com/danny-avila/LibreChat/tree/main
+1
Yes, this would be a plus one from me if we can get this working with the OpenAI API specs. Can someone notify me when this is done? I might forget, and this was one of the reasons I took a look at this project.
This would be pretty cool, since Nextcloud instances could use a locally running ollama server. Nextcloud itself ships with OpenAI/LocalAI compatibility (through a plugin).
AutoGen would be another usecase - https://microsoft.github.io/autogen/blog/2023/07/14/Local-LLMs/
+1
I'm surprised LiteLLM hasn't been mentioned in the thread yet. I found it in the README.md of the Ollama repo today. "Call LLM APIs using the OpenAI format", 100+ of them, including Ollama. This worked for me:
pip install litellm
ollama pull codellama
litellm --model ollama/codellama --api_base http://localhost:11434 --temperature 0.3 --max_tokens 2048
Double check that the port, model name and parameters match your configuration and VRAM situation.
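As a quick smoke test of the proxy (a sketch; it assumes the default localhost:8000 address from the LiteLLM startup output and an OpenAI-shaped JSON response):

```python
# Minimal smoke test against the LiteLLM proxy started above.
import requests

resp = requests.post(
    "http://localhost:8000/chat/completions",
    json={
        "model": "ollama/codellama",
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```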
As an example, Continue.dev configuration then goes like this, OpenAI style:
default=OpenAI(
api_key="IGNORED",
model="ollama/codellama",
context_length=2048,
api_base="http://your_litellm_hostname:8000"
),
Set context_length and max_tokens as appropriate. 2048 is a conservative value if you're gpu-poor or aren't sure.
Note that LiteLLM/Uvicorn opens the API at 0.0.0.0:8000; it's not confined to localhost by default, so people can piggyback on your server if it's not on a private network. I believe you need to edit the litellm source code if you want to serve localhost only, then pip install -e . from that local clone before running litellm.
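A rough way to check your exposure from the same machine is a plain socket probe like the one below (not a LiteLLM feature, just a generic check; the LAN-IP lookup is an assumption and may resolve to a loopback address on some distros):

```python
# Generic reachability probe: if the proxy answers on the machine's LAN address
# (not just 127.0.0.1), other hosts on the network can reach it too.
import socket

lan_ip = socket.gethostbyname(socket.gethostname())  # may resolve to 127.0.x.x on some setups
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2)
    reachable = s.connect_ex((lan_ip, 8000)) == 0
print(f"LiteLLM proxy reachable on {lan_ip}:8000 -> {reachable}")
```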
Thanks for mentioning us @vividfog! (I'm the maintainer of LiteLLM.) We allow you to create an OpenAI-compatible proxy server for ollama.
Here's a link to the section on our docs on how to do this: https://docs.litellm.ai/docs/proxy_server
Please let me know how we can make it better for the ollama community😃
Hey @vividfog thanks for this incredible tutorial.
I added it to our docs and gave you credit for it.
Docs: https://docs.litellm.ai/docs/proxy_server#tutorial-use-with-aiderautogencontinue-dev
If you have a twitter/linkedin - happy to link to that instead!
Wow, thanks for pointing to litellm @vividfog.
For anyone on Arch Linux (btw) and interested, I came up with a PKGBUILD that sets up litellm with ollama as a systemd service. You can check it out on the AUR. Feel free to get back to me with any feedback!
I learned today that my initial advice was not complete. Continue.dev sends two parallel queries: one for the user task and another to summarize the conversation. The LiteLLM logs may then show an error from Ollama after the second call. There's a client-side fix for this.
This Continue.dev configuration imports a wrapper that makes all calls sequential, queued:
- Import the QueuedLLM wrapper near the top of config.py:
from continuedev.src.continuedev.libs.llm.queued import QueuedLLM
- The server calls can now be made sequential like this:
models=Models(
default=QueuedLLM(
llm=OpenAI(
api_key="IGNORED",
model="ollama/codellama",
context_length=2048,
api_base="http://localhost:8000"
)
)
),
This may now be leaning off-topic vs. the original issue, but hope it helps those who used the previous advice. The friendly developers at Continue.dev Github/Discord are there if needed. I learned about the QueuedLLM wrapper initially in their Discord.
What remains a little confusing is that previously I've seen Ollama handle parallel API calls in sequence, or was I hallucinating? Not sure why QueuedLLM() is then needed, but if the shoe fits, wear it I guess. Material for another issue if someone wants to drill down and verify.
What I really like is how these 3 projects work together without knowing about each other at code level, as if following the same plan. That indeed is the benefit of following the same API conventions, the topic of this issue.
I realise it's probably my lack of knowledge that is the problem, but my front end can use either LM Studio or oobabooga/text-generation-webui simply by changing the base API URL.
I wanted to try Ollama because it seems to do a lot of things simpler/faster.
But not supporting what seems to be developing into the go-to API format, the OpenAI API, is a big minus. (I realise this is free and I don't want to be a chooser/beggar, just trying to provide feedback.)
I tried LiteLLM, and it's not a drop-in replacement, and now what was supposed to be simple needs to be debugged.
So my feedback is: I hope Ollama will natively support the OpenAI API rather than rely on an external library that might seem easy for people who know their stuff, but not as easy for people who came to Ollama for its simplicity.
I'm leaving my LiteLLM error log below just as a reference; I know it's not this project.
@mac ~ % litellm --drop_params --debug --model ollama/dolphin --api_base http://localhost:11434
ollama called
INFO: Started server process [42896]
INFO: Waiting for application startup.
#------------------------------------------------------------#
# #
# 'The thing I wish you improved is...' #
# https://github.com/BerriAI/litellm/issues/new #
# #
#------------------------------------------------------------#
Thank you for using LiteLLM! - Krrish & Ishaan
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
Docs: https://docs.litellm.ai/docs/simple_proxy
LiteLLM: Test your local endpoint with: "litellm --test" [In a new terminal tab]
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
litellm.caching: False; litellm.caching_with_models: False; litellm.cache: None
kwargs[caching]: False; litellm.cache: None
LiteLLM completion() model= dolphin
LiteLLM: Params passed to completion() {'functions': [], 'function_call': '', 'temperature': 0.7, 'top_p': 0.9, 'n': None, 'stream': False, 'stop': ['<.>'], 'max_tokens': 4096, 'presence_penalty': 0.5, 'frequency_penalty': 0.5, 'logit_bias': {}, 'user': '', 'model': 'dolphin', 'custom_llm_provider': 'ollama', 'repetition_penalty': 1.1, 'top_k': 20}
LiteLLM: Non-Default params passed to completion() {'temperature': 0.7, 'top_p': 0.9, 'stream': False, 'stop': ['<.>'], 'max_tokens': 4096, 'presence_penalty': 0.5, 'frequency_penalty': 0.5}
self.optional_params: {'num_predict': 4096, 'temperature': 0.7, 'top_p': 0.9, 'repeat_penalty': 0.5, 'stop_sequences': ['<.>'], 'repetition_penalty': 1.1, 'top_k': 20}
Logging Details Pre-API Call for call id b91948c3-ba26-4ebc-a140-c141a9e68764
MODEL CALL INPUT: {'model': 'dolphin', 'messages': [{'role': 'system', 'content': "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."}, {'role': 'user', 'content': 'USER : Tell me what you are in one phrase. ASSISTANT: '}], 'optional_params': {'num_predict': 4096, 'temperature': 0.7, 'top_p': 0.9, 'repeat_penalty': 0.5, 'stop_sequences': ['<.>'], 'repetition_penalty': 1.1, 'top_k': 20}, 'litellm_params': {'return_async': False, 'api_key': None, 'force_timeout': 600, 'logger_fn': None, 'verbose': False, 'custom_llm_provider': 'ollama', 'api_base': 'http://localhost:11434', 'litellm_call_id': 'b91948c3-ba26-4ebc-a140-c141a9e68764', 'model_alias_map': {}, 'completion_call_id': None, 'metadata': None, 'stream_response': {}}, 'start_time': datetime.datetime(2023, 11, 11, 10, 0, 17, 953683), 'input': "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER : Tell me what you are in one phrase. ASSISTANT: ", 'api_key': None, 'additional_args': {'complete_input_dict': {'text': "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER : Tell me what you are in one phrase. ASSISTANT: ", 'num_predict': 4096, 'temperature': 0.7, 'top_p': 0.9, 'repeat_penalty': 0.5, 'stop_sequences': ['<.>'], 'repetition_penalty': 1.1, 'top_k': 20}}, 'log_event_type': 'pre_api_call'}
Logging Details: logger_fn - None | callable(logger_fn) - False
Logging Details LiteLLM-Failure Call
self.failure_callback: []
An error occurred: Failed to parse: http://localhost:11434dolphin/generation
Debug this by setting `--debug`, e.g. `litellm --model gpt-3.5-turbo --debug`
INFO: 127.0.0.1:61413 - "POST /chat/completions HTTP/1.1" 200 OK
I agree with the comment by @MilleniumDawn about the speed of litellm vs. the native ollama server. I may be wrong, but I have noticed from the native ollama server logs that my WSL GPU is being used, e.g. the following server message:
"ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce GTX 1660 Ti with Max-Q Design, compute capability 7.5"
I suspect that the litellm server or workers are not using my GPU. If that is the case, it would explain the difference in speed.
Any comments/advice will be very welcome.
@PetrarcaBruto nvidia-smi should show the ollama runner process if the GPU is utilized, like this:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:00:06.0 Off | 0 |
| N/A 37C P0 38W / 250W | 15261MiB / 40960MiB | 16% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 501 C ...p/gguf/build/cuda/bin/ollama-runner 15248MiB |
+---------------------------------------------------------------------------------------+
+1
Hey @MilleniumDawn, I found the issue - the request was being misrouted. Just pushed a fix - https://github.com/BerriAI/litellm/commit/1738341dcb16884bfff42a0b2004ba5afd856c5d
Should be live in v1.0.2 by EOD. I'm really sorry for that.
@PetrarcaBruto re: litellm workers
For ollama specifically - we check if you're making an ollama call, and run ollama serve in a separate worker - https://github.com/BerriAI/litellm/blob/c7780cbc40b6d34144677d7979ba4318f0a0d5a9/litellm/proxy/proxy_cli.py#L20
open to suggestions for how we can improve this further.
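Roughly, the shape of it (a simplified sketch, not the exact code in the file linked above):

```python
# Sketch of the idea only: start `ollama serve` as a background process so the
# OpenAI-compatible proxy has a backend listening on http://localhost:11434.
import atexit
import subprocess

ollama_proc = subprocess.Popen(
    ["ollama", "serve"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
atexit.register(ollama_proc.terminate)  # shut the backend down when the proxy exits

# ... the proxy server would be started here and forward requests to localhost:11434 ...
```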
@kylemclaren & @krrishdholakia thanks for the tips. I found that my GPU is also being used when running litellm, which is good news.
That would be a great addition. I would love to use Ollama with TypingMind.