OpenAI API compatibility

Open handrew opened this issue 1 year ago • 45 comments

Any chance you would consider mirroring OpenAI's API specs and output? e.g., /completions and /chat/completions. That way, it could be a drop-in replacement for the Python openai package by changing out the url.
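
To illustrate the idea with a sketch (not something Ollama supports today, so the endpoint below is hypothetical): with an OpenAI-compatible route, the unmodified openai Python package (pre-1.0 style) could talk to a local server just by overriding the base URL, either in code or via the OPENAI_API_BASE environment variable.

    import openai

    # Hypothetical: point the stock openai client at a local OpenAI-compatible
    # server instead of api.openai.com. The same override also works without
    # code changes by setting the OPENAI_API_BASE environment variable.
    openai.api_base = "http://localhost:11434/v1"  # assumed local endpoint
    openai.api_key = "IGNORED"  # local servers generally ignore the key

    response = openai.ChatCompletion.create(
        model="llama2",  # whatever model the local server has pulled
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response["choices"][0]["message"]["content"])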

handrew avatar Aug 07 '23 22:08 handrew

That would be awesome and also embeddings!

priamai avatar Aug 10 '23 09:08 priamai

yup I'll +1 on this too :-)

hakt0-r avatar Aug 11 '23 02:08 hakt0-r

+1

kamuridesu avatar Aug 11 '23 19:08 kamuridesu

+1

loyaliu avatar Aug 30 '23 12:08 loyaliu

this would be a big win

prior work: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

and

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/api_like_OAI.py

colinricardo avatar Sep 01 '23 23:09 colinricardo

yeah, would be great!

ValValu avatar Sep 02 '23 01:09 ValValu

Thanks for the issue and comments, all! Sorry for not replying sooner. Which clients/use cases are you looking to use that require the OpenAI API? Quite a few folks have mentioned LlamaIndex (also: see #278!). Would love to know!

jmorganca avatar Sep 07 '23 13:09 jmorganca

Interoperability with OpenAI projects, like Auto-GPT. If you check https://github.com/go-skynet/LocalAI, you can see that their API works with pretty much every project that uses the OpenAI endpoint, in most cases you just need to point an Environment Variable to it.

kamuridesu avatar Sep 07 '23 14:09 kamuridesu

www.galactus.ai also

colinricardo avatar Sep 07 '23 17:09 colinricardo

I was looking to connect to it with both Continue.dev (which supports Ollama explicitly) and LocalAI, so interop was my hope as well.

cori avatar Sep 08 '23 11:09 cori

I'd love to be able to do this. I'm specifically looking at running ToolBench, MetaGPT and ChatDEV. I have MetaGPT ready to test with this if we get this working.

MchLrnX avatar Sep 19 '23 19:09 MchLrnX

I'd like to throw in that Ironclad's Rivet application also expects an OpenAI API endpoint: https://github.com/Ironclad/rivet

comalice avatar Sep 28 '23 21:09 comalice

+1. I would like to use ollama as a target for LibreChat: https://github.com/danny-avila/LibreChat/tree/main

mjtechguy avatar Sep 29 '23 03:09 mjtechguy

+1

jtoy avatar Sep 29 '23 13:09 jtoy

Yes, this would be a plus one if we can get this working with the OpenAI API specs. Can someone notify me when this is done? I might forget, and this was one of the reasons I took a look at this project.

Anon2578 avatar Sep 30 '23 21:09 Anon2578

This would be pretty cool since Nextcloud instances could use a locally running ollama server. Nextcloud itself ships with OpenAI/LocalAI compatibility (through a plugin).

shtrophic avatar Oct 01 '23 04:10 shtrophic

AutoGen would be another use case - https://microsoft.github.io/autogen/blog/2023/07/14/Local-LLMs/

Nivek92 avatar Oct 04 '23 15:10 Nivek92

+1

rcalv002 avatar Oct 07 '23 20:10 rcalv002

I'm surprised LiteLLM hasn't been mentioned in the thread yet. Found it in the README.md of the Ollama repo today: "Call LLM APIs using the OpenAI format", 100+ of them, including Ollama. This worked for me:

pip install litellm

ollama pull codellama

litellm --model ollama/codellama --api_base http://localhost:11434 --temperature 0.3 --max_tokens 2048

Double check that the port, model name and parameters match your configuration and VRAM situation.
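
For a quick sanity check of the proxy (a sketch, assuming it is listening on port 8000 as started above), a plain HTTP POST in the OpenAI chat format should come back with an OpenAI-shaped response:

    import requests

    # Assumes the LiteLLM proxy started above is listening on localhost:8000.
    resp = requests.post(
        "http://localhost:8000/chat/completions",
        json={
            "model": "ollama/codellama",
            "messages": [{"role": "user", "content": "Write a haiku about llamas."}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])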

As an example, Continue.dev configuration then goes like this, OpenAI style:

        default=OpenAI(
            api_key="IGNORED",
            model="ollama/codellama",
            context_length=2048,
            api_base="http://your_litellm_hostname:8000"
        ),

Set context_length and max_tokens as appropriate. 2048 is a conservative value if you're gpu-poor or aren't sure.

Note that LiteLLM/Uvicorn opens the API at 0.0.0.0:8000; it's not confined to localhost by default, so people can piggyback on your server if it's not on a private network. I believe you need to edit the litellm source code here if you want to serve only localhost, then pip install -e . from that local clone before running litellm.

vividfog avatar Oct 07 '23 22:10 vividfog

Thanks for mentioning us, @vividfog! (I'm the maintainer of LiteLLM.) We allow you to create an OpenAI-compatible proxy server for ollama.

Here's a link to the section on our docs on how to do this: https://docs.litellm.ai/docs/proxy_server

Please let me know how we can make it better for the ollama community😃

ishaan-jaff avatar Oct 07 '23 23:10 ishaan-jaff

Hey @vividfog thanks for this incredible tutorial.

I added it to our docs and gave you credit for it.

Docs: https://docs.litellm.ai/docs/proxy_server#tutorial-use-with-aiderautogencontinue-dev

If you have a twitter/linkedin - happy to link to that instead!

krrishdholakia avatar Oct 08 '23 00:10 krrishdholakia

Wow, thanks for pointing to litellm @vividfog.

For anyone on Arch Linux (btw) and interested, I came up with a PKGBUILD that sets up litellm with ollama as a systemd service. You can check it out on the AUR. Feel free to get back to me with any feedback!

shtrophic avatar Oct 08 '23 16:10 shtrophic

I learned today that my initial advice was not complete. Continue.dev sends two parallel queries: one for the user task and another to summarize the conversation, and the LiteLLM logs may show an error from Ollama after the second call. There's a client-side fix for this.

This Continue.dev configuration imports a wrapper that makes all calls sequential, queued:

  1. Import the QueuedLLM wrapper near the top of config.py:

    from continuedev.src.continuedev.libs.llm.queued import QueuedLLM

  2. The server calls can now be made sequential like this:
    models=Models(
        default=QueuedLLM(
            llm=OpenAI(
                api_key="IGNORED",
                model="ollama/codellama",
                context_length=2048,
                api_base="http://localhost:8000"
            )
        )
    ),

This may now be leaning off-topic vs. the original issue, but I hope it helps those who used the previous advice. The friendly developers at the Continue.dev GitHub/Discord are there if needed. I learned about the QueuedLLM wrapper initially in their Discord.

What remains a little confusing is that previously I've seen Ollama handle parallel API calls in sequence, or was I hallucinating? Not sure why QueuedLLM() is then needed, but if the shoe fits, wear it I guess. Material for another issue if someone wants to drill down and verify.

What I really like is how these 3 projects work together without knowing about each other at the code level, as if following the same plan. That indeed is the benefit of following the same API conventions, which is the topic of this issue.

vividfog avatar Oct 08 '23 20:10 vividfog

I realise it's probably my lack of knowledge that is the problem, but my front end can use either LM Studio or oobabooga/text-generation-webui simply by changing the base_api.

I wanted to try Ollama because it seems to do a lot of things simpler/faster.

But not supporting what seems to be developing into the go-to API format, the OpenAI API, is a big minus. (I realise this is free, I don't want to be a chooser/beggar, just trying to provide feedback.)

I tried LiteLLM, and it's not a drop-in replacement, and now what was supposed to be simple needs to be debugged.

So my feedback is: I hope Ollama will natively support the OpenAI API rather than rely on an external library that might seem easy for people who know their stuff, but not as easy for people who came to Ollama for its simplicity.

I'm leaving my LiteLLM error log just as a reference; I know it's not this project.

@mac ~ % litellm --drop_params --debug --model ollama/dolphin --api_base http://localhost:11434
ollama called
INFO:     Started server process [42896]
INFO:     Waiting for application startup.

#------------------------------------------------------------#
#                                                            #
#            'The thing I wish you improved is...'            #
#        https://github.com/BerriAI/litellm/issues/new        #
#                                                            #
#------------------------------------------------------------#

 Thank you for using LiteLLM! - Krrish & Ishaan



Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new

Docs: https://docs.litellm.ai/docs/simple_proxy

LiteLLM: Test your local endpoint with: "litellm --test" [In a new terminal tab]


INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
litellm.caching: False; litellm.caching_with_models: False; litellm.cache: None
kwargs[caching]: False; litellm.cache: None

LiteLLM completion() model= dolphin

LiteLLM: Params passed to completion() {'functions': [], 'function_call': '', 'temperature': 0.7, 'top_p': 0.9, 'n': None, 'stream': False, 'stop': ['<.>'], 'max_tokens': 4096, 'presence_penalty': 0.5, 'frequency_penalty': 0.5, 'logit_bias': {}, 'user': '', 'model': 'dolphin', 'custom_llm_provider': 'ollama', 'repetition_penalty': 1.1, 'top_k': 20}

LiteLLM: Non-Default params passed to completion() {'temperature': 0.7, 'top_p': 0.9, 'stream': False, 'stop': ['<.>'], 'max_tokens': 4096, 'presence_penalty': 0.5, 'frequency_penalty': 0.5}
self.optional_params: {'num_predict': 4096, 'temperature': 0.7, 'top_p': 0.9, 'repeat_penalty': 0.5, 'stop_sequences': ['<.>'], 'repetition_penalty': 1.1, 'top_k': 20}
Logging Details Pre-API Call for call id b91948c3-ba26-4ebc-a140-c141a9e68764
MODEL CALL INPUT: {'model': 'dolphin', 'messages': [{'role': 'system', 'content': "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."}, {'role': 'user', 'content': 'USER : Tell me what you are in one phrase. ASSISTANT: '}], 'optional_params': {'num_predict': 4096, 'temperature': 0.7, 'top_p': 0.9, 'repeat_penalty': 0.5, 'stop_sequences': ['<.>'], 'repetition_penalty': 1.1, 'top_k': 20}, 'litellm_params': {'return_async': False, 'api_key': None, 'force_timeout': 600, 'logger_fn': None, 'verbose': False, 'custom_llm_provider': 'ollama', 'api_base': 'http://localhost:11434', 'litellm_call_id': 'b91948c3-ba26-4ebc-a140-c141a9e68764', 'model_alias_map': {}, 'completion_call_id': None, 'metadata': None, 'stream_response': {}}, 'start_time': datetime.datetime(2023, 11, 11, 10, 0, 17, 953683), 'input': "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER : Tell me what you are in one phrase. ASSISTANT: ", 'api_key': None, 'additional_args': {'complete_input_dict': {'text': "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER : Tell me what you are in one phrase. ASSISTANT: ", 'num_predict': 4096, 'temperature': 0.7, 'top_p': 0.9, 'repeat_penalty': 0.5, 'stop_sequences': ['<.>'], 'repetition_penalty': 1.1, 'top_k': 20}}, 'log_event_type': 'pre_api_call'}


Logging Details: logger_fn - None | callable(logger_fn) - False
Logging Details LiteLLM-Failure Call
self.failure_callback: []
An error occurred: Failed to parse: http://localhost:11434dolphin/generation

 Debug this by setting `--debug`, e.g. `litellm --model gpt-3.5-turbo --debug`
INFO:     127.0.0.1:61413 - "POST /chat/completions HTTP/1.1" 200 OK

MilleniumDawn avatar Nov 11 '23 15:11 MilleniumDawn

I agree with @MilleniumDawn's comment about the speed of litellm vs. the native ollama server. I may be wrong, but I have noticed in the native ollama server logs that my WSL GPU is being used, e.g. the following server message:

"ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce GTX 1660 Ti with Max-Q Design, compute capability 7.5"

I suspect that the litellm server or workers are not using my GPU. If that is the case, it would explain the difference in speed.

Any comments/advice would be very welcome.

PetrarcaBruto avatar Nov 14 '23 04:11 PetrarcaBruto

@PetrarcaBruto nvidia-smi should show the ollama runner process if GPU is utilized, like this:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   37C    P0              38W / 250W |  15261MiB / 40960MiB |     16%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       501      C   ...p/gguf/build/cuda/bin/ollama-runner    15248MiB |
+---------------------------------------------------------------------------------------+

kylemclaren avatar Nov 14 '23 06:11 kylemclaren

+1

ghost avatar Nov 14 '23 23:11 ghost

Hey @MilleniumDawn, I found the issue - it was being misrouted. Just pushed a fix - https://github.com/BerriAI/litellm/commit/1738341dcb16884bfff42a0b2004ba5afd856c5d

Should be live in v1.0.2 by EOD. I'm really sorry for that.

@PetrarcaBruto re: litellm workers

For ollama specifically - we check if you're making an ollama call, and run ollama serve in a separate worker - https://github.com/BerriAI/litellm/blob/c7780cbc40b6d34144677d7979ba4318f0a0d5a9/litellm/proxy/proxy_cli.py#L20

open to suggestions for how we can improve this further.

krrishdholakia avatar Nov 15 '23 02:11 krrishdholakia

@kylemclaren & @krrishdholakia thanks for the tips. I found that my GPU is being used also when running litellm which is good news.

PetrarcaBruto avatar Nov 15 '23 05:11 PetrarcaBruto

That would be a great addition. I would love to use Ollama with TypingMind.

patrickdobler avatar Nov 16 '23 05:11 patrickdobler