optillm
(MOA) Fails with "List Index Out of Range" Error on OpenAI-Compatible Ollama API Endpoint
Description:
Using the MOA approach with Ollama via its OpenAI-compatible endpoint results in a "list index out of range" error. The request fails to return a valid response.
Steps to Reproduce:
- Backend Configuration (optillm.py): Modified get_config() in optillm.py to set the API key and base URL explicitly (somehow this did not work via command-line parameters; it would hit the OpenAI URL directly no matter which base_path/url I set): default_client = OpenAI(api_key="ollama", base_url="http://192.168.1.224:11434/v1"). A sketch of this change follows below.
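This is roughly how the workaround looks in context. It is only a sketch: the real get_config() in optillm.py may be structured differently, and the hard-coded address is specific to my setup.

from openai import OpenAI

def get_config():
    # Workaround: hard-code the Ollama OpenAI-compatible endpoint instead of
    # relying on command-line parameters / environment variables, which kept
    # falling back to the real OpenAI URL for me. Structure is illustrative;
    # the actual get_config() in optillm.py may differ.
    API_KEY = "ollama"
    default_client = OpenAI(api_key=API_KEY, base_url="http://192.168.1.224:11434/v1")
    return default_client, API_KEY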
Code Snippet:
import requests
import json

url = "http://192.168.1.247:8000/v1/chat/completions"
payload = json.dumps({
    "model": "qwen2.5:72b-instruct-q4_K_S",
    "messages": [
        {"role": "user", "content": "<optillm_approach>moa</optillm_approach> Dwarf Fortress Production Chain are how many?"}
    ],
    "temperature": 0.2
})
headers = {
    'Authorization': 'Bearer ollama',
    'Content-Type': 'application/json'
}
response = requests.post(url, headers=headers, data=payload)
print(response.text)
Run the Code:
- Execute the script with Python, or send the equivalent request manually with curl.
Expected Behavior:
The API should return a valid response containing the number of Dwarf Fortress Production Chains based on the MOA approach.
Actual Behavior:
The API returns an error message:
{
"error": "list index out of range"
}
Additional Information:
- API Endpoint: http://192.168.1.247:8000/v1/chat/completions (OpenAI-compatible Ollama)
- Model Used: qwen2.5:72b-instruct-q4_K_S
- MOA Approach Tag: <optillm_approach>moa</optillm_approach>
- Request Method: POST
- Headers:
- Authorization: Bearer ollama
- Content-Type: application/json
Potential Causes:
- Incorrect model configuration.
- Data structure issues in the payload.
- API endpoint bug with MOA approach.
Suggested Fixes (and whom should this be addressed to?):
- Verify model compatibility with MOA and OpenAI-compatible endpoint.
- Report to Ollama API support for further investigation.
Thank you for addressing this issue.
I have read this:
https://github.com/codelion/optillm/blob/193ab3c4d54f5f2e2c47525293bd7827b609675f/README.md?plain=1#L51
but it seems that not everything is supported, even though Ollama is mentioned there?
It looks like Ollama will not produce 3 completions.
def mixture_of_agents(system_prompt: str, initial_query: str, client, model: str) -> str:
    moa_completion_tokens = 0
    completions = []
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": initial_query}
        ],
        max_tokens=60000,  # 20000, # 4096,
        n=3,
        temperature=1
    )
    completions = [choice.message.content for choice in response.choices]
    moa_completion_tokens += response.usage.completion_tokens
The "list index out of range" looked at first like it came from somewhere else, but it comes directly from moa.py. Could this be done in a loop instead?
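To make the failure concrete: when the backend ignores n=3 and returns a single choice, completions has length 1, and whatever step in moa.py later combines the three candidates indexes past the end of the list. A minimal illustration of the failure mode (not the actual moa.py code):

# Backend ignored n=3 and returned only one choice:
completions = ["the single generated answer"]   # len(completions) == 1

# Any later step that assumes three candidates then blows up:
combined = (
    f"Candidate 1: {completions[0]}\n"
    f"Candidate 2: {completions[1]}\n"   # IndexError: list index out of range
    f"Candidate 3: {completions[2]}"
)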
2024-10-13 13:18:05,096 - ERROR - Error processing request: list index out of range
2024-10-13 13:18:05,097 - INFO - 192.168.1.247 - - [13/Oct/2024 13:18:05] "POST /v1/chat/completions HTTP/1.1" 500 -
In hindsight it seems easy to see, but only after I had debugged it. Maybe an explicit "unsupported" indicator in the log would be better.
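Something like the following guard, placed right after the completions are collected, would turn the confusing IndexError into an explicit message (just a suggestion with a hypothetical helper name, not existing optillm code):

def ensure_n_choices(response, n_requested: int) -> None:
    # Fail loudly when the backend silently ignores n and returns fewer
    # choices than requested (e.g. Ollama / llama.cpp backends).
    if len(response.choices) < n_requested:
        raise ValueError(
            f"Backend returned {len(response.choices)} completion(s) although "
            f"n={n_requested} was requested; multiple completions appear to be "
            f"unsupported by this endpoint."
        )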
Like you, I recently tried MOA, MCTS, BON, and PVG with Ollama, but encountered issues.
The primary reason is that Ollama, similar to llama.cpp, doesn't support multiple completions. (Ollama uses llama.cpp internally.)
I've forked optillm and made modifications to address this limitation. https://github.com/s-hironobu/optillm
While I can't guarantee the correctness of my solution, please try it out if you like.
If you try my version, please read: https://github.com/s-hironobu/optillm?tab=readme-ov-file#quick-start-with-ollama
I used to have an implementation similar to what you have below, but unfortunately it is not equivalent to doing multiple generations from the model with a single request, due to how decoding is done during response generation.
for _ in range(n):
    response = self.client.chat.completions.create(
        model=self.model,
        messages=messages,
        max_tokens=4096,
        n=1,
        temperature=1
    )
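For completeness, a standalone version of that loop that also collects the outputs (hypothetical helper name; as noted above, this is not equivalent to a single n=3 request due to how decoding is done during response generation):

def generate_n_completions(client, model, messages, n=3, max_tokens=4096):
    # Work around backends (like Ollama / llama.cpp) that ignore n > 1 by
    # issuing n separate requests and collecting the single choice from each.
    completions = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            n=1,
            temperature=1
        )
        completions.append(response.choices[0].message.content)
    return completions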
I will update the README to add a note that ollama also doesn't generate multiple responses. Thanks.
Yes, I added it in the same way for my run.
BTW: https://github.com/codelion/optillm/blob/41821aacdc70b9c6b65f4663eabbf3aa230cd37d/optillm/bon.py#L30
Here my LLM will not answer with only a number. I guess the user instruction has higher priority than the system prompt (depending on the prompts?). Adapting the last message worked, but it would be better to consider an in-string parsing approach or something similar.
Enhancement: I am working on simulating a streaming response (the final response currently feels like a plain copy-paste in Open WebUI since there is no streaming) and on the possibility of yielding the thinking process back in the response before the final answer, to make the wait feel less long.
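A rough sketch of what such a simulated stream could look like, assuming a Flask-style endpoint (the log lines above suggest a Werkzeug/Flask server, but the exact integration point in optillm is not shown here) and the OpenAI chat.completion.chunk SSE format; none of this is existing optillm code:

import json
import time
from flask import Response

def simulate_stream(final_text: str, model: str, chunk_size: int = 40) -> Response:
    # Re-chunk an already complete answer into OpenAI-style SSE chunks so that
    # clients like Open WebUI render it progressively instead of all at once.
    def generate():
        for i in range(0, len(final_text), chunk_size):
            chunk = {
                "object": "chat.completion.chunk",
                "model": model,
                "choices": [{
                    "index": 0,
                    "delta": {"content": final_text[i:i + chunk_size]},
                    "finish_reason": None,
                }],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
            time.sleep(0.02)  # small artificial delay; tune or drop as needed
        yield "data: [DONE]\n\n"

    return Response(generate(), mimetype="text/event-stream")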
Regarding "it would be better to consider an in-string parsing approach or something similar": how would this look? Do you have a better prompt that works?
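For illustration only, one possible shape of such an in-string extraction, i.e. pulling the rating out of a free-form reply instead of requiring the model to answer with a bare number (a hypothetical helper, not code from bon.py, and the 0-10 range is an assumption):

import re

def extract_rating(text: str, lo: int = 0, hi: int = 10) -> int:
    # Grab the first integer anywhere in the reply and clamp it to the
    # expected range; fall back to the lower bound if no number is found.
    match = re.search(r"-?\d+", text)
    if match is None:
        return lo
    return max(lo, min(hi, int(match.group())))

# Example: extract_rating("I would rate this response an 8 out of 10.") -> 8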