optillm
(MOA) Fails with "List Index Out of Range" Error on OpenAI-Compatible Ollama API Endpoint
Description:
Using the MOA approach with Ollama via its OpenAI-compatible endpoint results in a "list index out of range" error. The request fails to return a valid response.
Steps to Reproduce:
- Backend Configuration (optillm.py): Modified get_config() in optillm.py to set the API key and base URL explicitly (somehow this did not work via command-line parameters; it would hit the OpenAI URL directly no matter which base_path/url I set): default_client = OpenAI(api_key="ollama", base_url="http://192.168.1.224:11434/v1"). A sketch of this change follows below.
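This is roughly how the workaround looks in context. It is only a sketch: the real get_config() in optillm.py may be structured differently, and the hard-coded address is specific to my setup.

from openai import OpenAI

def get_config():
    # Workaround: hard-code the Ollama OpenAI-compatible endpoint instead of
    # relying on command-line parameters / environment variables, which kept
    # falling back to the real OpenAI URL for me. Structure is illustrative;
    # the actual get_config() in optillm.py may differ.
    API_KEY = "ollama"
    default_client = OpenAI(api_key=API_KEY, base_url="http://192.168.1.224:11434/v1")
    return default_client, API_KEY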
Code Snippet:
import requests
import json

url = "http://192.168.1.247:8000/v1/chat/completions"
payload = json.dumps({
    "model": "qwen2.5:72b-instruct-q4_K_S",
    "messages": [
        {"role": "user", "content": "<optillm_approach>moa</optillm_approach> Dwarf Fortress Production Chain are how many?"}
    ],
    "temperature": 0.2
})
headers = {
    'Authorization': 'Bearer ollama',
    'Content-Type': 'application/json'
}
response = requests.post(url, headers=headers, data=payload)
print(response.text)
Run the Code:
- Execute the script with Python, or send the equivalent request manually with curl.
Expected Behavior:
The API should return a valid response containing the number of Dwarf Fortress Production Chains based on the MOA approach.
Actual Behavior:
The API returns an error message:
{
"error": "list index out of range"
}
Additional Information:
- API Endpoint: http://192.168.1.247:8000/v1/chat/completions (OpenAI-compatible Ollama)
- Model Used: qwen2.5:72b-instruct-q4_K_S
- MOA Approach Tag: <optillm_approach>moa</optillm_approach>
- Request Method: POST
- Headers:
- Authorization: Bearer ollama
- Content-Type: application/json
Potential Causes:
- Incorrect model configuration.
- Data structure issues in the payload.
- API endpoint bug with MOA approach.
Suggested Fixes (and whom should this be addressed to?):
- Verify model compatibility with MOA and OpenAI-compatible endpoint.
- Report to Ollama API support for further investigation.
Thank you for addressing this issue.
I have read this:
https://github.com/codelion/optillm/blob/193ab3c4d54f5f2e2c47525293bd7827b609675f/README.md?plain=1#L51
but it seems that not everything is supported, even though Ollama is mentioned there?
It looks like Ollama will not produce 3 completions.
def mixture_of_agents(system_prompt: str, initial_query: str, client, model: str) -> str:
    moa_completion_tokens = 0
    completions = []
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": initial_query}
        ],
        max_tokens=60000,  # 20000, # 4096,
        n=3,
        temperature=1
    )
    completions = [choice.message.content for choice in response.choices]
    moa_completion_tokens += response.usage.completion_tokens
The "list index out of range" looked at first like it came from somewhere else, but it comes directly from moa.py. Could this be done in a loop instead?
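To make the failure concrete: when the backend ignores n=3 and returns a single choice, completions has length 1, and whatever step in moa.py later combines the three candidates indexes past the end of the list. A minimal illustration of the failure mode (not the actual moa.py code):

# Backend ignored n=3 and returned only one choice:
completions = ["the single generated answer"]   # len(completions) == 1

# Any later step that assumes three candidates then blows up:
combined = (
    f"Candidate 1: {completions[0]}\n"
    f"Candidate 2: {completions[1]}\n"   # IndexError: list index out of range
    f"Candidate 3: {completions[2]}"
)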
2024-10-13 13:18:05,096 - ERROR - Error processing request: list index out of range
2024-10-13 13:18:05,097 - INFO - 192.168.1.247 - - [13/Oct/2024 13:18:05] "POST /v1/chat/completions HTTP/1.1" 500 -
In hindsight it seems easy to see, but only after I had debugged it. Maybe an explicit "unsupported" indicator in the log would be better.
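Something like the following guard, placed right after the completions are collected, would turn the confusing IndexError into an explicit message (just a suggestion with a hypothetical helper name, not existing optillm code):

def ensure_n_choices(response, n_requested: int) -> None:
    # Fail loudly when the backend silently ignores n and returns fewer
    # choices than requested (e.g. Ollama / llama.cpp backends).
    if len(response.choices) < n_requested:
        raise ValueError(
            f"Backend returned {len(response.choices)} completion(s) although "
            f"n={n_requested} was requested; multiple completions appear to be "
            f"unsupported by this endpoint."
        )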
Like you, I recently tried MOA, MCTS, BON, and PVG with Ollama, but encountered issues.
The primary reason is that Ollama, similar to llama.cpp, doesn't support multiple completions. (Ollama uses llama.cpp internally.)
I've forked optillm and made modifications to address this limitation. https://github.com/s-hironobu/optillm
While I can't guarantee the correctness of my solution, please try it out if you like.
If you try my version, please read: https://github.com/s-hironobu/optillm?tab=readme-ov-file#quick-start-with-ollama
I used to have an implementation similar to what you have below, but unfortunately it is not equivalent to doing multiple generations from the model with a single request, due to how decoding is done during response generation.
for _ in range(n):
    response = self.client.chat.completions.create(
        model=self.model,
        messages=messages,
        max_tokens=4096,
        n=1,
        temperature=1
    )
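For completeness, a standalone version of that loop that also collects the outputs (hypothetical helper name; as noted above, this is not equivalent to a single n=3 request due to how decoding is done during response generation):

def generate_n_completions(client, model, messages, n=3, max_tokens=4096):
    # Work around backends (like Ollama / llama.cpp) that ignore n > 1 by
    # issuing n separate requests and collecting the single choice from each.
    completions = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            n=1,
            temperature=1
        )
        completions.append(response.choices[0].message.content)
    return completions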
I will update the README to add a note that ollama also doesn't generate multiple responses. Thanks.
Yes, I added it in the same way for my run.
BTW: https://github.com/codelion/optillm/blob/41821aacdc70b9c6b65f4663eabbf3aa230cd37d/optillm/bon.py#L30
Here my LLM will not answer with only a number. I guess the user instruction has higher priority than the system prompt (depending on the prompts?). Adapting the last message worked, but it would be better to consider an in-string parsing approach or something similar.
Enhancement: I am working on simulating a streaming response (the final response currently feels like a plain copy-paste in Open WebUI since there is no streaming) and on the possibility of yielding the thinking process back in the response before the final answer, to make the wait feel less long.
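A rough sketch of what such a simulated stream could look like, assuming a Flask-style endpoint (the log lines above suggest a Werkzeug/Flask server, but the exact integration point in optillm is not shown here) and the OpenAI chat.completion.chunk SSE format; none of this is existing optillm code:

import json
import time
from flask import Response

def simulate_stream(final_text: str, model: str, chunk_size: int = 40) -> Response:
    # Re-chunk an already complete answer into OpenAI-style SSE chunks so that
    # clients like Open WebUI render it progressively instead of all at once.
    def generate():
        for i in range(0, len(final_text), chunk_size):
            chunk = {
                "object": "chat.completion.chunk",
                "model": model,
                "choices": [{
                    "index": 0,
                    "delta": {"content": final_text[i:i + chunk_size]},
                    "finish_reason": None,
                }],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
            time.sleep(0.02)  # small artificial delay; tune or drop as needed
        yield "data: [DONE]\n\n"

    return Response(generate(), mimetype="text/event-stream")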
Regarding "it would be better to consider an in-string parsing approach or something similar": how would this look? Do you have a better prompt that works?
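For illustration only, one possible shape of such an in-string extraction, i.e. pulling the rating out of a free-form reply instead of requiring the model to answer with a bare number (a hypothetical helper, not code from bon.py, and the 0-10 range is an assumption):

import re

def extract_rating(text: str, lo: int = 0, hi: int = 10) -> int:
    # Grab the first integer anywhere in the reply and clamp it to the
    # expected range; fall back to the lower bound if no number is found.
    match = re.search(r"-?\d+", text)
    if match is None:
        return lo
    return max(lo, min(hi, int(match.group())))

# Example: extract_rating("I would rate this response an 8 out of 10.") -> 8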