
(MOA) Fails with "List Index Out of Range" Error on OpenAI-Compatible Ollama API Endpoint

Open · chrisoutwright opened this issue Oct 13 '24 · 4 comments

Description:

Using the MOA approach against an OpenAI-compatible Ollama endpoint results in a list index out of range error. The request fails to return a valid response.

Steps to Reproduce:

  1. Backend Configuration (optillm.py):

    • Modified get_config() in optillm.py to set the API key and base URL explicitly, because passing them via command-line parameters did not work: optillm kept calling the OpenAI URL directly no matter which base path/URL was given:

              default_client = OpenAI(api_key="ollama", base_url="http://192.168.1.224:11434/v1")
      
      
  2. Code Snippet:

    import requests
    import json
    
    url = "http://192.168.1.247:8000/v1/chat/completions"
    payload = json.dumps({
        "model": "qwen2.5:72b-instruct-q4_K_S",
        "messages": [
            {"role": "user", "content": "<optillm_approach>moa</optillm_approach> Dwarf Fortress Production Chain are how many?"}
        ],
        "temperature": 0.2
    })
    headers = {
        'Authorization': 'Bearer ollama',
        'Content-Type': 'application/json'
    }
    
    response = requests.post(url, headers=headers, data=payload)
    print(response.text)
    
  3. Run the Code:

    • Execute the script using Python or curl manually (an OpenAI SDK equivalent of the same request is sketched below).
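
For completeness, the same request can also be made through the OpenAI SDK pointed at the optillm proxy. This is only a sketch of an equivalent call, reusing the endpoint, model, and prompt from the snippet above; the api_key mirrors the Bearer token used there.

    from openai import OpenAI

    # Same request as the script above, but via the OpenAI SDK pointed at
    # the optillm proxy (assumes the proxy is listening at this address).
    client = OpenAI(api_key="ollama", base_url="http://192.168.1.247:8000/v1")

    response = client.chat.completions.create(
        model="qwen2.5:72b-instruct-q4_K_S",
        messages=[
            {"role": "user", "content": "<optillm_approach>moa</optillm_approach> Dwarf Fortress Production Chain are how many?"}
        ],
        temperature=0.2,
    )
    print(response.choices[0].message.content)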

Expected Behavior:

The API should return a valid response containing the number of Dwarf Fortress Production Chains based on the MOA approach.

Actual Behavior:

The API returns an error message:

{
    "error": "list index out of range"
}

Additional Information:

  • API Endpoint: http://192.168.1.247:8000/v1/chat/completions (OpenAI-compatible ollama)
  • Model Used: qwen2.5:72b-instruct-q4_K_S
  • MOA Approach Tag: <optillm_approach>moa</optillm_approach>
  • Request Method: POST
  • Headers:
    • Authorization: Bearer ollama
    • Content-Type: application/json

Potential Causes:

  1. Incorrect model configuration.
  2. Data structure issues in the payload.
  3. API endpoint bug with MOA approach.

Suggested Fixes (and whom should this be addressed to?):

  1. Verify model compatibility with MOA and the OpenAI-compatible endpoint.
  2. Report the issue to Ollama API support for further investigation.

Thank you for addressing this issue.

chrisoutwright avatar Oct 13 '24 11:10 chrisoutwright

I have read that:

https://github.com/codelion/optillm/blob/193ab3c4d54f5f2e2c47525293bd7827b609675f/README.md?plain=1#L51

but it seems not all approaches are supported, even though Ollama is listed there?

chrisoutwright avatar Oct 13 '24 11:10 chrisoutwright

It seems like Ollama will not produce 3 completions here:

def mixture_of_agents(system_prompt: str, initial_query: str, client, model: str) -> str:
    moa_completion_tokens = 0
    completions = []

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": initial_query}
        ],
        max_tokens=60000, # 20000, # 4096,
        n=3,
        temperature=1
    )
    completions = [choice.message.content for choice in response.choices]
    moa_completion_tokens += response.usage.completion_tokens

At first the out-of-range error looked like it came from somewhere else, but it is raised directly in moa.py. Could the completions be generated in a loop instead?


2024-10-13 13:18:05,096 - ERROR - Error processing request: list index out of range
2024-10-13 13:18:05,097 - INFO - 192.168.1.247 - - [13/Oct/2024 13:18:05] "POST /v1/chat/completions HTTP/1.1" 500 -

In hindsight it seems easy to see, but only once I had debugged it. Logging something like "unsupported" would be a clearer indicator than the raw index error.
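
For example, a small guard along these lines (a hypothetical helper, not current optillm code) could turn the IndexError into an explicit "unsupported" message:

    def ensure_n_choices(response, n: int) -> None:
        """Fail loudly when the backend returns fewer choices than requested,
        instead of letting a later list index raise IndexError."""
        got = len(response.choices)
        if got < n:
            raise ValueError(
                f"Requested n={n} completions but the backend returned {got}; "
                "this endpoint likely does not support multiple completions."
            )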

chrisoutwright avatar Oct 13 '24 11:10 chrisoutwright

Like you, I recently tried MOA, MCTS, BON, and PVG with Ollama, but encountered issues.

The primary reason is that Ollama, similar to llama.cpp, doesn't support multiple completions. (Ollama uses llama.cpp internally.)
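
One quick way to confirm this is to request n=3 directly against the Ollama endpoint and count the returned choices. This is just a standalone probe (endpoint and model taken from the report above), not part of optillm:

    from openai import OpenAI

    # Probe whether an OpenAI-compatible endpoint honors n > 1.
    client = OpenAI(api_key="ollama", base_url="http://192.168.1.224:11434/v1")

    resp = client.chat.completions.create(
        model="qwen2.5:72b-instruct-q4_K_S",
        messages=[{"role": "user", "content": "Say hi."}],
        n=3,
        max_tokens=16,
    )
    print(len(resp.choices))  # 1 on Ollama/llama.cpp backends, per the behaviour above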

I've forked optillm and made modifications to address this limitation. https://github.com/s-hironobu/optillm

While I can't guarantee the correctness of my solution, please try it out if you like.

If you try my version, please read: https://github.com/s-hironobu/optillm?tab=readme-ov-file#quick-start-with-ollama

s-hironobu avatar Oct 13 '24 14:10 s-hironobu

I used to have an implementation similar to the one below, but unfortunately it is not equivalent to doing multiple generations from the model in a single request, due to how decoding is done during response generation.

            for _ in range(n):
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=messages,
                    max_tokens=4096,
                    n=1,
                    temperature=1
                )

I will update the README to add a note that ollama also doesn't generate multiple responses. Thanks.

codelion avatar Oct 14 '24 07:10 codelion

Yes, I added it in the same way for my run.

BTW: https://github.com/codelion/optillm/blob/41821aacdc70b9c6b65f4663eabbf3aa230cd37d/optillm/bon.py#L30

Here my LLM will not answer with only a number. I guess the user instruction takes higher priority than the system prompt (depending on how the prompts are combined?). Adapting the last message worked, but it would be better to consider an in-string parsing approach or similar.
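
For example, a small in-string parser like this (a hypothetical helper, not from optillm) would tolerate replies such as "I would rate this 8 out of 10" instead of requiring a bare number:

    import re

    def parse_rating(text: str, default: int = 0) -> int:
        """Extract the first integer from a free-form model reply."""
        match = re.search(r"-?\d+", text)
        return int(match.group()) if match else default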

Enhancement: I am also working on simulating a streaming response (the final response currently feels like a copy-paste in Open WebUI since nothing is streamed) and on the possibility of yielding the thinking process back in the response before the final answer, to make the wait feel shorter.

chrisoutwright avatar Oct 15 '24 22:10 chrisoutwright

But it would be better to consider an in-string parsing approach or similar.

What would this look like? Do you have a better prompt that works?

codelion avatar Oct 16 '24 01:10 codelion