
Number of generations (n) doesn't give n results on LM clients other than OpenAI (vLLM, Ollama)

Open AmoghM opened this issue 4 months ago • 4 comments

The param n determines the total number of generations, but it only works with OpenAI and not with vLLM or Ollama. From the Ollama logs, it appears that its /api/generate endpoint is hit 5 times for the example below, but the outputs are not aggregated into the completions list.

from typing import List
import dspy

vllm_llm = dspy.HFClientVLLM(model="meta-llama/meta-llama-3-8B-Instruct", port=8000, url="http://localhost", temperature=0.7)
dspy.settings.configure(lm=vllm_llm)

class SciGen(dspy.Signature):
    """context and answer on science questions"""
    question = dspy.InputField(desc="Input question")
    context = dspy.OutputField(desc="context") 
    answer = dspy.OutputField(desc="Answer using only the context")

context_generator = dspy.Predict(SciGen, n=5)
response = context_generator(question="Why do we have solar eclipse?")

print(len(response.completions)) # Outputs 5 for the OpenAI call but 1 for Ollama/vLLM

I can use a for loop to hit the endpoint multiple times, as mentioned in the docs, but I assume batching the requests would give better performance.

question = "Why do we have solar eclipse?"
for idx in range(5):
    response = context_generator(question=question, config=dict(temperature=0.4+0.0001*idx))
    print(f'{idx+1}.', response.completions)

AmoghM avatar Apr 27 '24 04:04 AmoghM

Hi @AmoghM, I believe this is related to a previously-surfaced issue with Ollama only returning the first completion, regardless of the specified n.

It seems like the Ollama client would need batched-request handling like what is already done for OpenAI.

The same likely applies to vLLM with respect to its corresponding batching configuration.

Feel free to open a PR to add support for batching!
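
For anyone picking this up, here is a rough sketch (plain requests code, not the actual dspy client; the model name and default Ollama port are assumptions) of what collecting n completions from Ollama's /api/generate endpoint could look like:

import requests

def ollama_generate_n(prompt, n=5, model="llama3",
                      url="http://localhost:11434/api/generate", temperature=0.7):
    """Hit /api/generate n times and keep every completion instead of only the first."""
    completions = []
    for _ in range(n):
        resp = requests.post(url, json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": temperature},
        })
        resp.raise_for_status()
        completions.append(resp.json()["response"])
    return completions

print(len(ollama_generate_n("Why do we have a solar eclipse?")))  # 5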

arnavsinghvi11 avatar Apr 27 '24 17:04 arnavsinghvi11

AFAICT the Ollama client basically loops over range(n) to make multiple requests, so that isn't what causes the extra completions to be dropped.

I think OpenAI is simply the best at adhering to the metaprompt; you can reproduce the problem behaviour with OpenAI by pushing the graph into the "fields not generated" fallback logic.

For example:

import dspy

llm = dspy.OpenAI(model='gpt-3.5-turbo', max_tokens=100)
dspy.settings.configure(lm=llm)

class SciGen(dspy.Signature):
    """context and answer on science questions"""
    question = dspy.InputField(desc="Input question")
    foo = dspy.OutputField(desc="foo_context") 
    bar = dspy.OutputField(desc="bar_context") 
    bazz = dspy.OutputField(desc="bazz_context") 
    buzz = dspy.OutputField(desc="buzz_context") 
    fuzz = dspy.OutputField(desc="fuzz_context") 
    fooz = dspy.OutputField(desc="fooz_context") 
    answer = dspy.OutputField(desc="Answer using only the contexts")

context_generator = dspy.Predict(SciGen, n=10)
response = context_generator(question="Why do we have solar eclipse?")

print(len(response.completions))
# 1

mikeedjones avatar Apr 28 '24 10:04 mikeedjones

@mikeedjones I don't completely understand how the "fields not generated" fallback gets triggered when we do dspy.Predict(SciGen, n=10), yet we get 10 responses if we do:

responses = []
for idx in range(10):
    responses.append(context_generator(question=question, config=dict(temperature=0.7+0.0001*idx)))

AmoghM avatar Apr 28 '24 20:04 AmoghM

It is triggered in the latter! The difference is that in the former, n is overwritten by the fallback logic, going from 10 to 1. In the latter, n=1 from the start, so its value isn't changed by the fallback logic.
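
For intuition, here is a toy, self-contained illustration (not dspy's actual internals; the field names and the check are made up) of how collapsing n to 1 in a fallback path yields a single completion even though 10 were requested:

# Pretend the LM returned 10 completions, none of which contain every required field:
completions = [{"context": f"ctx {i}"} for i in range(10)]  # "answer" was never generated
required_fields = ["context", "answer"]

if any(not all(field in c for field in required_fields) for c in completions):
    # The fallback re-prompts for the missing fields with n=1,
    # so the original request for 10 completions collapses to one.
    n = 1
    completions = completions[:n]

print(len(completions))  # 1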

mikeedjones avatar Apr 28 '24 22:04 mikeedjones