
(Fixable?) 400 Error with vLLM API - extra input

Open ccruttjr opened this issue 1 year ago

Howdy. When I run a vLLM server and then try to interact with it via HFClientVLLM, I get an error. Here is how to reproduce:

# Computer 1
pip install ray==2.20.0 vllm==0.4.2 dspy-ai==2.4.9 flash_attn==2.5.8
ray start --head --num-gpus 1
# Computer 2. Address will be Computer 1's IP address
pip install ray==2.20.0 vllm==0.4.2 dspy-ai==2.4.9 flash_attn==2.5.8
ray start --address='192.168.250.20:6379' --num-gpus 1
# Computer 1
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --host 0.0.0.0 --port 8000 -tp 2 --seed 42
# Any computer on network (including Computer 1 and 2). Address will be Computer 1's IP address
pip install dspy-ai==2.4.9 # If not already installed
python -c 'import dspy;lm = dspy.HFClientVLLM(model="facebook/opt-125m", port=8000, url="http://192.168.250.20", seed=42);dspy.configure(lm=lm);print(lm("Test"))'

If you only have one computer, this gives the same output:

pip install ray==2.20.0 vllm==0.4.2 dspy-ai==2.4.9 flash_attn==2.5.8
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --host 0.0.0.0 --port 8000 --seed 42
# Different tab
python -c 'import dspy;lm = dspy.HFClientVLLM(model="facebook/opt-125m", port=8000, url="http://localhost", seed=42);dspy.configure(lm=lm);print(lm("Test"))'

This gives me an output of

Failed to parse JSON response: {"object":"error","message":"[{'type': 'extra_forbidden', 'loc': ('body', 'port'), 'msg': 'Extra inputs are not permitted', 'input': 8000, 'url': 'https://errors.pydantic.dev/2.5/v/extra_forbidden'}, {'type': 'extra_forbidden', 'loc': ('body', 'url'), 'msg': 'Extra inputs are not permitted', 'input': ['http://192.168.250.10:8000'], 'url': 'https://errors.pydantic.dev/2.5/v/extra_forbidden'}]","type":"BadRequestError","param":null,"code":400}
Traceback (most recent call last):
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/hf_client.py", line 199, in _generate
    completions = json_response["choices"]
KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/hf.py", line 190, in __call__
    response = self.request(prompt, **kwargs)
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/lm.py", line 26, in request
    return self.basic_request(prompt, **kwargs)
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
  File "/home/daimyollc/anaconda3/envs/dspyVenv/lib/python3.10/site-packages/dsp/modules/hf_client.py", line 208, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server

Going into dsp/modules/hf_client.py, I tried commenting out this line (the pydantic error above shows the client forwarding its constructor kwargs, including port and url, into the request body, which the server's schema forbids):

payload = {
    "model": self.kwargs["model"],
    "prompt": prompt,
    # **kwargs, commented this line out!
}
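
A less drastic alternative might be to forward only known sampling parameters instead of dropping **kwargs entirely. This is just a sketch of that idea, not the merged fix, and the ALLOWED_SAMPLING_KEYS set is my own guess at what the endpoint accepts:

ALLOWED_SAMPLING_KEYS = {"temperature", "top_p", "max_tokens", "n", "stop", "seed"}

payload = {
    "model": self.kwargs["model"],
    "prompt": prompt,
    # Forward only sampling parameters; client-level kwargs such as
    # port and url stay out of the request body, so the server's
    # pydantic schema no longer rejects the request.
    **{k: v for k, v in kwargs.items() if k in ALLOWED_SAMPLING_KEYS},
}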

Now, when I run

python -c 'import dspy;lm = dspy.HFClientVLLM(model="facebook/opt-125m", port=8000, url="http://localhost", seed=42);dspy.configure(lm=lm);print(lm("Test"))'

It returns

['osterone, Muscle Trackers, Water Race, Anabolic Shifting, Tally']

yay!
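
To double-check the server itself, independently of dspy, a direct request to the OpenAI-compatible endpoint also works. A minimal sketch, assuming the single-computer setup above (requests is the only dependency):

import requests

# Send only the fields the /v1/completions schema allows, so the
# extra_forbidden validation error cannot occur.
r = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "facebook/opt-125m", "prompt": "Test", "max_tokens": 16},
)
print(r.json()["choices"][0]["text"])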

ccruttjr avatar May 09 '24 20:05 ccruttjr

I found this issue while looking for solutions to a similar-looking problem I am having with the latest dspy (2.4.9). However, the suggested workaround had no effect in my scenario. You might want to check whether your issue is already present in dspy 2.4.7.

Wolfsauge avatar May 14 '24 20:05 Wolfsauge

This issue should be solved by https://github.com/stanfordnlp/dspy/pull/1043

tom-doerr avatar May 22 '24 09:05 tom-doerr

TL;DR: I've tried the changes proposed in issue #1025 and PR #1043, and I can confirm that the fix works with dspy 2.4.9 and vllm 0.4.3, at least for the minimal example described below.

I was switching from ollama to vLLM in my dspy project and ended up hitting the same problem with dspy 2.4.9 and vllm 0.4.3. So I tried the bare-minimum example from the documentation:

Server:

 pixi r python -m vllm.entrypoints.api_server --trust-remote-code --model meta-llama/Llama-2-7b-hf --port 8081  

Code:

import dspy

model = "meta-llama/Llama-2-7b-hf"
lm = dspy.HFClientVLLM(model=model, port=8081, url="http://localhost")
dspy.configure(lm=lm)
qa = dspy.ChainOfThought('question -> answer')

response = qa(question="What is the capital of Paris?")
print(response.answer)

Output:

Failed to parse JSON response: {"detail":"Not Found"}
Traceback (most recent call last):
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 199, in _generate
    completions = json_response["choices"]
                  ~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jupyter/dev/funes/funes/main_test.py", line 23, in <module>
    response = qa(question="What is the capital of Paris?") #Prompted to vllm_llama2
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dspy/predict/predict.py", line 61, in __call__
    return self.forward(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dspy/predict/chain_of_thought.py", line 59, in forward
    return super().forward(signature=signature, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dspy/predict/predict.py", line 103, in forward
    x, C = dsp.generate(template, **config)(x, stage=self.stage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/primitives/predict.py", line 77, in do_generate
    completions: list[dict[str, Any]] = generator(prompt, **kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/hf.py", line 190, in __call__
    response = self.request(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/lm.py", line 26, in request
    return self.basic_request(prompt, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jupyter/dev/funes/.pixi/envs/default/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 208, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server
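
One way to narrow down where the {"detail":"Not Found"} comes from is to check which routes the server actually registers. A quick sketch, assuming the server command above is running on localhost:8081:

import requests

payload = {"prompt": "Test", "max_tokens": 8}
for path in ("/generate", "/v1/completions"):
    r = requests.post(f"http://localhost:8081{path}", json=payload)
    # 404 means this entrypoint does not serve that route;
    # any other status means the route exists.
    print(path, r.status_code)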

I tried removing the **kwargs params from the payload in hf_client.py as @ccruttjr suggested, but it didn't work. Then I tried the changes proposed in issue #1025 and PR #1043, and I can confirm that they work.

Hope to see this in a new release soon! Thanks!

Maybe there should be a new DSPy package release using the latest code in main; the PR that fixes this is already merged.

tom-doerr avatar Jun 07 '24 18:06 tom-doerr

Maybe there should be a new DSPy package release using the latest code in main; the PR that fixes this is already merged.

My dspy version is 2.4.17 and my vllm version is 0.5.4.

What fixes this issue, @tom-doerr?

brando90 avatar Sep 14 '24 00:09 brando90

This installs vllm, but the dspy vllm server fails:

pip install --upgrade pip
pip uninstall torchvision vllm vllm-flash-attn flash-attn xformers
pip install torch==2.2.1 vllm==0.4.1 

@tom-doerr any help?

brando90 avatar Sep 14 '24 00:09 brando90

@brando90 How do you have 2.4.17? Isn't the newest version 2.4.16? Do you get the exact same error message? If so, could you post the relevant code?

tom-doerr avatar Sep 14 '24 00:09 tom-doerr

@tom-doerr apologies, I no longer have access to the bash session from when I wrote that message; it was likely a typo. I can confirm I do have 2.4.16 though:

(uutils) brando9@skampere1~ $ pip list | grep dspy
dspy-ai                                 2.4.16

and vllm version is:

(uutils) brando9@skampere1~ $ pip list | grep vllm
vllm                                    0.4.1
vllm_nccl_cu12                          2.18.1.0.4.0
(uutils) brando9@skampere1~ $ pip list | grep torch
fast-pytorch-kmeans                     0.2.0.1
torch                                   2.2.1

FYI, my flash attention doesn't work:

INFO 09-13 18:56:14 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.

Thanks for taking the time to respond and help.

brando90 avatar Sep 14 '24 01:09 brando90

Could you post the error message you are getting? Do you use any less commonly used DSPy features?

tom-doerr avatar Sep 14 '24 02:09 tom-doerr

Could you post the error message you are getting? Do you use any less commonly used DSPy features?

@tom-doerr happy to! Here's the full output:

(snap_cluster_setup_py311) brando9@skampere1~ $ conda activate uutils
(uutils) brando9@skampere1~ $ python ~/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py

  0%|                                                                                                     | 0/3 [00:00<?, ?it/s]Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T02:04:17.834140Z [error    ] Failed to run or to evaluate example Example({'question': 'What is the capital of France?', 'answer': 'Paris'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f024e0cc0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T02:04:17.834909Z [error    ] Failed to run or to evaluate example Example({'question': "Who wrote '1984'?", 'answer': 'George Orwell'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f024e0cc0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
Failed to parse JSON response: {"detail":"Not Found"}
2024-09-14T02:04:17.835576Z [error    ] Failed to run or to evaluate example Example({'question': 'What is the boiling point of water?', 'answer': '100°C'}) (input_keys={'question'}) with <function exact_match_metric at 0x7f024e0cc0e0> due to Received invalid JSON response from server. [dspy.teleprompt.bootstrap] filename=bootstrap.py lineno=211
100%|████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 633.96it/s]
Bootstrapped 0 full traces after 3 examples in round 0.
Failed to parse JSON response: {"detail":"Not Found"}
Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 243, in _generate
    completions = json_response["choices"]
                  ~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 'choices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lfs/skampere1/0/brando9/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py", line 65, in <module>
    pred = compiled_simple_qa(my_question)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/primitives/program.py", line 26, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/ultimate-utils/py_src/uutils/dspy_uu/examples/full_toy_vllm_local_mdl.py", line 50, in forward
    prediction = self.generate_answer(question=question)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/primitives/program.py", line 26, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/chain_of_thought.py", line 36, in forward
    return self._predict(signature=signature, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 91, in __call__
    return self.forward(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 128, in forward
    completions = old_generate(demos, signature, kwargs, config, self.lm, self.stage)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dspy/predict/predict.py", line 155, in old_generate
    x, C = dsp.generate(template, **config)(x, stage=stage)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/primitives/predict.py", line 73, in do_generate
    completions: list[dict[str, Any]] = generator(prompt, **kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf.py", line 193, in __call__
    response = self.request(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/lm.py", line 27, in request
    return self.basic_request(prompt, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf.py", line 147, in basic_request
    response = self._generate(prompt, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/skampere1/0/brando9/miniconda/envs/uutils/lib/python3.11/site-packages/dsp/modules/hf_client.py", line 252, in _generate
    raise Exception("Received invalid JSON response from server")
Exception: Received invalid JSON response from server

brando90 avatar Sep 14 '24 02:09 brando90

vllm server running on the side:

(snap_cluster_setup_py311) brando9@skampere1~ $ conda activate uutils
(uutils) brando9@skampere1~ $ python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf --port 8080

INFO 09-13 19:04:13 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='meta-llama/Llama-2-7b-hf', speculative_config=None, tokenizer='meta-llama/Llama-2-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
INFO 09-13 19:04:13 utils.py:608] Found nccl from library /lfs/skampere1/0/brando9/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 09-13 19:04:14 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 09-13 19:04:14 selector.py:33] Using XFormers backend.
INFO 09-13 19:04:15 weight_utils.py:193] Using model weights format ['*.safetensors']
INFO 09-13 19:04:17 model_runner.py:173] Loading model weights took 12.5523 GB
INFO 09-13 19:04:18 gpu_executor.py:119] # GPU blocks: 7406, # CPU blocks: 512
INFO 09-13 19:04:19 model_runner.py:976] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-13 19:04:19 model_runner.py:980] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-13 19:04:23 model_runner.py:1057] Graph capturing finished in 4 secs.
INFO:     Started server process [348702]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
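
(Note: vllm.entrypoints.api_server is the legacy demo server and, as far as I know, only registers /generate. If the dspy client posts to the OpenAI-style /v1/completions route, that alone would produce the {"detail":"Not Found"} above. The OpenAI-compatible server would be started with something like the line below.)

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-7b-hf --port 8080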

brando90 avatar Sep 14 '24 02:09 brando90

Source code is here: https://github.com/brando90/ultimate-utils/blob/master/experiments/experiments/2024/september/09_13_2024.md

brando90 avatar Sep 14 '24 02:09 brando90

Don't see how that's related to the 400 error. I had the same issue you seem to have: https://github.com/stanfordnlp/dspy/issues/1041. This issue also seems relevant: https://github.com/stanfordnlp/dspy/issues/1242.

I don't have a solution for you. You could switch to a different model, check whether someone else has posted a solution to your problem (there are quite a few related issues), or switch to the experimental new DSPy 2.5, which has a new backend.
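
If you go the 2.5 route, the new backend talks to vLLM's OpenAI-compatible server directly. A sketch of what that looks like; the model name, port, and api_key value here are placeholders:

import dspy

# The 2.5-style LM client speaks the OpenAI protocol via LiteLLM,
# so it can point straight at vLLM's OpenAI-compatible server.
lm = dspy.LM(
    "openai/meta-llama/Llama-2-7b-hf",
    api_base="http://localhost:8080/v1",
    api_key="local",  # vLLM ignores the key unless --api-key is set
)
dspy.configure(lm=lm)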

tom-doerr avatar Sep 14 '24 02:09 tom-doerr

Don't see how that's related to the 400 error. I had the same issue you seem to have: #1041. This issue also seems relevant: #1242.

Darn, embarrassing; apologies. I must admit I've been quite sleep-deprived and commented on the wrong issue.

How do I install 2.5?

I can't find it here: https://github.com/stanfordnlp/dspy
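
(Maybe installing straight from main would work in the meantime? Something like pip install git+https://github.com/stanfordnlp/dspy.git, though I haven't tried it.)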

Thanks for all the help btw!

brando90 avatar Sep 14 '24 15:09 brando90