Curl requests not working
Hi there,
I had a question regarding working with the API Server from the instructions here.
I am running this after starting the container with the Docker command below (the NGC PyTorch image with CUDA 11.8):
docker run --gpus all -it --rm --shm-size=8g nvcr.io/nvidia/pytorch:22.12-py3
and then running these commands inside the container:
pip uninstall torch
pip install vllm
When running the default command python -m vllm.entrypoints.api_server, the server fails to start, returning:
INFO: Started server process [3820]
INFO: Waiting for application startup.
INFO: Application startup complete.
ERROR: [Errno 99] error while attempting to bind on address ('::1', 8000, 0, 0): cannot assign requested address
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
I try different ports and it still doesn't work.
However, when I pass a different host such as 127.0.0.1 via the --host parameter, the server starts and runs on Uvicorn, but I am unable to send curl requests to it.
I try this by opening a second terminal, running the Docker image and the rest of the commands above again, and then running both of the curl requests below. Both fail to connect, returning: curl: (7) Failed to connect to 127.0.0.1 port 8000: Connection refused. I also tried them outside of the container and the requests still fail.
curl -X POST http://127.0.0.1:8000/generate -H "Content-Type: application/json" -d '{ "prompt": "San Francisco is a", "use_beam_search": true, "n": 4, "temperature": 0 }'
and
curl -X POST http://localhost:8000/generate -H "Content-Type: application/json" -d '{
"prompt": "San Francisco is a",
"use_beam_search": true,
"n": 4,
"temperature": 0
}'
This also happens with the OpenAI-compatible server, so it would be great to get some support on sending requests to the server. Appreciate the help!
This looks like a Docker port publishing issue. Can you check whether you exposed port 8000 of your Docker container to your host? Reference. Or can you try to run curl inside the Docker container?
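For example, something along these lines (a rough sketch; <container_id> is a placeholder for whatever docker ps reports for your running container, and it assumes curl is available in the image):
docker port <container_id>
docker exec -it <container_id> curl http://127.0.0.1:8000/generate -H "Content-Type: application/json" -d '{"prompt": "San Francisco is a", "n": 1}'
The first command lists which container ports are published to the host; the second sends the request from inside the container, bypassing port publishing entirely.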
Hi,
Thanks for your response. Yeah, port 8000 was exposed, and I tried running the curl commands inside the Docker container, but they still could not reach the host.
I'm wondering whether the command python -m vllm.entrypoints.api_server is working as expected, because it was not for me.
Thanks!
Hi,
This worked for me:
python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000
And when you run the image, add the port as well:
docker run --gpus all -it --rm -p 8000:8000 --shm-size=8g nvcr.io/nvidia/pytorch:22.12-py3
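With the port published and the server bound to 0.0.0.0, a quick check from the host (a trimmed-down version of your earlier request) should go through:
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "San Francisco is a", "n": 1}'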
Hi Carlos,
Thanks for the help! I was able to connect to the server. However, I noticed that despite setting --gpus all, when I ran nvidia-smi in another terminal, it wasn't using all the GPUs. Is there another argument to pass in to make sure the models on the vLLM server use all GPUs?
Glad it helped. Sorry, I only have 1 GPU, so I cannot run that test.
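That said, vLLM shards a model across GPUs via tensor parallelism, so --tensor-parallel-size may be the argument you're after (untested on my side; this sketch assumes 4 GPUs are visible in the container):
python -m vllm.entrypoints.api_server --host 0.0.0.0 --port 8000 --tensor-parallel-size 4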
Closing this issue since it's a Docker issue rather than a vLLM issue. Feel free to re-open if there are any further issues.