
Can't change dtype inside vLLM settings

telekoteko opened this issue 1 year ago · 2 comments

LocalAI version: latest

Environment, CPU architecture, OS, and Version: Linux srv3 5.19.0-1010-nvidia-lowlatency #10-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 26 00:40:27 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Describe the bug: Can't set dtype='half' for vLLM through the .yaml config or docker run args.

To Reproduce: Create vllm.yaml inside the models folder:


name: vllm
backend: vllm
parameters:
  model: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

# Quantization method (optional)
quantization: "gptq"
# Limit GPU memory utilization (vLLM default is 0.9, i.e. 90%)
gpu_memory_utilization: 0.7
# Trust remote code from Hugging Face
trust_remote_code: true
# Uncomment to enable eager execution
# enforce_eager: true
# Uncomment to set the CPU swap space per GPU (in GiB)
# swap_space: 2
# Maximum length of a sequence (including prompt and output)
max_model_len: 32000
tensor_parallel_size: 8
cuda: true

Start LocalAI:

sudo docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v /opt/localai/models:/models --name localai quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Run inference:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "vllm",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Result: {"error":{"code":500,"message":"could not load model (no success): Unexpected err=ValueError('Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting thedtype flag in CLI, for example: --dtype=half.'), type(err)=\u003cclass 'ValueError'\u003e","type":""}}

Expected behavior: I should be able to set dtype='half' for vLLM.
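Something like the following is what I would expect to work. A minimal sketch: the dtype field here is hypothetical, mirroring vLLM's --dtype engine argument, and is exactly what LocalAI does not currently pick up:

name: vllm
backend: vllm
parameters:
  model: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
# hypothetical field, mirroring vLLM's --dtype=half; not honored at the time of writing
dtype: "float16"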

telekoteko · Mar 19 '24

Hello, thank you for reporting this issue. The problem you are encountering is related to the model's compatibility with your GPU's compute capability. The error message indicates that the bfloat16 datatype is only supported on GPUs with compute capability of at least 8.0, while your GPU has compute capability 7.5. Because vLLM defaults to the dtype declared in the model's config (bfloat16 for this model) and LocalAI currently provides no way to override it, loading fails.
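As a quick diagnostic, you can read the GPU's compute capability directly (a sketch assuming PyTorch is available in your environment, as it is inside the vLLM backend):

python -c "import torch; print(torch.cuda.get_device_capability())"
# prints (7, 5) for an RTX 2080 Ti, below the (8, 0) required for bfloat16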

As a workaround, you can use the float16 datatype instead by explicitly setting the dtype flag in the CLI, for example: --dtype=float16. Alternatively, you could run on a GPU with compute capability 8.0 or higher to use bfloat16.
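To verify that the model itself loads in float16, you could also run vLLM's standalone OpenAI-compatible server outside LocalAI. A sketch, assuming vLLM is installed directly on the host; the flags correspond to vLLM's engine arguments:

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --dtype half \
  --tensor-parallel-size 8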

The main issue seems to be that the vLLM integration does not correctly account for the GPU requirements of bfloat16; this should be fixed in an upcoming release.

I will move this issue to the enhancement / up-for-grabs area, since it's related to improving the vLLM integration.

Please let me know if you have any further questions!

I am a bot, an experiment of @mudler.

localai-bot · Apr 15 '24