FastChat
vllm_worker add multi-lora support
Why are these changes needed?
Add multi-LoRA support to vllm_worker. This feature has been available in vLLM since v0.3.2; this PR enables the capability in vllm_worker.
- Add a new argument `--lora-modules` to support defining multiple LoRA modules (a sketch of the overall flow follows this list).
- Auto-register the LoRA names as model_names, so the LoRA modules can be called via /v1/models and related APIs.
- Convert requests whose model name is a LoRA name into a vLLM LoRARequest, so vLLM runs inference with that LoRA adapter.
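Below is a minimal sketch of how these pieces could fit together, assuming a `name=path` format for `--lora-modules` and vLLM's `LoRARequest` class; the constructor fields have shifted between vLLM versions, so treat this as illustrative rather than the actual patch:

```python
# Hedged sketch of the idea behind this PR (not the exact diff): parse
# `--lora-modules name=path` pairs and map an incoming model name to a
# vLLM LoRARequest. The import path and constructor fields follow
# vLLM ~0.3.x and may differ in other versions.
from vllm.lora.request import LoRARequest


def parse_lora_modules(pairs):
    """Parse ["lora1=/path/to/lora1", ...] into {name: local_path}."""
    modules = {}
    for pair in pairs or []:
        name, _, path = pair.partition("=")
        modules[name.strip()] = path.strip()
    return modules


def lora_request_for(model_name, lora_modules):
    """Return a LoRARequest if model_name is a registered LoRA, else None."""
    if model_name not in lora_modules:
        return None  # plain base-model request, no adapter applied
    # lora_int_id must be a stable positive integer, unique per adapter
    lora_int_id = sorted(lora_modules).index(model_name) + 1
    return LoRARequest(model_name, lora_int_id, lora_modules[model_name])
```

The worker would then advertise both the base model names and the keys of the parsed mapping, so /v1/models lists each LoRA module as its own model.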
Related issue number (if applicable)
Closes #3107
Checks
- [x] I've run `format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed.
- [x] I've made sure the relevant tests are passing (if applicable).
Fixed an error when loading a model without `--enable-lora`:
https://github.com/wsvn53/FastChat/pull/1
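For reference, a hypothetical sketch of the kind of guard that fix implies (the names here are made up for illustration, not the actual diff):

```python
# Hypothetical guard, not the actual patch: only register LoRA modules as
# extra served model names when --enable-lora was passed, so loading a
# plain (non-LoRA) model no longer touches LoRA configuration at all.
if getattr(args, "enable_lora", False) and args.lora_modules:
    lora_modules = dict(pair.split("=", 1) for pair in args.lora_modules)
else:
    lora_modules = {}
served_model_names = list(args.model_names or []) + list(lora_modules)
```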
BTW, can anyone help review this PR? @merrymercy @infwinston
Hi, have you tested the new version in a multi-GPU environment? I tried it with the command below:
CUDA_VISIBLE_DEVICES=0,1 python3 -m fastchat.serve.vllm_worker --model-path /path/to/Meta-Llama-3-8B-Instruct --dtype float16 --model-names gpt-3.5-turbo-0613 --conv-template llama-3 --lora-dtype float16 --device cuda --num-gpus 2 --enable-lora --lora-modules lora1=/path/to/lora_module/
But the vllm_worker raised the error below:
RuntimeError: CUDA error: no kernel image is available for execution on the device
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
Here are the versions of my packages:
torch 2.3.0+cu121
fschat installed from the main branch, with the `vllm_worker.py` modifications from this PR
vllm 0.5.0.post1
I tried it on a machine with 2 × NVIDIA V100 16G GPUs.
Below is the full error log:
WARNING 06-20 23:15:29 config.py:1222] Casting torch.bfloat16 to torch.float16.
2024-06-20 23:15:32,459 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-20 23:15:33 config.py:623] Defaulting to use mp for distributed inference
INFO 06-20 23:15:33 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='//work/OpenLLMs/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='//work/OpenLLMs/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=//work/OpenLLMs/Meta-Llama-3-8B-Instruct)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-20 23:15:34 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-20 23:15:34 selector.py:51] Using XFormers backend.
INFO 06-20 23:15:38 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-20 23:15:38 selector.py:51] Using XFormers backend.
INFO 06-20 23:15:38 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 06-20 23:15:38 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-20 23:15:38 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-20 23:15:38 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-20 23:15:38 pynccl.py:63] vLLM is using nccl==2.20.5
Traceback (most recent call last):
File "//miniconda3/envs/fschat2/lib/python3.9/multiprocessing/resource_tracker.py", line 201, in main
cache[rtype].remove(name)
KeyError: '/psm_f2db1090'
INFO 06-20 23:15:39 custom_all_reduce_utils.py:179] reading GPU P2P access cache from //.config/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-20 23:15:39 custom_all_reduce_utils.py:179] reading GPU P2P access cache from //.config/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-20 23:15:39 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-20 23:15:39 selector.py:51] Using XFormers backend.
INFO 06-20 23:15:39 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-20 23:15:39 selector.py:51] Using XFormers backend.
INFO 06-20 23:15:42 model_runner.py:160] Loading model weights took 7.4829 GB
INFO 06-20 23:15:42 model_runner.py:160] Loading model weights took 7.4829 GB
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: CUDA error: no kernel image is available for execution on the device
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] , Traceback (most recent call last):
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return func(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] self.model_runner.profile_run()
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return func(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] self.execute_model(seqs, kv_caches)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return func(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 749, in execute_model
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] hidden_states = model_executable(
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 371, in forward
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 288, in forward
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] hidden_states, residual = layer(
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 223, in forward
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] hidden_states = self.input_layernorm(hidden_states)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return self._call_impl(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return forward_call(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/custom_op.py", line 13, in forward
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] return self._forward_method(*args, **kwargs)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/layers/layernorm.py", line 61, in forward_cuda
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] out = torch.empty_like(x)
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] RuntimeError: CUDA error: no kernel image is available for execution on the device
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226]
ERROR 06-20 23:15:43 multiproc_worker_utils.py:226]
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: Traceback (most recent call last):
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return _run_code(code, main_globals, None,
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/runpy.py", line 87, in _run_code
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: exec(code, run_globals)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//work/LLMAgent/FastChat/fastchat/serve/vllm_worker.py", line 371, in <module>
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: engine = cls(
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: self.engine = self._init_engine(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return engine_class(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 236, in __init__
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: self._initialize_kv_caches()
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: self.model_executor.determine_num_available_blocks())
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: self.model_runner.profile_run()
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: self.execute_model(seqs, kv_caches)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return func(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 749, in execute_model
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: hidden_states = model_executable(
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return self._call_impl(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return forward_call(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 371, in forward
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: hidden_states = self.model(input_ids, positions, kv_caches,
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return self._call_impl(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return forward_call(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 288, in forward
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: hidden_states, residual = layer(
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return self._call_impl(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return forward_call(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 223, in forward
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: hidden_states = self.input_layernorm(hidden_states)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return self._call_impl(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return forward_call(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/custom_op.py", line 13, in forward
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: return self._forward_method(*args, **kwargs)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: File "//miniconda3/envs/fschat2/lib/python3.9/site-packages/vllm/model_executor/layers/layernorm.py", line 61, in forward_cuda
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: out = torch.empty_like(x)
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: RuntimeError: CUDA error: no kernel image is available for execution on the device
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2024-06-20 23:15:43 | ERROR | stderr | [rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-06-20 23:15:43 | ERROR | stderr |
ERROR 06-20 23:15:44 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 267663 died, exit code: -15
INFO 06-20 23:15:44 multiproc_worker_utils.py:123] Killing local vLLM worker processes
2024-06-20 23:15:44 | INFO | stdout |
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]