MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument
Hi,
I am running vLLM on my 7900 XTX (gfx1100) with: vllm serve ./qwen2-vl-instruct-pytorch-7b --dtype auto --port 8000 --limit_mm_per_prompt image=4 --max_model_len 8784 --gpu_memory_utilization 0.9
But it then fails with the following errors:
$ vllm serve ./qwen2-vl-instruct-pytorch-7b --dtype auto --port 8000 --limit_mm_per_prompt image=4 --max_model_len 8784 --gpu_memory_utilization 0.9
WARNING 01-24 14:06:31 rocm.py:31] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
INFO 01-24 14:06:32 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-24 14:06:32 api_server.py:713] args: Namespace(subparser='serve', model_tag='./qwen2-vl-instruct-pytorch-7b', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='./qwen2-vl-instruct-pytorch-7b', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8784, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 4}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f04d0f26f80>)
INFO 01-24 14:06:32 api_server.py:199] Started engine process with PID 91849
INFO 01-24 14:06:40 config.py:510] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 01-24 14:06:40 config.py:1338] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 01-24 14:06:44 config.py:510] This model supports multiple tasks: {'score', 'classify', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 01-24 14:06:44 config.py:1338] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 01-24 14:06:44 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='./qwen2-vl-instruct-pytorch-7b', speculative_config=None, tokenizer='./qwen2-vl-instruct-pytorch-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8784, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=./qwen2-vl-instruct-pytorch-7b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 01-24 14:06:48 selector.py:134] Using ROCmFlashAttention backend.
INFO 01-24 14:06:48 model_runner.py:1094] Starting to load model ./qwen2-vl-instruct-pytorch-7b...
WARNING 01-24 14:06:48 registry.py:307] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:01<00:07, 1.87s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:06<00:10, 3.39s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:07<00:04, 2.16s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:11<00:03, 3.25s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:17<00:00, 4.13s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:17<00:00, 3.53s/it]
INFO 01-24 14:07:06 model_runner.py:1099] Loading model weights took 15.5083 GB
WARNING 01-24 14:07:06 model_runner.py:1279] Computed max_num_seqs (min(256, 8784 // 81920)) to be less than 1. Setting it to the minimum value of 1.
Token indices sequence length is longer than the specified maximum sequence length for this model (65536 > 32768). Running this sequence through the model will result in indexing errors
WARNING 01-24 14:07:10 processing.py:878] The context length (8784) of the model is too short to hold the multi-modal embeddings in the worst case (65536 tokens in total, out of which {'image': 65536} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument
After enabling export MIOPEN_ENABLE_LOGGING=1 and export MIOPEN_ENABLE_LOGGING_CMD=1, the log shows:
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP): tensorDesc = 0x7f563b65ee90
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP): tensorDesc = {}, {}, packed,
MIOpen(HIP): dataType = 5
MIOpen(HIP): nbDims = 5
MIOpen(HIP): dim.values = { 262144 3 2 14 14 }
MIOpen(HIP): stride.values = { 1176 392 196 14 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP): tensorDesc = 0xe00000002
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP): tensorDesc = {}, {}, packed,
MIOpen(HIP): dataType = 5
MIOpen(HIP): nbDims = 5
MIOpen(HIP): dim.values = { 1280 3 2 14 14 }
MIOpen(HIP): stride.values = { 1176 392 196 14 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP): tensorDesc = 0x7f563b679c6c
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP): tensorDesc = {}, {}, packed,
MIOpen(HIP): dataType = 5
MIOpen(HIP): nbDims = 5
MIOpen(HIP): dim.values = { 262144 1280 1 1 1 }
MIOpen(HIP): stride.values = { 1280 1 1 1 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateConvolutionDescriptor(miopenConvolutionDescriptor_t *){
MIOpen(HIP): convDesc = 0x7fffa5ad3c80
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenInitConvolutionNdDescriptor(miopenConvolutionDescriptor_t, int, const int *, const int *, const int *, miopenConvolutionMode_t){
MIOpen(HIP): convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP): spatialDim = 3
MIOpen(HIP): pads = { 0 0 0 }
MIOpen(HIP): strides = { 2 14 14 }
MIOpen(HIP): dilations = { 1 1 1 }
MIOpen(HIP): c_mode = 0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetConvolutionGroupCount(miopenConvolutionDescriptor_t, int){
MIOpen(HIP): convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP): groupCount = 1
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetConvolutionAttribute(miopenConvolutionDescriptor_t, const miopenConvolutionAttrib_t, const int){
MIOpen(HIP): convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP): attr = 1
MIOpen(HIP): value = 0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenConvolutionForwardGetWorkSpaceSize(miopenHandle_t, const miopenTensorDescriptor_t, const miopenTensorDescriptor_t, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, size_t *){
MIOpen(HIP): handle = stream: 0, device_id: 0
MIOpen(HIP): wDesc = {1280, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP): xDesc = {262144, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP): convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP): yDesc = {262144, 1280, 1, 1, 1}, {1280, 1, 1, 1, 1}, packed,
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenFindConvolutionForwardAlgorithm(miopenHandle_t, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, void *, const int, int *, miopenConvAlgoPerf_t *, void *, size_t, bool){
MIOpen(HIP): handle = stream: 0, device_id: 0
MIOpen(HIP): xDesc = {262144, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP): x = 0x7f4e03600000
MIOpen(HIP): wDesc = {1280, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP): w = 0x7f52cb400000
MIOpen(HIP): convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP): yDesc = {262144, 1280, 1, 1, 1}, {1280, 1, 1, 1, 1}, packed,
MIOpen(HIP): y = 0x7f4ddb400000
MIOpen(HIP): requestAlgoCount = 1
MIOpen(HIP): returnedAlgoCount = 32767
MIOpen(HIP): perfResults =
MIOpen(HIP): workSpace = 0x7f52c51ad600
MIOpen(HIP): workSpaceSize = 2352
MIOpen(HIP): exhaustiveSearch = 0
MIOpen(HIP): }
MIOpen(HIP): Command [LogCmdFindConvolution] ./bin/MIOpenDriver convbfp16 -n 262144 -c 3 --in_d 2 -H 14 -W 14 -k 1280 --fil_d 2 -y 14 -x 14 --pad_d 0 -p 0 -q 0 --conv_stride_d 2 -u 14 -v 14 --dilation_d 1 -l 1 -j 1 --spatial_dim 3 -m conv -g 1 -F 1 -t 1
MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument
MIOpen(HIP): auto miopen::solver::conv::GemmFwdRest::GetSolution(const ExecutionContext &, const ProblemDescription &)::(anonymous class)::operator()(const std::vector<Kernel> &)::(anonymous class)::operator()(const Handle &, const AnyInvokeParams &) const{
MIOpen(HIP): name + ", non 1x1" = convolution, non 1x1
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP): "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP): "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP): "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP): "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP): "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP): "rocBLAS" = rocBLAS
MIOpen(HIP): }
It then keeps printing MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){ / "rocBLAS" = rocBLAS / } over and over without stopping, and GPU utilization stays steady at 95% (rocm-smi output below):
========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Edge) (Avg) (Mem, Compute, ID)
====================================================================================================================
0 1 0x744c, 33510 40.0°C 182.0W N/A, N/A, 0 3119Mhz 96Mhz 0% auto 327.0W 81% 96%
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================
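For reference, the convolution that MIOpen fails on can be reproduced standalone. Below is a minimal sketch (my assumption is that this is the Qwen2-VL vision patch-embedding conv3d; the shapes, strides and bf16 dtype are taken directly from the MIOpenDriver command in the log above, everything else is illustrative):

```python
import torch

# Shapes/dtype from the logged MIOpenDriver command:
#   convbfp16 -n 262144 -c 3 --in_d 2 -H 14 -W 14 -k 1280 --fil_d 2 -y 14 -x 14
#   --conv_stride_d 2 -u 14 -v 14 --spatial_dim 3
device = "cuda"  # ROCm builds of PyTorch expose the GPU via the "cuda" device string

conv = torch.nn.Conv3d(
    in_channels=3,
    out_channels=1280,
    kernel_size=(2, 14, 14),
    stride=(2, 14, 14),
    bias=False,
).to(device=device, dtype=torch.bfloat16)

# 262144 patches of shape (3, 2, 14, 14): roughly 0.6 GB of bf16 input
x = torch.randn(262144, 3, 2, 14, 14, device=device, dtype=torch.bfloat16)

y = conv(x)      # on the failing setup, MIOpen raises "invalid configuration argument" here
print(y.shape)   # expected: torch.Size([262144, 1280, 1, 1, 1])
```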
$ rocminfo
ROCk module version 6.8.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4700
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32799352(0x1f47a78) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32799352(0x1f47a78) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32799352(0x1f47a78) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1100
Uuid: GPU-85631fd855c9cea1
Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2482
BDFID: 768
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 342
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Hi @Looong01. An internal ticket has been created to investigate your issue. Thanks!
Hi @Looong01, thank you for posting the issue. Can you please provide the following additional info?
- The rocm version. using amd-smi
- The version info for vLLM. which branch in https://github.com/ROCm/vllm?
- The link to download the qwen model you used (./qwen2-vl-instruct-pytorch-7b)
Thanks.
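For example, the version info can also be printed from Python inside your environment (an illustrative check; amd-smi output works just as well):

```python
import torch
import vllm

print("vLLM:", vllm.__version__)
print("PyTorch:", torch.__version__)   # ROCm builds carry a +rocmX.Y suffix
print("HIP:", torch.version.hip)       # HIP version the PyTorch build targets
```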
$ sudo amd-smi
usage: amd-smi [-h] ...
AMD System Management Interface | Version: 24.6.3+9578815 | ROCm version: 6.2.4 |
Platform: Linux Baremetal
- The latest version, and the main branch.
- https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
Hi @Looong01, can you please try a recent docker image:
docker pull rocm/vllm-dev:navi_nightly_main_20250120
I tested with the Qwen2-VL model and it works on my 7900 XTX:
vllm serve Qwen/Qwen2-VL-7B-Instruct --max_model_len 8784 --gpu_memory_utilization 0.9
To build the navi docker, please follow: https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html?device=rocm
Let us know your test results.
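For reference, a typical way to start that image with the GPU passed through looks like the following (these are the usual ROCm container flags, not something specific to this image; adjust mounts and paths to your setup):

```
docker run -it --network=host --ipc=host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  rocm/vllm-dev:navi_nightly_main_20250120

# then, inside the container:
vllm serve Qwen/Qwen2-VL-7B-Instruct --max_model_len 8784 --gpu_memory_utilization 0.9
```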
@Looong01, hope you are able to resolve your issue with the latest docker. I am closing the ticket, but feel free to comment or open another ticket if you are still experiencing any issues. Thanks!