MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument

Looong01 opened this issue 11 months ago · 6 comments

Hi,

I am running vLLM on my 7900 XTX (gfx1100). I launch it with `vllm serve ./qwen2-vl-instruct-pytorch-7b --dtype auto --port 8000 --limit_mm_per_prompt image=4 --max_model_len 8784 --gpu_memory_utilization 0.9`

But it then fails with the following errors:

$ vllm serve ./qwen2-vl-instruct-pytorch-7b --dtype auto --port 8000 --limit_mm_per_prompt image=4 --max_model_len 8784 --gpu_memory_utilization 0.9
WARNING 01-24 14:06:31 rocm.py:31] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
INFO 01-24 14:06:32 api_server.py:712] vLLM API server version 0.6.6.post1
INFO 01-24 14:06:32 api_server.py:713] args: Namespace(subparser='serve', model_tag='./qwen2-vl-instruct-pytorch-7b', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='./qwen2-vl-instruct-pytorch-7b', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8784, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt={'image': 4}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f04d0f26f80>)
INFO 01-24 14:06:32 api_server.py:199] Started engine process with PID 91849
INFO 01-24 14:06:40 config.py:510] This model supports multiple tasks: {'reward', 'score', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 01-24 14:06:40 config.py:1338] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 01-24 14:06:44 config.py:510] This model supports multiple tasks: {'score', 'classify', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 01-24 14:06:44 config.py:1338] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 01-24 14:06:44 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='./qwen2-vl-instruct-pytorch-7b', speculative_config=None, tokenizer='./qwen2-vl-instruct-pytorch-7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8784, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=./qwen2-vl-instruct-pytorch-7b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 01-24 14:06:48 selector.py:134] Using ROCmFlashAttention backend.
INFO 01-24 14:06:48 model_runner.py:1094] Starting to load model ./qwen2-vl-instruct-pytorch-7b...
WARNING 01-24 14:06:48 registry.py:307] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0`
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:01<00:07,  1.87s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:06<00:10,  3.39s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:07<00:04,  2.16s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:11<00:03,  3.25s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:17<00:00,  4.13s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:17<00:00,  3.53s/it]

INFO 01-24 14:07:06 model_runner.py:1099] Loading model weights took 15.5083 GB
WARNING 01-24 14:07:06 model_runner.py:1279] Computed max_num_seqs (min(256, 8784 // 81920)) to be less than 1. Setting it to the minimum value of 1.
Token indices sequence length is longer than the specified maximum sequence length for this model (65536 > 32768). Running this sequence through the model will result in indexing errors
WARNING 01-24 14:07:10 processing.py:878] The context length (8784) of the model is too short to hold the multi-modal embeddings in the worst case (65536 tokens in total, out of which {'image': 65536} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument

After I enable `export MIOPEN_ENABLE_LOGGING=1` and `export MIOPEN_ENABLE_LOGGING_CMD=1`, it shows:

MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0x7f563b65ee90
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 5
MIOpen(HIP):    nbDims = 5
MIOpen(HIP):    dim.values = { 262144 3 2 14 14 }
MIOpen(HIP):    stride.values = { 1176 392 196 14 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0xe00000002
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 5
MIOpen(HIP):    nbDims = 5
MIOpen(HIP):    dim.values = { 1280 3 2 14 14 }
MIOpen(HIP):    stride.values = { 1176 392 196 14 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateTensorDescriptor(miopenTensorDescriptor_t *){
MIOpen(HIP):    tensorDesc = 0x7f563b679c6c
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetTensorDescriptor(miopenTensorDescriptor_t, miopenDataType_t, int, const int *, const int *){
MIOpen(HIP):    tensorDesc = {}, {}, packed,
MIOpen(HIP):    dataType = 5
MIOpen(HIP):    nbDims = 5
MIOpen(HIP):    dim.values = { 262144 1280 1 1 1 }
MIOpen(HIP):    stride.values = { 1280 1 1 1 1 }
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenCreateConvolutionDescriptor(miopenConvolutionDescriptor_t *){
MIOpen(HIP):    convDesc = 0x7fffa5ad3c80
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenInitConvolutionNdDescriptor(miopenConvolutionDescriptor_t, int, const int *, const int *, const int *, miopenConvolutionMode_t){
MIOpen(HIP):    convDesc = conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1},
MIOpen(HIP):    spatialDim = 3
MIOpen(HIP):    pads = { 0 0 0 }
MIOpen(HIP):    strides = { 2 14 14 }
MIOpen(HIP):    dilations = { 1 1 1 }
MIOpen(HIP):    c_mode = 0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetConvolutionGroupCount(miopenConvolutionDescriptor_t, int){
MIOpen(HIP):    convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP):    groupCount = 1
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenSetConvolutionAttribute(miopenConvolutionDescriptor_t, const miopenConvolutionAttrib_t, const int){
MIOpen(HIP):    convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP):    attr = 1
MIOpen(HIP):    value = 0
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenConvolutionForwardGetWorkSpaceSize(miopenHandle_t, const miopenTensorDescriptor_t, const miopenTensorDescriptor_t, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, size_t *){
MIOpen(HIP):    handle = stream: 0, device_id: 0
MIOpen(HIP):    wDesc = {1280, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP):    xDesc = {262144, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP):    convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP):    yDesc = {262144, 1280, 1, 1, 1}, {1280, 1, 1, 1, 1}, packed,
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopenFindConvolutionForwardAlgorithm(miopenHandle_t, const miopenTensorDescriptor_t, const void *, const miopenTensorDescriptor_t, const void *, const miopenConvolutionDescriptor_t, const miopenTensorDescriptor_t, void *, const int, int *, miopenConvAlgoPerf_t *, void *, size_t, bool){
MIOpen(HIP):    handle = stream: 0, device_id: 0
MIOpen(HIP):    xDesc = {262144, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP):    x = 0x7f4e03600000
MIOpen(HIP):    wDesc = {1280, 3, 2, 14, 14}, {1176, 392, 196, 14, 1}, packed,
MIOpen(HIP):    w = 0x7f52cb400000
MIOpen(HIP):    convDesc = conv3d, miopenConvolution, miopenPaddingDefault, {0, 0, 0}, {2, 14, 14}, {1, 1, 1},
MIOpen(HIP):    yDesc = {262144, 1280, 1, 1, 1}, {1280, 1, 1, 1, 1}, packed,
MIOpen(HIP):    y = 0x7f4ddb400000
MIOpen(HIP):    requestAlgoCount = 1
MIOpen(HIP):    returnedAlgoCount = 32767
MIOpen(HIP):    perfResults =
MIOpen(HIP):    workSpace = 0x7f52c51ad600
MIOpen(HIP):    workSpaceSize = 2352
MIOpen(HIP):    exhaustiveSearch = 0
MIOpen(HIP): }
MIOpen(HIP): Command [LogCmdFindConvolution] ./bin/MIOpenDriver convbfp16 -n 262144 -c 3 --in_d 2 -H 14 -W 14 -k 1280 --fil_d 2 -y 14 -x 14 --pad_d 0 -p 0 -q 0 --conv_stride_d 2 -u 14 -v 14 --dilation_d 1 -l 1 -j 1 --spatial_dim 3 -m conv -g 1 -F 1 -t 1
MIOpen(HIP): Error [EvaluateInvokers] /MIOpen/src/hipoc/hipoc_kernel.cpp:106: Failed to launch kernel: invalid configuration argument
MIOpen(HIP): auto miopen::solver::conv::GemmFwdRest::GetSolution(const ExecutionContext &, const ProblemDescription &)::(anonymous class)::operator()(const std::vector<Kernel> &)::(anonymous class)::operator()(const Handle &, const AnyInvokeParams &) const{
MIOpen(HIP):    name + ", non 1x1" = convolution, non 1x1
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){
MIOpen(HIP):    "rocBLAS" = rocBLAS
MIOpen(HIP): }

And it keeps printing `MIOpen(HIP): miopenStatus_t miopen::CallGemm(const Handle &, GemmDescriptor, ConstData_t, std::size_t, ConstData_t, std::size_t, Data_t, std::size_t, GemmBackend_t){ MIOpen(HIP): "rocBLAS" = rocBLAS MIOpen(HIP): }` without stopping, and GPU utilization stays steadily at 95%.
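
For reference, the failing convolution can likely be reproduced outside vLLM. Below is a minimal PyTorch sketch built from the tensor shapes in the MIOpen log above (a batch of 262144 patches through a Conv3d patch-embedding layer with kernel and stride (2, 14, 14)); it assumes a ROCm build of PyTorch, where Conv3d dispatches to MIOpen, and the `bias=False` choice is an assumption not visible in the log:

```python
# Minimal sketch (not from the original report): reproduce the convolution that
# MIOpen rejects, using the shapes from the log above.
# Assumes a ROCm build of PyTorch, where nn.Conv3d dispatches to MIOpen.
import torch
import torch.nn as nn

device = "cuda"  # HIP devices are exposed through the CUDA API on ROCm builds

# Matches wDesc = {1280, 3, 2, 14, 14} with strides {2, 14, 14} and pads {0, 0, 0}.
patch_embed = nn.Conv3d(
    in_channels=3,
    out_channels=1280,
    kernel_size=(2, 14, 14),
    stride=(2, 14, 14),
    bias=False,  # assumption; the log does not show a bias tensor
).to(device=device, dtype=torch.bfloat16)

# Matches xDesc = {262144, 3, 2, 14, 14}; the input alone is roughly 0.6 GB in bf16.
x = torch.randn(262144, 3, 2, 14, 14, device=device, dtype=torch.bfloat16)

with torch.no_grad():
    y = patch_embed(x)  # expected to hit "invalid configuration argument" on the affected setup
print(y.shape)  # torch.Size([262144, 1280, 1, 1, 1]) if the launch succeeds
```

Shrinking the batch dimension (say 4096 instead of 262144) should show whether the failure is tied to the very large launch configuration implied by the 262144-row batch.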

Looong01 · Jan 24, 2025

========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device  Node  IDs              Temp    Power   Partitions          SCLK     MCLK   Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Avg)   (Mem, Compute, ID)
====================================================================================================================
0       1     0x744c,   33510  40.0°C  182.0W  N/A, N/A, 0         3119Mhz  96Mhz  0%   auto  327.0W  81%    96%
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================

Looong01 · Jan 24, 2025

$ rocminfo
ROCk module version 6.8.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  Uuid:                    CPU-XX
  Marketing Name:          Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4700
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            8
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    32799352(0x1f47a78) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32799352(0x1f47a78) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    32799352(0x1f47a78) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-85631fd855c9cea1
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2482
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 342
  SDMA engine uCode::      21
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Looong01 · Jan 24, 2025

Hi @Looong01. An internal ticket has been created to investigate your issue. Thanks!

ppanchad-amd · Jan 24, 2025

Hi @Looong01, thank you for posting the issue. Could you please provide the following additional information?

  • The ROCm version (from amd-smi)
  • The vLLM version: which branch of https://github.com/ROCm/vllm?
  • The link to download the Qwen model you used (./qwen2-vl-instruct-pytorch-7b)

Thanks.

huanrwan-amd · Jan 24, 2025

$ sudo amd-smi
usage: amd-smi [-h]  ...

AMD System Management Interface | Version: 24.6.3+9578815 | ROCm version: 6.2.4 |
Platform: Linux Baremetal

  1. vLLM: the latest version, from the main branch.

  2. Model: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct

Looong01 · Jan 24, 2025

Hi @Looong01, can you please try a recent docker image: `docker pull rocm/vllm-dev:navi_nightly_main_20250120`? I tested the Qwen2-VL model with it and it works on my 7900 XTX: `vllm serve Qwen/Qwen2-VL-7B-Instruct --max_model_len 8784 --gpu_memory_utilization 0.9`

To build the Navi docker image yourself, please follow: https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html?device=rocm

Let us know your test results.
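
For anyone verifying the fix, here is a minimal sketch (not from the thread) of a request that exercises the vision path once the server is up, using the OpenAI-compatible API that vLLM exposes; the model name must match whatever was passed to `vllm serve`, and the image URL is a placeholder:

```python
# Minimal sketch: send one image-bearing chat request to the running vLLM server.
# Assumes the server is listening on port 8000; model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # must match the model passed to `vllm serve`
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```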

huanrwan-amd · Jan 31, 2025

@Looong01, hope you were able to resolve your issue with the latest docker image. I am closing the ticket, but feel free to comment or open another one if you are still experiencing any issues. Thanks!

ppanchad-amd · Apr 29, 2025