
Qwen3-4B crashes at "Prepare for tuning opt" with QNN HTP backend on Windows ARM64

Open ITM527 opened this issue 1 month ago • 12 comments

We have followed the LLM-MNN documentation to run the Qwen3-4B-Instruct-2507 LLM on the Qualcomm NPU.
Host device: Linux (Ubuntu 22.04)
Target device: Windows ARM64 (Snapdragon X Elite)
QAIRT SDK: v2.38.0.250901

Steps followed:

  1. Loaded the 'Qwen3-4B-Instruct-2507' model from Hugging Face.
  2. Exported it to MNN using 'llmexport.py' with 4-bit quantization and seperate_embed = TRUE (we tried seperate_embed = FALSE as well).
  3. Built MNN on the host device using the flags specified in the doc (-DMNN_QNN=ON, -DMNN_QNN_CONVERT_MODE=ON, -DMNN_WITH_PLUGIN=OFF, -DMNN_BUILD_TOOLS=ON), with the QNN_SDK_ROOT path set to the QAIRT SDK directory.
  4. After the conversion build succeeded, we used the script generate_llm_qnn.py to convert the model into context-binary graphs for NPU compatibility. During this process we got a few errors at step 2, as follows:
 Load Cache file error.
 Broad cast error, dim1 = 1024, dim2 = 0
 Compute Shape Error for /Add_3_output_0
 Load Cache file error.
 Broad cast error, dim1 = 1024, dim2 = 0
 Compute Shape Error for /Add_8_output_0
 Load Cache file error.
 Broad cast error, dim1 = 1024, dim2 = 0
 Compute Shape Error for /Add_13_output_0
 Load Cache file error.
 Broad cast error, dim1 = 1024, dim2 = 0
 Compute Shape Error for /Add_18_output_0
 Load Cache file error.
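Concretely, steps 3 and 4 amount to something like the following (all paths are placeholders for our local setup; the CMake flags are the ones quoted from the doc, and soc_id=60 / dsp_arch=v73 are the Snapdragon X Elite values from Qualcomm's SoC table):

```shell
# Step 3: host (Linux) conversion build. Paths are placeholders.
export QNN_SDK_ROOT=/path/to/qairt/2.38.0.250901
cmake .. -DMNN_QNN=ON -DMNN_QNN_CONVERT_MODE=ON -DMNN_WITH_PLUGIN=OFF -DMNN_BUILD_TOOLS=ON
make -j$(nproc)

# Step 4: convert to QNN context-binary graphs for the target SoC.
python3 MNN/transformers/llm/export/npu/generate_llm_qnn.py \
    --model /path/to/Qwen3-4B-Instruct-2507-MNN \
    --soc_id=60 --dsp_arch=v73
```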

Despite that, all further steps completed successfully, and 38 '.bin' files along with config_qnn.json and llm.mnn were created as described in the doc.

  1. Now, on the target device, we built MNN using the flags suggested for offline-graph inference: -DMNN_QNN=ON, -DMNN_QNN_CONVERT_MODE=OFF, -DMNN_WITH_PLUGIN=ON.

  2. llm_demo.exe runs perfectly fine on the CPU with the command 'llm_demo.exe config.json'. But when we try the same on the NPU with 'llm_demo.exe config_qnn.json', it crashes:

PS C:\Models\Qwen3-4B-Instruct-2507-MNN> C:\Models\MNN\build_qnn_device\Release\llm_demo.exe .\config_qnn.json
The device supports: i8sdot:0, fp16:0, i8mm: 0, sve2: 0, sme2: 0
config path is .\config_qnn.json
main, 266, cost time: 3350.492920 ms
Prepare for tuning opt Begin
PS C:\Models\Qwen3-4B-Instruct-2507-MNN>

While monitoring during this run, we observed a sudden spike in NPU usage of up to 30%, which dropped back to 0% when the app crashed.

The crash report is as follows:

Faulting application name: llm_demo.exe, version: 0.0.0.0, time stamp: 0x691af709
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x0000000000000000
Faulting process id: 0x70D4
Faulting application start time: 0x1DC57AD30554F39
Faulting application path: C:\Models\MNN\build_qnn_device\Release\llm_demo.exe
Faulting module path: unknown
Report Id: c52808dd-a2ff-43bd-9cfe-08032637c780
Faulting package full name: 
Faulting package-relative application ID:

ALTERNATIVE APPROACH: To avoid the broadcast error, and following the documentation statement "Supports inference for conventional models of static shapes/finite shape combinations", we modified the export script and fixed MAX_HISTORY_LENGTH/CONTEXT_LENGTH so that the shapes stay static throughout. This resolves the broadcast error, but the 'Compiling QNN' step now fails: the model is split into only 2 graphs instead of 38, and these graphs are so large that they may not be compatible with QNN execution.
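As an aside, the "Broad cast error, dim1 = 1024, dim2 = 0" message looks like a dynamic dimension that resolved to 0 at shape-inference time. By analogy (illustrated here with NumPy broadcasting rules, not MNN's shape engine), a zero-length axis cannot broadcast against a fixed-length one:

```python
import numpy as np

a = np.zeros((1024,))  # fixed dimension, analogous to dim1 = 1024
b = np.zeros((0,))     # unresolved/empty dimension, analogous to dim2 = 0

try:
    _ = a + b  # shapes (1024,) and (0,) are incompatible under broadcasting
except ValueError as e:
    print("broadcast failed:", e)
```

Fixing the context length makes every dimension concrete at export time, which is presumably why the error disappears.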

Note:

  • On the target device, llm_demo.exe works perfectly on the CPU but llm_bench.exe fails. On the host device, both work perfectly on the CPU backend.
  • Different chunk sizes were tried, but all result in the same crash.

Is there a compatibility issue with Windows ARM64 devices that causes llm_bench.exe to fail and the NPU to crash even in llm_demo.exe?

ITM527 avatar Nov 22 '25 21:11 ITM527

Have you pushed the QNN SDK libraries to the device? Also, the broadcast error can be ignored.

Qxinyu avatar Nov 25 '25 07:11 Qxinyu

@Qxinyu Yes, we have installed an SDK compatible with the target device (the same version used on the host device for conversion). The QNN SDK path was correctly set in the working environment. Additionally, we also tried pushing the necessary SDK libraries into the working directory, but llm_demo.exe still crashes on execution.

ITM527 avatar Nov 25 '25 07:11 ITM527

You can modify line 81 in /source/backend/qnn/backend/QNNBackend.cpp from:

if ((QNN_GET_ERROR_CODE(qnnInterface.logCreate(logCallback, QNN_LOG_LEVEL_ERROR, &logHandle)) != QNN_SUCCESS) ||

to:

if ((QNN_GET_ERROR_CODE(qnnInterface.logCreate(logCallback, QNN_LOG_LEVEL_DEBUG, &logHandle)) != QNN_SUCCESS) ||

and then capture the logs during execution for more detailed diagnostics.

Qxinyu avatar Nov 25 '25 08:11 Qxinyu

@Qxinyu Thanks for your suggestion. We have modified line 81 in /source/backend/qnn/backend/QNNBackend.cpp and rebuilt MNN with the appropriate flags.

The detailed diagnostics are as follows:

C:\Models\MNN\build_git\Release>llm_demo.exe C:\Models\Qwen3-4B-Instruct-2507-MNN\config_qnn.json
The device supports: i8sdot:0, fp16:0, i8mm: 0, sve2: 0, sme2: 0
config path is C:\Models\Qwen3-4B-Instruct-2507-MNN\config_qnn.json
main, 266, cost time: 3341.393066 ms
Prepare for tuning opt Begin
 <I> QnnLog_create started.
 <V> Registered a new graph environment 0 with priority: 100, num hvx threads: 1001, num hmx threads: 1001
 <W> Initializing HtpProvider
 <V> Creating default router
 <V> RouterWindows creater
 <V> HTP: Initializing the router
 <V> Detected Snapdragon SOC Dynamic SDM with 1 SOCs
 <V> Allocating PlatformInfo struct size 120
 <V> Multicore support is unavailable
 <V> Force to use single core in default platformInfo when MultiCore is not supported, numHwDevices= 1
 <V> HTP: Initializing the graph registry
 <V> HTP: Initializing the context registry
 <V> HTP: Initializing the device registry
 <V> HTP: Initializing the tensor counter
 <V> HTP: setting isExitCalled to false
 <V> HTP: setting ssrInProgress to false
 <V> HTP: FinalCleanupFn fnPtr is nullptr
 <V> HTP: initializing mem registry
 <V> HTP: initializing mmap registry
 <V> HTP: initializing graphId registry
 <V> HTP: initializing va map fd registry
 <V> HTP: initializing graph va map fd registry
 <V> HTP: initializing debugFactory
 <V> HTP: Initializing the logger lifecycle manager
 <V> HTP: constructing bundle
 <V> Graph environment handle not opened as preparelib or driverlib is not yet loaded
 <V> Set default graph environment 0 remoteHandle 0
 <V> Opened default graph env, envRemoteHandle 0
 <I> exit with 0
 <I> exit with 0
 <V> HTP: initialization completed successfully
 <I> QnnLog_create exit.
 <I> QnnBackend_create started. backend = 0x5ffeb78
 <V> Oem key validation infra not found
 <V> Backend handle created: 1
 <V> Graph environment handle not opened as preparelib or driverlib is not yet loaded
 <V> Set default graph environment 0 remoteHandle 0
 <V> Opened default graph env, envRemoteHandle 0
 <I> QnnBackend_create done successfully. backend = 0x5ffeb78
 <I> QnnDevice_create started
 <V> Create device with id 0x1
 <V> Config not passed. Loading default platform info!
 <V> Setting default value for unsigned PD usage
 <V> DSP Driver Path: C:\WINDOWS\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79
 <I> First connection to QNN stub established!
 <V> Loading remote funcs
 <V> Getting effective domain ID of domain name cdsp
 <V> Effective cdsp_id is: 3, Session_id is: 0 for original Device Id: 0, DeviceId: 0, CoreId: 0, pdId: 0
 <E> DspTransport.openSession qnn_open failed, 0x80000406, prio 100
 <E> IDspTransport: Unable to load lib 0x80000406
 <E> DspTransport.getHandle failed, error 0x00000008
 <E> createDspTransportInstance failed to config transport object
 <I> queuesClose : SingleCoreSession already destroyed
 <E> error in creation of transport instance
 <W> Failed to create transport instance: 1002
 <W> Failed to load skel, error: 1002
 <W> Traditional path not available. Switching to user driver path
 <V> DriverLibLoader Loading HtpUsrDrv.dll
 <V> HTP User Driver Path: C:\WINDOWS\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79/HTP
 <V> Max API version supported by the driver = 1.4.2
 <V> Min API version supported by the driver = 1.0.0
 <V> QNN side interface version = 1.5.17
 <V> Driver interface requested size 520, filled 352
 <V> Driver capabilities size requested 216 size filled 116
 <V> Initializeing OpPackageManager log callback in HtpUsrDrv_setLogCallback
 <V> HtpUsrDrv_setLogLevel is called
 <V> Driver log level is set as: 5
 <V> HtpUsrDrv_setProfileCallback is called
 <V> Setting profile extended callback
 <V> HtpUsrDrv_getConfig is called
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2015
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2004
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2005
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2006
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2007
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2008
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2009
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2010
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2011
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2012
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2013
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2014
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2016
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2017
 <W> Incompatible profiling event. Consider upgrading Driver to support new profiling events - 2018
 <V> HtpUsrDrv_getBuildId is called
 <V> Driver build id: v2.30.2.250124135729_113467
 <W> HTP user driver is loaded. Switched to user driver path
 <V> Calling driver's API - deviceCreate
 <V> Calling transport createDeviceTransportInstance from driver
 <V> skel file path file:///C:\WINDOWS\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79\HTP\libQnnHtpV73SkelDrv.so?qnn_skel_handle_invoke&_modver=1.0&_dom=cdspDrv.so?qnn_skel_handle_invoke&_modver=1.0&_dom=cdsp
 <V> DSP Driver Path: C:\WINDOWS\System32\DriverStore\FileRepository\qcnspmcdm8380.inf_arm64_b31b1d855e0f5f79
 <I> First connection to QNN stub established!
 <V> Loading remote funcs
 <V> Getting effective domain ID of domain name cdsp
 <V> Effective cdsp_id is: 3, Session_id is: 0 for DeviceId: 0, CoreId: 0, pdId: 0
 <V> Transport session for deviceId 268435456 coreId 0 pdId 0 not found!
 <V> DeviceId 268435456 coreId 0 pdId 0 not present, insert a new entry 0000026AB0ECC3A0
 <V> rpcMemoryInit exits with 2, successfully initialized rpc memory
 <V> Successful rpcMemInit
 <V> rpcMemoryAlloc: 8 isInit 1
 <V> rpcMemoryAlloc: 136 isInit 1
 <D> Calling RPC transport with params 0000026AAFB40000 [8 B], 0000000000000000 [0 B], 0000026AAFB50000 [88 B]
 <V> Found transport session 0000026AB0ECC3A0 for deviceId 268435456 coreId 0 pdId 0!
 <D> qnn_transport_run time: 5 (ms)

 <V> rpcMemoryAlloc: 8 isInit 1
 <V> rpcMemoryAlloc: 8 isInit 1
 <D> Calling RPC transport with params 0000026AAFB40000 [8 B], 0000000000000000 [0 B], 0000026AAFB50000 [8 B]
 <V> Found transport session 0000026AB0ECC3A0 for deviceId 268435456 coreId 0 pdId 0!
 <D> qnn_transport_run time: 0 (ms)

 <V> New session config entry is found, value = 1
 <V> New session config value = 1
 <V> exits device initialization with  0
 <V> Calling driver's API - createGraphEnvHandle
 <V> Graph environments is not supported by current User Driver. Default environment will be used.
 <V> Set default graph environment 0 remoteHandle 0
 <V> Opened default graph env, envRemoteHandle 0
 <V> Calling driver's API - createGraphEnvHandle
 <V> Graph environments is not supported by current User Driver. Default environment will be used.
 <V> Successfully opened graph env handle, envId 0
 <V> Successfully opened graph environment, envId 0
 <V> Calling driver's API - setSkelLogLevel
 <V> HtpUsrDrv_setLogLevel is called
 <V> Setting skel log level from driver
 <V> Found transport session 0000026AB0ECC3A0 for deviceId 268435456 coreId 0 pdId 0!
 <D> qnn_transport_run time: 0 (ms)

 <V> setSkelLogLevel return 0
 <V> Setting OpPackageManager log level from driver
 <I> QnnDevice_create done. device = 0x1. status 0x0
 <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API
 <V> OpPackage log is handled by User Driver now, please call set log level API to set it
 <I> QnnDevice_getPlatformInfo started.
 <I> QnnDevice_getPlatformInfo done. status 0x0
 <V> OpPackage log is handled by User Driver now, nothing happens when calling terminate OpPackage log API

ITM527 avatar Nov 25 '25 08:11 ITM527

"<E> DspTransport.openSession qnn_open failed, 0x80000406, prio 100"
"<E> IDspTransport: Unable to load lib 0x80000406"
"<E> DspTransport.getHandle failed, error 0x00000008"
"<E> createDspTransportInstance failed to config transport object"
"<I> queuesClose : SingleCoreSession already destroyed"
"<E> error in creation of transport instance"

It seems the device couldn't find the library; perhaps the environment hasn't been set up properly.

https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-10/QNN_general_overview.html
Have you set the correct soc_id and dsp_arch as shown on the page above? We haven't tried running offline models on Qualcomm platforms under Windows yet.

Qxinyu avatar Nov 25 '25 09:11 Qxinyu

Yes @Qxinyu, we have set the correct 'soc_id' and 'dsp_arch' (60 and v73, respectively) as listed at the shared link.

Following the errors highlighted above, the backend also appears to have recovered from them:

"<W> Traditional path not available. Switching to user driver path"
"<V> DriverLibLoader Loading HtpUsrDrv.dll"
"<W> HTP user driver is loaded. Switched to user driver path"
"<I> QnnDevice_create done. device = 0x1. status 0x0"

Additionally, we have set the environment paths mentioned in the inference doc 'npu.md' for 'QNN_SDK_ROOT', 'QNN_ROOT' and 'HEXAGON_SDK_ROOT', while our working directory is the 'build' folder inside the cloned MNN repo.
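For completeness, that environment setup amounts to something like the following (all paths are placeholders for our local installs, not the documented defaults):

```shell
# Placeholder paths; substitute your actual SDK locations.
export QNN_SDK_ROOT=/opt/qcom/qairt/2.38.0.250901
export QNN_ROOT=$QNN_SDK_ROOT
export HEXAGON_SDK_ROOT=/opt/qcom/hexagon-sdk
```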

It would be helpful if you could try the Qwen3-4B-Instruct model on a Qualcomm platform under any Windows device, or let us know if there are plans to test and smooth out compatibility there.

Do you have a test application or project that we can deploy on Android to check whether it works with the NPU?

ITM527 avatar Nov 25 '25 09:11 ITM527

Is the application still crashing now? We will test on Qualcomm platforms with Windows. On the Android platform, we're also using llm_demo for testing.

Qxinyu avatar Nov 26 '25 02:11 Qxinyu

@Qxinyu Yes, it is still crashing. It would be helpful if you could test on Qualcomm platforms with Windows at your end and fix this (in case of any issue), or point out anything we're missing in our execution. Thank you.

ITM527 avatar Nov 26 '25 05:11 ITM527

@Qxinyu Just wanted to follow up on the issue discussed above, in case you've had a chance to look into it?

ITM527 avatar Dec 01 '25 05:12 ITM527

@Qxinyu Just an update: we have tried running it on an Android NPU (Snapdragon 8 Gen 3), following similar steps as in the Windows case, and it worked perfectly. Both executables, 'llm_demo' and 'llm_bench', run without any error. We hope for similar behaviour on the Windows device as well; it would be great if you could look into it.

ITM527 avatar Dec 02 '25 10:12 ITM527

I am getting "Compute Shape Error for qnn/graph0.bin". Could you explain why this happens?

zx104972 avatar Dec 05 '25 11:12 zx104972

I am getting "Compute Shape Error for qnn/graph0.bin". Could you explain why this happens?

Which model is failing for you? There is currently a known problem running qwen3-vl models: the generated model hits exactly this error. We will update the conversion tool later.

Qxinyu avatar Dec 08 '25 02:12 Qxinyu

@Qxinyu Just an update: we have tried running it on an Android NPU (Snapdragon 8 Gen 3), following similar steps as in the Windows case, and it worked perfectly. Both executables, 'llm_demo' and 'llm_bench', run without any error. We hope for similar behaviour on the Windows device as well; it would be great if you could look into it.

Hello, I just tried to run Qwen3-4B on an Android NPU (Snapdragon 8 Elite) using "python3 ~/MNN/transformers/llm/export/npu/generate_llm_qnn.py --model ~/models/mnn/qwen3_1_7b --soc_id=69 --dsp_arch=v79", but I still hit the same problems.

  1. When I tried to export the QNN model from MNN, I got the broadcast error:
     Load Cache file error.
     Broad cast error, dim1 = 1024, dim2 = 0
  2. If I ignore the error above and continue, push all the files to the mobile device, and run ./llm_demo model/qwen3_4b/qnn_config.json, the program just crashes with "Segmentation fault" and no other message. Could you help me figure out how to fix it?

TheLogan6 avatar Dec 11 '25 12:12 TheLogan6

Hello, I just tried to run Qwen3-4B on an Android NPU (Snapdragon 8 Elite) using "python3 ~/MNN/transformers/llm/export/npu/generate_llm_qnn.py --model ~/models/mnn/qwen3_1_7b --soc_id=69 --dsp_arch=v79", but I still hit the same problems.

  1. When I tried to export the QNN model from MNN, I got the broadcast error: Load Cache file error. Broad cast error, dim1 = 1024, dim2 = 0
  2. If I ignore the error above and continue, push all the files to the mobile device, and run ./llm_demo model/qwen3_4b/qnn_config.json, the program just crashes with "Segmentation fault" and no other message. Could you help me figure out how to fix it?

Have you specified the --separate-embed flag when exporting the model? It requires the embeddings_bf16.bin file to run with QNN.
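As a quick sanity check, you can verify the exported model directory contains the needed artifacts before running llm_demo. This is a hypothetical helper, not part of MNN; the file list is an assumption based on the files named in this thread:

```python
from pathlib import Path

# Files this thread indicates a QNN-ready export should contain; the exact
# set is an assumption based on the discussion above, not MNN documentation.
REQUIRED = ["llm.mnn", "config_qnn.json", "embeddings_bf16.bin"]

def missing_artifacts(model_dir: str) -> list[str]:
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

# Example: report what is missing before attempting NPU inference.
# missing = missing_artifacts("/path/to/Qwen3-4B-Instruct-2507-MNN")
# if missing:
#     print("export incomplete, missing:", missing)
```

If embeddings_bf16.bin is missing, re-export with the separate-embed option enabled.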

Qxinyu avatar Dec 11 '25 12:12 Qxinyu