
prefill model

Open cccclai opened this issue 1 year ago • 43 comments

Summary:

repro command

python -m executorch.examples.models.llama2.export_llama --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w

Passes with QNN 2.25 but fails with QNN 2.26.

Segfault stack trace:

[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in SAVE MODE.
[WARNING] [Qnn ExecuTorch]: Qnn API version 2.19.0 is used. The version is tested against 2.18.0.
[INFO] [Qnn ExecuTorch]: Running level=3 optimization.
AddressSanitizer:DEADLYSIGNAL
=================================================================
==1523599==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000020 (pc 0x7f1585ee38e2 bp 0x7f16d5ab8800 sp 0x7ffed19ab8b0 T0)
==1523599==The signal is caused by a READ memory access.
==1523599==Hint: address points to the zero page.
SCARINESS: 10 (null-deref)
    #0 0x7f1585ee38e2  (/home/chenlai/fbsource/third-party/qualcomm/qnn/qnn-2.26/lib/x86_64-linux-clang/libQnnHtp.so+0x2ce38e2) (BuildId: bc3ab8ddc89a0e65)
    #1 0x7f1585dd8926  (/home/chenlai/fbsource/third-party/qualcomm/qnn/qnn-2.26/lib/x86_64-linux-clang/libQnnHtp.so+0x2bd8926) (BuildId: bc3ab8ddc89a0e65)
    #2 0x7f15844d1161  (/home/chenlai/fbsource/third-party/qualcomm/qnn/qnn-2.26/lib/x86_64-linux-clang/libQnnHtp.so+0x12d1161) (BuildId: bc3ab8ddc89a0e65)
    #3 0x7f15844dcac6  (/home/chenlai/fbsource/third-party/qualcomm/qnn/qnn-2.26/lib/x86_64-linux-clang/libQnnHtp.so+0x12dcac6) (BuildId: bc3ab8ddc89a0e65)
    #4 0x7f15844d245b  (/home/chenlai/fbsource/third-party/qualcomm/qnn/qnn-2.26/lib/x86_64-linux-clang/libQnnHtp.so+0x12d245b) (BuildId: bc3ab8ddc89a0e65)
    #5 0x7f15b9bc7b21 in auto torch::executor::qnn::QnnInterface::qnn_backend_validate_op_config<void*, Qnn_OpConfig_t>(void*, Qnn_OpConfig_t) const fbcode/executorch/backends/qualcomm/runtime/backends/QnnFunctionInterface.h:39
    #6 0x7f15b9bc7682 in torch::executor::qnn::QnnBackend::BackendValidateOpConfig(Qnn_OpConfig_t const&) fbcode/executorch/backends/qualcomm/runtime/backends/QnnBackendCommon.h:41
    #7 0x7f15b9bc7115 in torch::executor::qnn::QnnManager::IsNodeSupportedByBackend(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&) fbcode/executorch/backends/qualcomm/runtime/QnnManager.cpp:450
    #8 0x7f15b9dd44ee in torch::executor::qnn::PyQnnManager::IsNodeSupportedByBackend(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&) fbcode/executorch/backends/qualcomm/aot/python/PyQnnManagerAdaptor.h:57
    #9 0x7f15b9e5b986 in pybind11::cpp_function::cpp_function<bool, torch::executor::qnn::PyQnnManager, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&, pybind11::name, pybind11::is_method, pybind11::sibling>(bool (torch::executor::qnn::PyQnnManager::*)(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&)::operator()(torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&) const fbsource/pybind11/pybind11.h:84
    #10 0x7f15b9e5b8b5 in bool pybind11::detail::argument_loader<torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&>::call_impl<bool, pybind11::cpp_function::cpp_function<bool, torch::executor::qnn::PyQnnManager, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&, pybind11::name, pybind11::is_method, pybind11::sibling>(bool (torch::executor::qnn::PyQnnManager::*)(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&)&, 0ul, 1ul, pybind11::detail::void_type>(torch::executor::qnn::PyQnnManager&&, std::integer_sequence<unsigned long, 0ul, 1ul>, pybind11::detail::void_type&&) && fbsource/pybind11/cast.h:2042
    #11 0x7f15b9e53831 in std::enable_if<!std::is_void<bool>::value, bool>::type pybind11::detail::argument_loader<torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&>::call<bool, pybind11::detail::void_type, pybind11::cpp_function::cpp_function<bool, torch::executor::qnn::PyQnnManager, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&, pybind11::name, pybind11::is_method, pybind11::sibling>(bool (torch::executor::qnn::PyQnnManager::*)(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&)&>(pybind11::cpp_function::cpp_function<bool, torch::executor::qnn::PyQnnManager, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&, pybind11::name, pybind11::is_method, pybind11::sibling>(bool (torch::executor::qnn::PyQnnManager::*)(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&)&) && fbsource/pybind11/cast.h:2014
    #12 0x7f15b9e53454 in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<bool, torch::executor::qnn::PyQnnManager, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&, pybind11::name, pybind11::is_method, pybind11::sibling>(bool (torch::executor::qnn::PyQnnManager::*)(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), bool, torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&, pybind11::name, pybind11::is_method, pybind11::sibling>(bool&&, torch::executor::qnn::PyQnnManager (*)(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::operator()(pybind11::detail::function_call&) const fbsource/pybind11/pybind11.h:193
    #13 0x7f15b9e530d3 in void pybind11::cpp_function::initialize<pybind11::cpp_function::cpp_function<bool, torch::executor::qnn::PyQnnManager, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&, pybind11::name, pybind11::is_method, pybind11::sibling>(bool (torch::executor::qnn::PyQnnManager::*)(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), bool, torch::executor::qnn::PyQnnManager*, std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&, pybind11::name, pybind11::is_method, pybind11::sibling>(bool&&, torch::executor::qnn::PyQnnManager (*)(std::vector<std::shared_ptr<torch::executor::qnn::OpWrapper>, std::allocator<std::shared_ptr<torch::executor::qnn::OpWrapper>>>&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::__invoke(pybind11::detail::function_call&) fbsource/pybind11/pybind11.h:170
    #14 0x7f15b9d8f707 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) fbsource/pybind11/pybind11.h:767
    #15 0x327141 in cfunction_call(_object*, _object*, _object*) (.__uniq.281047882695835599676768160755749362799) (/usr/local/fbcode/platform010/bin/python3.10+0x327141) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #16 0x349630 in _PyObject_MakeTpCall (/usr/local/fbcode/platform010/bin/python3.10+0x349630) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #17 0x5897d4 in method_vectorcall(_object*, _object* const*, unsigned long, _object*) (.__uniq.243338978568352371442406765225626566013.llvm.6236606370933165261) (/usr/local/fbcode/platform010/bin/python3.10+0x5897d4) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #18 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #19 0x331421 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x331421) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #20 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #21 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #22 0x3313f2 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3313f2) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #23 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #24 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #25 0x3313f2 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3313f2) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #26 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #27 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #28 0x3313f2 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3313f2) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #29 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #30 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #31 0x331577 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x331577) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #32 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #33 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #34 0x3313f2 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3313f2) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #35 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #36 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #37 0x3313f2 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3313f2) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #38 0x39b8ca in _PyEval_Vector (/usr/local/fbcode/platform010/bin/python3.10+0x39b8ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #39 0x39ad7d in _PyObject_FastCallDictTstate (/usr/local/fbcode/platform010/bin/python3.10+0x39ad7d) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #40 0x3c8b72 in slot_tp_call(_object*, _object*, _object*) (.__uniq.235726554139783955843240177532338160225) (/usr/local/fbcode/platform010/bin/python3.10+0x3c8b72) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #41 0x392ca8 in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x392ca8) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #42 0x3314ca in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3314ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #43 0x39b8ca in _PyEval_Vector (/usr/local/fbcode/platform010/bin/python3.10+0x39b8ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #44 0x331b18 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x331b18) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #45 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #46 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #47 0x3314ca in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3314ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #48 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #49 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #50 0x3313f2 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3313f2) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #51 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #52 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #53 0x3313f2 in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3313f2) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #54 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #55 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #56 0x3314ca in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3314ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #57 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #58 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #59 0x3314ca in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3314ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #60 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #61 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #62 0x3314ca in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3314ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #63 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #64 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #65 0x3314ca in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3314ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #66 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #67 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #68 0x3314ca in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3314ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #69 0x327547 in _PyFunction_Vectorcall (/usr/local/fbcode/platform010/bin/python3.10+0x327547) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #70 0x3928df in call_function(_ts*, PyTraceInfo*, _object***, long, _object*) (.__uniq.79849310599369217189729546442812793949) (/usr/local/fbcode/platform010/bin/python3.10+0x3928df) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #71 0x3314ca in _PyEval_EvalFrameDefault (/usr/local/fbcode/platform010/bin/python3.10+0x3314ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #72 0x39b8ca in _PyEval_Vector (/usr/local/fbcode/platform010/bin/python3.10+0x39b8ca) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #73 0x431565 in PyEval_EvalCode (/usr/local/fbcode/platform010/bin/python3.10+0x431565) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #74 0x431447 in run_mod(_mod*, _object*, _object*, _object*, PyCompilerFlags*, _arena*) (.__uniq.251861886623903963524397139660542440724.llvm.17622910512627074885) (/usr/local/fbcode/platform010/bin/python3.10+0x431447) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #75 0x4e3054 in pyrun_file(_IO_FILE*, _object*, int, _object*, _object*, int, PyCompilerFlags*) (.__uniq.251861886623903963524397139660542440724) (/usr/local/fbcode/platform010/bin/python3.10+0x4e3054) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #76 0x4e2b54 in _PyRun_SimpleFileObject (/usr/local/fbcode/platform010/bin/python3.10+0x4e2b54) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #77 0x4e28f1 in _PyRun_AnyFileObject (/usr/local/fbcode/platform010/bin/python3.10+0x4e28f1) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #78 0x4d4a54 in Py_RunMain (/usr/local/fbcode/platform010/bin/python3.10+0x4d4a54) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #79 0x4d286b in pymain_main(_PyArgv*) (.__uniq.297908980262787110426434251325078884054) (/usr/local/fbcode/platform010/bin/python3.10+0x4d286b) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #80 0x4d2759 in Py_BytesMain (/usr/local/fbcode/platform010/bin/python3.10+0x4d2759) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
    #81 0x7f19e282c656 in __libc_start_call_main (/usr/local/fbcode/platform010/lib/libc.so.6+0x2c656) (BuildId: 93cdceeb8322234c38e1f2c93ad0ff10c7632fa6)
    #82 0x7f19e282c717 in __libc_start_main@GLIBC_2.2.5 (/usr/local/fbcode/platform010/lib/libc.so.6+0x2c717) (BuildId: 93cdceeb8322234c38e1f2c93ad0ff10c7632fa6)
    #83 0x553d90 in _start (/usr/local/fbcode/platform010/bin/python3.10+0x553d90) (BuildId: a620038add613fd8585eb50983ca8e455d54738e)
AddressSanitizer can not provide additional info.
AddressSanitizer: SEGV (/home/chenlai/fbsource/third-party/qualcomm/qnn/qnn-2.26/lib/x86_64-linux-clang/libQnnHtp.so+0x2ce38e2) (BuildId: bc3ab8ddc89a0e65)
==1523599==ABORTING

Differential Revision: D63736779

cccclai avatar Oct 02 '24 01:10 cccclai

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5807

Note: Links to docs will display an error until the docs builds have been completed.

:x: 13 New Failures

As of commit 35387f6de7c731e8d3f52ce504c2abd912c6f096 with merge base 13408b9848b1a776f03ff0fbac5f18b6347ff64a:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Oct 02 '24 01:10 pytorch-bot[bot]

This pull request was exported from Phabricator. Differential Revision: D63736779

facebook-github-bot avatar Oct 02 '24 01:10 facebook-github-bot

@shewu-quic @haowhsu-quic @chunit-quic

chiwwang avatar Oct 02 '24 04:10 chiwwang

We will try to reproduce this on our side.

chiwwang avatar Oct 02 '24 04:10 chiwwang

Hi @cccclai, thanks for this PR. I could also reproduce the error. Regarding the segmentation fault for the per-channel 16a4w linear in QNN 2.26, we also see it in our unit tests. We will investigate this issue further. If possible, could you use the convert_linear_to_conv pass to work around it?
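For reference, here is a minimal sketch of the idea behind such a pass, written as a plain nn.Module transform (illustrative only, not the exact executorch implementation): every nn.Linear is swapped for an equivalent 1x1 Conv2d, so the backend sees conv instead of the problematic per-channel 16a4w linear.

import torch
import torch.nn as nn

class _LinearAsConv2d(nn.Module):
    """Wraps a 1x1 conv so callers can keep passing (..., in_features) tensors."""

    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        leading = x.shape[:-1]
        x = x.reshape(-1, x.shape[-1], 1, 1)   # (N, C_in, 1, 1)
        x = self.conv(x)                       # (N, C_out, 1, 1)
        return x.reshape(*leading, -1)         # restore leading dims

def convert_linear_to_conv2d(module: nn.Module) -> nn.Module:
    """Replace every nn.Linear with an equivalent 1x1 Conv2d, recursively.

    y = x @ W.T + b is exactly a 1x1 convolution over (N, in_features, 1, 1),
    so the weights and bias can be copied over unchanged.
    """
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            conv = nn.Conv2d(child.in_features, child.out_features,
                             kernel_size=1, bias=child.bias is not None)
            # Linear weight is (out, in); Conv2d expects (out, in, 1, 1).
            conv.weight.data.copy_(child.weight.data.view(*child.weight.shape, 1, 1))
            if child.bias is not None:
                conv.bias.data.copy_(child.bias.data)
            setattr(module, name, _LinearAsConv2d(conv))
        else:
            convert_linear_to_conv2d(child)
    return module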

shewu-quic avatar Oct 02 '24 06:10 shewu-quic

We also need to check why the matmul is quantized to an unsupported schema. Maybe something is wrong in our QnnQuantizer?

chiwwang avatar Oct 02 '24 06:10 chiwwang

Sadly, the segmentation fault in linear was detected in the 2.26~2.27 timeframe. The fix has not been released yet; the ETA is QNN 2.28, at the end of October.

chiwwang avatar Oct 02 '24 06:10 chiwwang

We also need to check why the matmul is quantized to an unsupported schema. Maybe something is wrong in our QnnQuantizer?

Hi @cccclai, @chiwwang,

It seems that I could not reproduce the op validation failure for the matmul op on my end when using QNN 2.26 with the convert_linear_to_conv pass added. Below is my call sequence.

./install_requirements.sh
cp schema/*.fbs exir/_serialize/
export PYTHONPATH=/local/mnt/workspace/test_cc/
export ANDROID_NDK=/local/mnt/workspace/shewu/android-ndk-r26c
export QNN_SDK_ROOT=/local/mnt/workspace/shewu/qairt/2.26.0.240828
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang
./backends/qualcomm/scripts/build.sh
python -m executorch.examples.models.llama2.export_llama --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w

shewu-quic avatar Oct 02 '24 07:10 shewu-quic

This pull request was exported from Phabricator. Differential Revision: D63736779

facebook-github-bot avatar Oct 02 '24 19:10 facebook-github-bot

I updated the PR to use the linear-to-conv pass, since the segfault can be reproduced now. Here is the latest log: prefill_qnn.log

I can see that matmul fails to lower:

...
[QNN Partitioner Op Support]: aten.convolution.default | True
[QNN Partitioner Op Support]: aten.permute_copy.default | True
[QNN Partitioner Op Support]: aten.unsqueeze_copy.default | True
[QNN Partitioner Op Support]: aten.view_copy.default | True
[QNN Partitioner Op Support]: aten.permute_copy.default | True
[QNN Partitioner Op Support]: aten.matmul.default | False
[QNN Partitioner Op Support]: aten._softmax.default | True
[QNN Partitioner Op Support]: aten.add.Tensor | True
[QNN Partitioner Op Support]: aten.slice_copy.Tensor | True
[QNN Partitioner Op Support]: aten.slice_copy.Tensor | True
[QNN Partitioner Op Support]: aten.div.Tensor | True
[QNN Partitioner Op Support]: aten.matmul.default | False
...

cccclai avatar Oct 02 '24 20:10 cccclai

I suddenly realized this is in the AOT stage, so the mismatch between the QNN libraries and executorch (maybe QnnPyXXXXX.so) should be caused by a mismatch between QNN_SDK_ROOT and LD_LIBRARY_PATH... it's not on the device yet 😨

chiwwang avatar Oct 03 '24 00:10 chiwwang

I suddenly realized this is in the AOT stage, so the mismatch between the QNN libraries and executorch (maybe QnnPyXXXXX.so) should be caused by a mismatch between QNN_SDK_ROOT and LD_LIBRARY_PATH... it's not on the device yet 😨

Yeah... it is still AOT, not on device yet.

cccclai avatar Oct 03 '24 00:10 cccclai

I double-checked again, and it looks like I can lower matmul in the OSS flow but not the internal buck flow. I guess I can work around it for now...

cccclai avatar Oct 03 '24 02:10 cccclai

I double-checked again, and it looks like I can lower matmul in the OSS flow but not the internal buck flow. I guess I can work around it for now...

I'm also stuck in the buck build flow. Let me submit a comment below. I aim to add the SoC information to the QC backend. I remember we moved these things to Python.

chiwwang avatar Oct 03 '24 02:10 chiwwang

We also need to check why the matmul is quantized to an unsupported schema. Maybe something is wrong in our QnnQuantizer?

Hi @cccclai, @chiwwang,

It seems that I could not reproduce the op validation failure for the matmul op on my end when using QNN 2.26 with the convert_linear_to_conv pass added. Below is my call sequence.

./install_requirements.sh
cp schema/*.fbs exir/_serialize/
export PYTHONPATH=/local/mnt/workspace/test_cc/
export ANDROID_NDK=/local/mnt/workspace/shewu/android-ndk-r26c
export QNN_SDK_ROOT=/local/mnt/workspace/shewu/qairt/2.26.0.240828
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang
./backends/qualcomm/scripts/build.sh
python -m executorch.examples.models.llama2.export_llama --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w

It should be ANDROID_NDK_ROOT instead of ANDROID_NDK.

chiwwang avatar Oct 03 '24 02:10 chiwwang

Hi @cccclai, I added the SoC here: https://github.com/cccclai/executorch-1/pull/1. I ran a trivial model with soc_model=SSG2115P on an SM8550 and it seems OK. I will test the command shared here.

[update]

python -m executorch.examples.models.llama2.export_llama --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w --soc_model SSG2115P

seems to work 😮

chiwwang avatar Oct 03 '24 03:10 chiwwang

Hi @cccclai, I added a PR to quantize the embedding op and 16x8 matmul. I ran this model, and it fully delegates. If you have any problems, please let me know.

python -m executorch.examples.models.llama2.export_llama --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w

shewu-quic avatar Oct 03 '24 06:10 shewu-quic

Hi @cccclai, I added a PR to quantize the embedding op and 16x8 matmul. I ran this model, and it fully delegates. If you have any problems, please let me know.

python -m executorch.examples.models.llama2.export_llama --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w

Hey @shewu-quic, is there a PR link? Also, if you could add more description to help us understand how you achieved this, that would be great 😄

chiwwang avatar Oct 03 '24 06:10 chiwwang

Oh~ sure, let me add more description to this PR. For the 16x8 matmul op, I think it can be divided into two cases according to whether the KV cache is used:
- If the KV cache is used, we can annotate the past KV (input), i.e. the second input of matmul, with 8 bits to reduce the size of the input tensor.
- If the KV cache is not used, I think we can just annotate matmul as 16x8 to improve performance.

By default, we annotate matmul as 16x16 in 16-bit quantization, and we can override that with add_custom_quant_annotations.
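A minimal sketch of what such an override could look like with the stock PT2E quantizer API (the specs here are illustrative; the exact dtypes and observers in QnnQuantizer may differ):

import torch
from torch.ao.quantization.observer import MinMaxObserver
from torch.ao.quantization.quantizer import (
    QuantizationAnnotation,
    QuantizationSpec,
)

# Illustrative activation specs; a 16-bit range carried in an int32 container.
act_16bit = QuantizationSpec(
    dtype=torch.int32, quant_min=0, quant_max=65535,
    qscheme=torch.per_tensor_affine,
    observer_or_fake_quant_ctr=MinMaxObserver,
)
act_8bit = QuantizationSpec(
    dtype=torch.uint8, quant_min=0, quant_max=255,
    qscheme=torch.per_tensor_affine,
    observer_or_fake_quant_ctr=MinMaxObserver,
)

def annotate_matmul_16a8w(graph_module: torch.fx.GraphModule) -> None:
    """Annotate matmul with a 16-bit lhs and an 8-bit rhs (the past-KV input)."""
    for node in graph_module.graph.nodes:
        if node.target == torch.ops.aten.matmul.default:
            lhs, rhs = node.args
            node.meta["quantization_annotation"] = QuantizationAnnotation(
                input_qspec_map={lhs: act_16bit, rhs: act_8bit},
                output_qspec=act_16bit,
                _annotated=True,
            )

Passing a function like this through add_custom_quant_annotations would let it run on top of the default 16x16 annotation.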

shewu-quic avatar Oct 03 '24 07:10 shewu-quic

So it's a "custom annotation", based mostly on the topology of the graph, right? We look into the graph and choose the nodes to annotate, which gives us the 16x8 matmul. Do I understand correctly?

chiwwang avatar Oct 03 '24 07:10 chiwwang

So it's a "custom annotation", based mostly on the topology of the graph, right? We look into the graph and choose the nodes to annotate, which gives us the 16x8 matmul. Do I understand correctly?

Yes, that's right. After applying the custom annotation, you get the graph below.

                                                                  q (16 bits) -> dq (16 bits)--\
                                                                                                 matmul -> q (16 bits) -> dq (16 bits)
q (16 bits) -> dq (16 bits) -> op -> q (16 bits) -> dq (16 bits) -> q (8 bits) -> dq (8 bits)--/ 

For the q (16 bits) -> dq (16 bits) -> q (8 bits) -> dq (8 bits) pattern, we tag a requantize on the op and insert a to_copy (QNN Convert or Cast) after it.
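As a rough sketch of how that pattern can be located in the exported graph (using the quantized_decomposed ops visible in the dump above; the real pass does more bookkeeping):

import torch

Q = torch.ops.quantized_decomposed.quantize_per_tensor.default
DQ = torch.ops.quantized_decomposed.dequantize_per_tensor.default

def find_requantize_chains(gm: torch.fx.GraphModule):
    """Yield (producer, q_hi, dq_hi, q_lo, dq_lo) for q->dq->q->dq chains."""
    for q_hi in gm.graph.nodes:
        if q_hi.target != Q:
            continue
        dq_hi = next(iter(q_hi.users), None)
        if dq_hi is None or dq_hi.target != DQ:
            continue
        q_lo = next(iter(dq_hi.users), None)
        if q_lo is None or q_lo.target != Q:
            continue
        dq_lo = next(iter(q_lo.users), None)
        if dq_lo is None or dq_lo.target != DQ:
            continue
        # Arg 5 of quantize_per_tensor is the target dtype; a requantize
        # chain changes the width (e.g. 16-bit container -> uint8), which
        # is where the to_copy (QNN Convert/Cast) gets inserted.
        if q_hi.args[5] != q_lo.args[5]:
            yield q_hi.args[0], q_hi, dq_hi, q_lo, dq_lo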

shewu-quic avatar Oct 03 '24 07:10 shewu-quic

Got it, thanks. Note that the command should contain --soc_model SSG2115P for the correct VTCM size (this needs PR https://github.com/cccclai/executorch-1/pull/1, though):

python -m executorch.examples.models.llama2.export_llama --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w --soc_model SSG2115P

chiwwang avatar Oct 03 '24 08:10 chiwwang

Thanks folks! I was able to get the model running with the embedding/matmul lowered with these changes. Maybe we can extend the SoC table? The change looks reasonable to me.

cccclai avatar Oct 03 '24 18:10 cccclai

layer norm op lowering:

We have a different model that uses layernorm instead of rmsnorm. Because the runtime only recently bumped to 2.25 and the current model still uses layernorm, I'll make changes on this PR, together with the PRs you folks sent, to test both layernorm and rmsnorm.

[edit]: made some progress on it. The quant node on the bias looks suspicious... it's:

    %dequantize_per_tensor_2 : [num_users=1] = call_function[target=torch.ops.quantized_decomposed.dequantize_per_tensor.default](args = (%b__frozen_param2, 9.5367431640625e-07, 0, -2147483648, 2147483647, torch.int32), kwargs = {})
...
    %layer_norm : [num_users=1] = call_function[target=torch.ops.aten.layer_norm.default](args = (%dequantize_per_tensor_11, [64], %dequantize_per_tensor_1, %dequantize_per_tensor_2), kwargs = {})

...

cccclai avatar Oct 03 '24 19:10 cccclai

In the meantime, we're tracking latency (both model loading time and inference time), memory, power, and accuracy for production. Latency and accuracy are easier to measure; how about memory and power?

cccclai avatar Oct 03 '24 19:10 cccclai

This pull request was exported from Phabricator. Differential Revision: D63736779

facebook-github-bot avatar Oct 03 '24 21:10 facebook-github-bot

Hi @cccclai, I added a PR to quantize the embedding op and 16x8 matmul. I ran this model, and it fully delegates. If you have any problems, please let me know.

python -m executorch.examples.models.llama2.export_llama --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w

Thanks! I was able to lower the embedding; however, the latency seems very close to the CPU version. Maybe we'd get better memory usage from the QNN embedding? I feel like the alternative solutions include:

  0. use CPU fp embedding
  1. use 16x8 QNN embedding
  2. use CPU 4-bit embedding https://github.com/pytorch/executorch/blob/13408b9848b1a776f03ff0fbac5f18b6347ff64a/kernels/quantized/cpu/op_embedding4b.cpp

and then maybe we'd have a better understanding of the latency/memory for these options (a back-of-envelope memory comparison is sketched below).
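For a rough sense of the memory side, assuming illustrative Llama-style dimensions (vocab 32000 x dim 4096; the actual model dims may differ):

# Back-of-envelope embedding-table sizes for the three options above.
vocab, dim = 32_000, 4_096
params = vocab * dim

for name, bits in [
    ("0. cpu fp32 embedding", 32),
    ("1. 16-bit qnn embedding", 16),
    ("2. cpu 4-bit embedding", 4),
]:
    print(f"{name:>26}: {params * bits / 8 / 2**20:6.1f} MiB")

# -> 500.0 MiB, 250.0 MiB, and 62.5 MiB respectively (plus small
#    scale/zero-point overhead for the quantized variants), so 4-bit
#    is the clear winner on memory even if latency is a wash.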

cccclai avatar Oct 03 '24 21:10 cccclai

Oh~ sure, let me add more description to this PR. For the 16x8 matmul op, I think it can be divided into two cases according to whether the KV cache is used:
- If the KV cache is used, we can annotate the past KV (input), i.e. the second input of matmul, with 8 bits to reduce the size of the input tensor.
- If the KV cache is not used, I think we can just annotate matmul as 16x8 to improve performance.

By default, we annotate matmul as 16x16 in 16-bit quantization, and we can override that with add_custom_quant_annotations.

This is working well, thanks! Also, do you happen to know how the latency/memory compare between 16x8 and 8x8?

cccclai avatar Oct 03 '24 21:10 cccclai

[edit]: made some progress on it. The quant node on the bias looks suspicious... it's:

    %dequantize_per_tensor_2 : [num_users=1] = call_function[target=torch.ops.quantized_decomposed.dequantize_per_tensor.default](args = (%b__frozen_param2, 9.5367431640625e-07, 0, -2147483648, 2147483647, torch.int32), kwargs = {})
...
    %layer_norm : [num_users=1] = call_function[target=torch.ops.aten.layer_norm.default](args = (%dequantize_per_tensor_11, [64], %dequantize_per_tensor_1, %dequantize_per_tensor_2), kwargs = {})

...

We quantize the bias node to int32 by default: https://github.com/cccclai/executorch-1/blob/35387f6de7c731e8d3f52ce504c2abd912c6f096/backends/qualcomm/quantizer/utils.py#L1074
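That lines up with the dq args in the graph dump above (int32 with the full ±2^31 range and a tiny scale). A minimal sketch of the usual convention, assuming PT2E's DerivedQuantizationSpec (the helper name and node arguments are illustrative):

import torch
from torch.ao.quantization.quantizer import DerivedQuantizationSpec

def _derive_bias_qparams(obs_or_fqs):
    # Convention: bias_scale = act_scale * weight_scale, zero_point = 0,
    # so the int32 bias adds directly into the integer accumulator.
    act_scale, _ = obs_or_fqs[0].calculate_qparams()
    weight_scale, _ = obs_or_fqs[1].calculate_qparams()
    scale = (act_scale * weight_scale).to(torch.float32)
    return scale, torch.zeros_like(scale, dtype=torch.int32)

def make_bias_spec(node, act_node, weight_node) -> DerivedQuantizationSpec:
    """Bias qparams derived from the op's activation and weight observers."""
    return DerivedQuantizationSpec(
        derived_from=[(act_node, node), (weight_node, node)],
        derive_qparams_fn=_derive_bias_qparams,
        dtype=torch.int32,
        quant_min=torch.iinfo(torch.int32).min,  # -2147483648, as in the dump
        quant_max=torch.iinfo(torch.int32).max,  #  2147483647
        qscheme=torch.per_tensor_symmetric,
    )

Under this convention the tiny scale in the dump (9.5367431640625e-07, i.e. 2^-20) is plausibly just act_scale x weight_scale rather than a bug.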

shewu-quic avatar Oct 04 '24 02:10 shewu-quic

This is working well, thanks! Also, do you happen to know how the latency/memory compare between 16x8 and 8x8?

Do you mean quantizing the model as 8x8? Maybe we could give it a try, but I doubt we'd get reasonable accuracy with it.

shewu-quic avatar Oct 04 '24 02:10 shewu-quic