[Draft] Qualcomm AI Engine Direct - Enable story llama model in quantized and fp16
Summary:
- Fully delegate the Meta llama model to QNN
- Add simple calibration
- Use a custom fallback op to split the graph
- Add a model sharding argument
- Add the spill-fill feature
- Keep int64 input tensors to minimize changes in this PR; this causes the embedding op to fall back to CPU. If pos_ids is changed to int32, the model will be fully delegated.
There are still accuracy issues for llama 7b in 16a4w, and more sophisticated quantization algorithms are needed. Note that to run llama 7b you need to specify num_sharding because of memory limitations on the device, and it is recommended to reboot the device before running to ensure it has enough free memory.
Install executorch and backend lib:
cmake \
-DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK}/build/cmake/android.toolchain.cmake" \
-DANDROID_ABI="${ANDROID_ABI}" \
-DANDROID_PLATFORM=android-23 \
-DCMAKE_INSTALL_PREFIX=cmake-android-out \
-DCMAKE_BUILD_TYPE=Release \
-DEXECUTORCH_ENABLE_LOGGING=1 \
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-DEXECUTORCH_BUILD_XNNPACK=ON \
-DEXECUTORCH_BUILD_QNN=ON \
-DQNN_SDK_ROOT=$QNN_SDK_ROOT \
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-DXNNPACK_ENABLE_ARM_BF16=OFF \
-Bcmake-android-out .
cmake --build cmake-android-out -j4 --target install --config Release
Build llama runner:
cmake \
-DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK"/build/cmake/android.toolchain.cmake \
-DANDROID_ABI="${ANDROID_ABI}" \
-DANDROID_PLATFORM=android-23 \
-DCMAKE_INSTALL_PREFIX=cmake-android-out \
-DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
-DEXECUTORCH_BUILD_XNNPACK=ON \
-DEXECUTORCH_BUILD_QNN=ON \
-DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
-DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
-Bcmake-android-out/examples/models/llama2 examples/models/llama2
cmake --build cmake-android-out/examples/models/llama2 -j4 --config Release
Export llama with QNN:
# fp16
python -m examples.models.llama2.export_llama -t tokenizer.model -p <params.json> \
-c <checkpoint.pth> --use_kv_cache --qnn --disable_dynamic_shape
# 8a8w
python -m examples.models.llama2.export_llama -t tokenizer.model -p <params.json> \
-c <checkpoint.pth> --use_kv_cache --qnn --pt2e_quantize qnn_8a8w
# 16a4w
python -m examples.models.llama2.export_llama -t tokenizer.model -p <params.json> \
-c <checkpoint.pth> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w
# llama 7b 16a4w (recommended)
python -m examples.models.llama2.export_llama -t tokenizer.model -p <params.json> \
-c <checkpoint.pth> --use_kv_cache --qnn --disable_dynamic_shape --num_sharding 8 \
--pt2e_quantize qnn_16a4w
Local Results:
- llama-7b-chat with 8 splits in 16a4w
- story llama in 16a4w
- story llama in 8a8w
- story llama in fp16
:link: Helpful Links
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4030
:x: 7 New Failures
As of commit e68e225a742cc0132064cd061343319c8216a8ef with merge base de300e0ca12627f83ac31a4341fac7f01a55f077:
NEW FAILURES - The following jobs have failed:
- Lint / lintrunner / linux-job (gh)
  >>> Lint for extension/llm/export/quantizer_lib.py
- pull / test-llama-runner-linux (fp32, buck2, portable) / linux-job (gh)
  RuntimeError: Command docker exec -t 848ba08cae20207ef9a17eb673bf8915d854ec714bb36164f993cdcc251c8459 /exec failed with exit code 1
- pull / test-llama-runner-linux (fp32, buck2, xnnpack+custom) / linux-job (gh)
  RuntimeError: Command docker exec -t 849daf76229a39ec6ea00140c5728443cd86e10eb0d8ce7e1f4f173b8213d64e /exec failed with exit code 1
- pull / test-llama-runner-linux (fp32, buck2, xnnpack+custom+qe) / linux-job (gh)
  RuntimeError: Command docker exec -t d3d8b39a795e42ca074a518be2cc1174c465f482c723e8fe793826ed923ebaf8 /exec failed with exit code 1
- pull / test-llama-runner-linux (fp32, cmake, portable) / linux-job (gh)
  RuntimeError: Command docker exec -t 70db20e2d90b08015f49f4016e982fc080d0032145617d861b7c0ac3cb9e5725 /exec failed with exit code 1
- pull / test-llama-runner-linux (fp32, cmake, xnnpack+custom) / linux-job (gh)
  RuntimeError: Command docker exec -t f4efd8d911c3835189319323fb1a0a383457ca5625e815a08b76b621b1447366 /exec failed with exit code 1
- pull / test-llama-runner-linux (fp32, cmake, xnnpack+custom+qe) / linux-job (gh)
  RuntimeError: Command docker exec -t db96eca930b8f0acd70d3a58e06f6a259c0696e2f32e30446992caddd4a95376 /exec failed with exit code 1
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@shewu-quic great job! does it support llama2 7b?
Unfortunately, it does not support llama2 7b in this draft, but we are actively working on enabling it. We are investigating how to quantize llama2 7b with the QNN Quantizer to get reasonable accuracy. Maybe you could take a look at another draft.
Another challenge we need to conquer is model sharding.
Actually I have a version to support model sharding and can share the example code
Hi @cccclai,
The accuracy issue seems to be related to insufficient calibration. May I know whether you have any plan to use more data to calibrate the model? If I add the following, it generates reasonable English sentences in the quantized model.
def calibrate(self, module: torch.fx.GraphModule):
    # Run a short greedy decode through the prepared module so the observers
    # see realistic activations.
    from sentencepiece import SentencePieceProcessor

    sp_model = SentencePieceProcessor(model_file="tokenizer.model")

    # TODO: change criteria & support batch inputs if necessary
    pos = torch.tensor(0, dtype=torch.int32)
    token_list = [sp_model.bos_id()]
    user_prompts = ["Once", "upon", "a", "time"]
    for prompt in user_prompts:
        token_list += sp_model.encode(prompt)

    def sample_top_p(probs: torch.Tensor, top_p: float) -> torch.Tensor:
        probs_sort, probs_indices = torch.sort(probs, dim=-1, descending=True)
        probs_sum = torch.cumsum(probs_sort, dim=-1)
        mask = probs_sum - probs_sort > top_p
        probs_sort[mask] = 0
        probs_sort /= probs_sort.sum(dim=-1, keepdim=True)
        next_token = torch.multinomial(probs_sort, num_samples=1)
        return probs_indices.gather(dim=-1, index=next_token)

    with torch.no_grad():
        while token_list[-1] != sp_model.eos_id() and pos < 128:
            logits = module(
                torch.full((1, 1), token_list[pos]),
                torch.full((1, 1), pos),
            )
            pos += 1
            if pos >= len(token_list):
                # Greedy decoding; top-p sampling is kept here for reference.
                token_list.append(torch.argmax(logits[:, -1], dim=-1).item())
                # probs = torch.softmax(logits[:, -1] / 0.8, dim=-1)
                # token_list.append(sample_top_p(probs, 0.9).item())
    print(f"calibration data:\n{sp_model.decode(token_list)}")

....
m = prepare_pt2e(self.pre_autograd_graph_module, composed_quantizer)
# Calibrate
self.calibrate(m)
# m(*self.example_inputs)
m = convert_pt2e(m)
....
If I add the following, it generates reasonable English sentences in the quantized model.
Ah yes, we will use a more generic way to calibrate. I merged this PR (https://github.com/pytorch/executorch/pull/3756) so that we can use lm eval to calibrate the model.
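Not the lm-eval flow from #3756 — just a minimal sketch of what feeding more calibration text through the prepared module could look like (teacher-forcing over a list of texts), reusing the (token, position) calling convention from the calibrate() snippet above; the helper name and signature are made up:

import torch
from sentencepiece import SentencePieceProcessor

def calibrate_with_texts(module, texts, tokenizer_path="tokenizer.model", max_len=128):
    sp = SentencePieceProcessor(model_file=tokenizer_path)
    with torch.no_grad():
        for text in texts:
            # Teacher-force the real tokens instead of sampling, so the
            # observers see activations for realistic (token, position) pairs.
            tokens = [sp.bos_id()] + sp.encode(text)
            for pos, token in enumerate(tokens[:max_len]):
                module(
                    torch.full((1, 1), token),
                    torch.full((1, 1), pos),
                )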
Actually I have a version to support model sharding and can share the example code
May I know how you shard the model? I have three ways of sharding the model, but I think all of them are a bit hardcoded...
- Fall back a specified aten_add_tensor op; there seems to be a fixed number of add ops in each layer.
- Insert a clone op after the specific layer; in QNN we fall back the clone op.
- Re-write the Transformer.
Ah yes, we will use a more generic way to calibrate. I merged this PR (https://github.com/pytorch/executorch/pull/3756) so that we can use lm eval to calibrate the model.
Thanks for the information. Will it be used in export_llama_lib?
Sorry for the delay, I was distracted by the performance review last week... I use the ExecutorBackend and tag every 8 layers; I will publish it soon. I think having a no-op op (maybe a custom op instead of clone, because clone can also be expensive) for cutting the model can also be a generic way to shard the model.
This is my current change, still trying to debug an op but it's getting close.. model_sharding.patch
This is pretty much the idea
I think it is still worth exploring the custom no-op solution to break the graph. What is your preference?
This is my current change, still trying to debug an op but it's getting close.. model_sharding.patch
This is pretty much the idea
Wow, this makes it clear to me how to run the sharded model at runtime. I will try this patch as soon as possible!!
I think it is still worth exploring the custom no-op solution to break the graph. What is your preference?
I think it is a good idea. In fact, I have tried to hard-code inserting a custom op in llama_transformer.py and fall it back in the QNN partitioner. It should work after I implement the custom kernel. But I have no idea how to generically insert the custom op with a transformation. Do you have any ideas?
# custom_fallback_op.py
import torch
from torch.library import impl, Library

fallback_op_lib = Library("qnn_llama", "DEF")
fallback_op_lib.define("fallback(Tensor input) -> Tensor")

@impl(fallback_op_lib, "fallback", dispatch_key="CompositeExplicitAutograd")
def fallback_impl(a: torch.Tensor) -> torch.Tensor:
    # Identity op; it only exists so the partitioner can break the graph here.
    return a

# registering the out variant.
fallback_op_lib.define(
    "fallback.out(Tensor input, *, Tensor(a!) output) -> Tensor(a!)"
)

# split_graph.py
import torch
from executorch.exir.dialects._ops import ops as exir_ops
from executorch.exir.pass_base import ExportPass, PassResult

class SplitGraph(ExportPass):
    def __init__(self, shares):
        super().__init__()
        self.shares = shares

    def _insert_fallback_op(
        self, graph_module: torch.fx.GraphModule
    ) -> torch.fx.GraphModule:
        for node in graph_module.graph.nodes:
            if "nn_module_stack" in node.meta:
                module_values_list = list(node.meta["nn_module_stack"].values())
                full_qualified_name = module_values_list[-1][0]
                owning_module = module_values_list[-1][1]
                print(
                    f"[Hutton] node: {node}; full_qualified_name: {full_qualified_name}; "
                    f"owning_module: {owning_module}; meta: {node.meta}"
                )
                # if node not in [the node which wants to find]:
                #     continue
                with graph_module.graph.inserting_after(node):
                    users = list(node.users.keys())
                    inserted_node = graph_module.graph.create_node(
                        "call_function",
                        exir_ops.edge.qnn_llama.fallback.default,
                        (node,),
                    )
                    inserted_node.meta["val"] = node.meta["val"]
                    for user in users:
                        user.replace_input_with(node, inserted_node)
        return graph_module

    def call(self, graph_module: torch.fx.GraphModule):
        self._insert_fallback_op(graph_module)
        graph_module.recompile()
        return PassResult(graph_module, True)
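For the commented-out check above ("the node which wants to find"), here is a minimal sketch of one way to pick the shard boundaries. It assumes the nn_module_stack qualified names look like "layers.<i>..." (as in the metadata dumps later in this thread) and that graph nodes are iterated in topological order; the helper name is made up.

import re
import torch

def find_shard_boundary_nodes(graph_module: torch.fx.GraphModule, num_layers: int, shares: int):
    # Remember the last node seen for every decoder layer; because nodes are
    # visited in topological order, the final assignment per layer index is
    # that layer's last node (e.g. the trailing add).
    last_node_per_layer = {}
    for node in graph_module.graph.nodes:
        stack = node.meta.get("nn_module_stack")
        if not stack:
            continue
        qualified_name = list(stack.values())[-1][0]  # e.g. "layers.3.attention.SDPA"
        match = re.match(r"layers\.(\d+)", qualified_name)
        if match:
            last_node_per_layer[int(match.group(1))] = node
    # Cut after every (num_layers // shares) layers; `shares` shards need shares - 1 cuts.
    layers_per_share = num_layers // shares
    return [
        last_node_per_layer[idx]
        for idx in range(layers_per_share - 1, num_layers - 1, layers_per_share)
    ]

The returned nodes would replace the commented-out check in _insert_fallback_op, so the fallback op is only inserted at shard boundaries.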
This is great. I think if we have a custom graph-break op, it doesn't have to be QNN-specific and can be applicable to other flows or backends.
But I have no idea how to generically insert the custom op with a transformation. Do you have any ideas?
Like where to insert this custom op inside the graph? I feel like we can find the last node of every 8 layers based on source_fn and the module stack. Is that not working?
Another question: I imagine we need to unload the QNN context binary in the graph-break custom op. Is that what you're doing?
Also, the patch is pretty much the idea. There is a bug I need to fix before it works properly... I'll send another patch soon.
This is great. I think if we have a custom graph-break op, it doesn't have to be QNN-specific and can be applicable to other flows or backends.
Sounds great.
But I have no idea how to generically insert the custom op with a transformation. Do you have any ideas?
Like where to insert this custom op inside the graph? I feel like we can find the last node of every 8 layers based on source_fn and the module stack. Is that not working?
I originally thought so too, but I found it returns multiple nodes in the same layer. The last node of the layer is an add node. However, you can find #L466 and #L470, which have the same source_fn and module stack: https://github.com/pytorch/executorch/blob/28a45cdbe1bf1b41673b6418a09b040e1c0733e9/examples/models/llama2/llama_transformer.py#L461 So maybe I also need stack_trace to identify which node we want. Is it stable?
Another question: I imagine we need to unload the QNN context binary in the graph-break custom op. Is that what you're doing?
Do you mean we need to handle the life cycle of the processed blob in the custom op? Originally, we load the QNN context binary in the init function of QnnBackend and unload it in the destroy function of QnnBackend. So the life cycle of the QNN context binary is decided by the processed blob, which is kept by the ExecuTorch runtime framework. Is this understanding correct?
Also, the patch is pretty much the idea. There is a bug I need to fix before it works properly... I'll send another patch soon.
Thanks a lot.
This PR is partly based on https://github.com/pytorch/executorch/pull/4142
We will continue the llama2-7b work in this PR.
The last node of the layer is an add node. However, you can find #L466 and #L470, which have the same source_fn and module stack. So maybe I also need stack_trace to identify which node we want. Is it stable?
Hmm, I was thinking whether finding the last add node for the current layer is sufficient, but maybe I missed something. Combining stack_trace also sounds reasonable.
Do you mean we need to handle the life cycle of the processed blob in the custom op? Originally, we load the QNN context binary in the init function of QnnBackend and unload it in the destroy function of QnnBackend. So the life cycle of the QNN context binary is decided by the processed blob, which is kept by the ExecuTorch runtime framework. Is this understanding correct?
Yeah, that's my understanding too. However, for 4 shards, we need to init(shard_1) -> destroy(shard_1) -> init(shard_2) -> destroy(shard_2) -> ...; if we do init(shard_1) -> init(shard_2) -> init(shard_3) -> init(shard_4) -> ... -> destroy(shard_1) -> ... -> destroy(shard_4), will it OOM on the DSP?
I think it will not OOM if we use the multi-context feature, because I could run the composite llama on the device.
Do you mean you were able to use multi-context for the 7b model 😮 To my understanding, multi-context means multiple graphs in the QNN context binary. How does it work with 4 shards (4 sets of graphs) in this case?
It works in the multiple-pte case. If we want to enable multi-context, we just need to set the right group handle for each pte, which is the first context handle. For this purpose, we use a static variable to accomplish it. We also need to set max_sf_buf_size, which is the size of the blob at AOT. You can find the details in the QNN doc.
I was checking the doc:
When multiple models are executed in sequence, it is possible to reserve a single spill-fill allocation that can be re-used across all the splits. This has the benefit of reducing RAM usage for the application at negligible performance impact.
To my understanding, spill-fill is used for the intermediate tensors among the splits, like split_1 -> output (in spill-fill) -> split_2. It's for the inputs/outputs like activations, but I'm not sure if it does any optimization for weights. Did I miss anything?
To my understanding, spill-fill is used for the intermediate tensors among the splits, like split_1 -> output (in spill-fill) -> split_2. It's for the inputs/outputs like activations, but I'm not sure if it does any optimization for weights. Did I miss anything?
According to your description, that should be the shared buffer (zero copy), which eliminates data copies between multiple ptes on the CPU and the HTP accelerator. It's for the inputs/outputs of the graph. We have implemented it in ExecuTorch and use it in our llama runner.
Spill-fill buffer sharing is an optimization that allocates one buffer shared by all the contexts of an LLM. This way, we do not need to allocate that space for each of the graphs.
Spill-fill buffer sharing is an optimization that allocates one buffer shared by all the contexts of an LLM. This way, we do not need to allocate that space for each of the graphs.
That's my understanding too, and I thought it was for re-using the input/output across all splits in VTCM, but not for weights across all splits. Like
...act_1 -> split_1 -> act_2 -> split_2 -> act_3 -> split_4...
here act_1, act_2 and act_3 will share the same buffer, also known as the spill tensor buffer here.
Hey, I probably need some help to fix a matmul validation error - it causes a graph break, but I'm not sure what the issue is. It only shows up after I apply the model sharding patch, but the graph inside the qnn_partitioner is supposed to be the same as the first layer of the graph.
I debugged inside op_matmul.py; the input nodes for matmul are exactly the same in the validation success and failure cases. The error is:
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> validateNativeOps aten_matmul_default:qti.aisw:MatMul op validator (quantized and FP16) failed 3110 and 3110
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> QnnBackend_validateOpConfig failed 3110
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to validate op aten_matmul_default with error 0xc26
For both the op validation success and failure cases, the input nodes are exactly the same. The first matmul op's input nodes are [aten__softmax_default, aten_permute_copy_default_5] and the second matmul's input nodes are [aten_permute_copy_default_3, aten_permute_copy_default_6]. Am I missing anything?
That's my understanding too, and I thought it was for re-using the input/output across all splits in VTCM, but not for weights across all splits. Like
...act_1 -> split_1_1 -> intermediate tensor_0 -> split_1_2 -> act_2 -> split_2 -> act_3 -> split_4...
here act_1, act_2 and act_3 will share the same buffer, also known as the spill tensor buffer here.
I feel we are misaligned on some terms.
Shared buffer (zero copy)
The purpose is to avoid data copies between the CPU and the HTP. In addition, we can create a bigger RPC memory region to store act_1, act_2, etc. We have implemented this in our llama2 runner: it creates one RPC memory region to store all inputs and outputs and just sets the correct offset for each I/O tensor.
Spill-fill buffer
VTCM space on each SoC is limited; hence, when we need to make space within this region, data is copied back to DDR (the spill-fill buffer in this case). Therefore, we allocate one spill-fill buffer for the intermediate tensors in each graph (split).
VTCM
It is a hardware resource that provides fast loads and stores. It is controlled by the HTP, and we can only set its maximum usage.
So back to your example: act_1, act_2 and act_3 (I/O tensors) will share the same buffer, which is RPC memory rather than the spill-fill buffer. The intermediate tensors in each graph (split) will use a spill-fill buffer.
...act_1 (rpc_mem) -> split_1_1 -> intermediate tensor_0 ... (spill-fill buffer) -> split_1_2 -> act_2 (rpc_mem) -> split_2 (spill-fill buffer) -> act_3 (rpc_mem) -> split_4 (spill-fill buffer) ...
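Not the QNN or ExecuTorch API — just a toy sketch of the offset bookkeeping mentioned above (one RPC memory region holds every I/O tensor, and each tensor only records its byte offset into that region); the helper name is made up:

def plan_shared_io_buffer(tensor_sizes_bytes, alignment=64):
    # Pack all I/O tensors into a single region, aligning each tensor's offset.
    offsets = []
    cursor = 0
    for size in tensor_sizes_bytes:
        cursor = (cursor + alignment - 1) // alignment * alignment
        offsets.append(cursor)
        cursor += size
    return cursor, offsets

# e.g. three activations of 4 KB, 8 KB and 4 KB sharing one RPC region
total_bytes, io_offsets = plan_shared_io_buffer([4096, 8192, 4096])
print(total_bytes, io_offsets)  # 16384 [0, 4096, 12288]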
Hey, I probably need some help to fix a matmul validation error - it causes a graph break, but I'm not sure what the issue is. It only shows up after I apply the model sharding patch, but the graph inside the qnn_partitioner is supposed to be the same as the first layer of the graph.
May I know which version of QNN you are using?
If you use quantization, I think the problem is a missing quant attr or something wrong with the quant parameters in the node's meta. Could you help check it?
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/HtpOpDefSupplement.html
I'm using QNN 2.23, and the matmul node metadata for the success case is:
{'stack_trace': ' File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 492, in forward\n h = layer(\n File "/home/chenlai/local/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 429, in forward\n h = self.attention.forward(\n File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 330, in forward\n output = self.SDPA(input_pos, q, k, v, bsz, seqlen, self.mask)\n File "/home/chenlai/local/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File "/data/users/chenlai/executorch/examples/models/llama2/source_transformation/sdpa.py", line 144, in forward\n attn_weight = q @ k.transpose(-2, -1) * scale_factor\n', 'nn_module_stack': {'L__self__': ('', 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'), 'L__self___layers_0': ('layers.0', 'executorch.examples.models.llama2.llama_transformer.TransformerBlock'), 'L__self___layers_0_attention_SDPA': ('layers.0.attention.SDPA', 'examples.models.llama2.source_transformation.sdpa.SDPAQNN')}, 'torch_fn': ('matmul.default_1', 'OpOverload.matmul.default'), 'source_fn_stack': [('matmul', <built-in function matmul>)], 'original_aten': <OpOverload(op='aten.view', overload='default')>, 'from_node': [('matmul', <built-in function matmul>), ('view_17', <OpOverload(op='aten.view', overload='default')>), ('view_copy_17', <OpOverload(op='aten.view_copy', overload='default')>), ('aten_matmul_default', <EdgeOpOverload: aten.matmul.default>: schema = aten::matmul(Tensor self, Tensor other) -> Tensor)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 8, 1, 128)), 'tensor_meta': None, 'debug_handle': 239, 'quant_attrs': {'scale': 3.819849371211603e-05, 'zero_point': 18610, 'quant_min': 0, 'quant_max': 65535, 'dtype': torch.int32, 'encoding': <EdgeOpOverload: quantized_decomposed.quantize_per_tensor.default>: schema = quantized_decomposed::quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor}}
Failure case:
{'stack_trace': ' File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 492, in forward\n h = layer(\n File "/home/chenlai/local/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 429, in forward\n h = self.attention.forward(\n File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 330, in forward\n output = self.SDPA(input_pos, q, k, v, bsz, seqlen, self.mask)\n File "/home/chenlai/local/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File "/data/users/chenlai/executorch/examples/models/llama2/source_transformation/sdpa.py", line 144, in forward\n attn_weight = q @ k.transpose(-2, -1) * scale_factor\n', 'nn_module_stack': {'L__self__': ('', 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'), 'L__self___layers_0': ('layers.0', 'executorch.examples.models.llama2.llama_transformer.TransformerBlock'), 'L__self___layers_0_attention_SDPA': ('layers.0.attention.SDPA', 'examples.models.llama2.source_transformation.sdpa.SDPAQNN')}, 'torch_fn': ('matmul.default_1', 'OpOverload.matmul.default'), 'source_fn_stack': [('matmul', <built-in function matmul>)], 'original_aten': <OpOverload(op='aten.view', overload='default')>, 'from_node': [('matmul', <built-in function matmul>), ('view_17', <OpOverload(op='aten.view', overload='default')>), ('view_copy_17', <OpOverload(op='aten.view_copy', overload='default')>), ('aten_matmul_default_1', <EdgeOpOverload: aten.matmul.default>: schema = aten::matmul(Tensor self, Tensor other) -> Tensor), ('quantized_decomposed_quantize_per_tensor_default_54', <EdgeOpOverload: quantized_decomposed.quantize_per_tensor.default>: schema = quantized_decomposed::quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor), ('aten_matmul_default_1', <EdgeOpOverload: aten.matmul.default>: schema = aten::matmul(Tensor self, Tensor other) -> Tensor)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 8, 1, 128)), 'tensor_meta': None, 'debug_handle': 239, 'delegation_tag': 'L__self___layers_0_1', 'quant_attrs': {'scale': 3.819849371211603e-05, 'zero_point': 18610, 'quant_min': 0, 'quant_max': 65535, 'dtype': torch.int32, 'encoding': <EdgeOpOverload: quantized_decomposed.quantize_per_tensor.default>: schema = quantized_decomposed::quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor}}
They look very similar... how would you debug this next?
How about the quant attr of matmul's inputs?
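A small helper sketch for that check (the function name is made up; it assumes the edge graph still contains aten.matmul.default nodes, as in the error above), dumping the quant attrs of every matmul input so the success and failure exports can be diffed:

from executorch.exir.dialects._ops import ops as exir_ops

def dump_matmul_input_quant_attrs(graph_module):
    # Print the quant attrs recorded on each matmul input node.
    for node in graph_module.graph.nodes:
        if node.target == exir_ops.edge.aten.matmul.default:
            for inp in node.args:
                print(node.name, inp.name, inp.meta.get("quant_attrs"))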
Ohhh~! I suddenly get your point about the model sharding backend. The difference between this llama and our composite llama is one pte file versus multiple pte files. With multiple pte files, we can register RPC memory for all inputs and outputs of each split in the runner. But this llama, which is a single pte file, cannot access the I/O tensors between the splits. Therefore, you wrote the backend to handle that, right?
I think you do not need to worry about the weight tensors of each split once you set max_sf_buf_size, because they will be optimized by the HTP.
In my view, we just need to figure out how to manage RPC memory for the I/O tensors of each split in the model sharding backend. Did I miss anything?
I feel like there is still a bit of misalignment 😅 ... but if the composite llama includes multiple .pte files, I think I have what I need now.
Back to debugging the matmul op validation: I still haven't figured out the reason... the input node.meta["quant_attrs"] looks fine too. Is there a way to build a debug version of the AOT library to get more error messages?
I set debug=True in the compile spec and got more logging:
[INFO] [Qnn ExecuTorch]: Validating Op Config aten_matmul_default_1.
[INFO] [Qnn ExecuTorch]: Validating Op Type MatMul == MatMul.
[INFO] [Qnn ExecuTorch]: Validating Inputs.
[INFO] [Qnn ExecuTorch]: Validating Input[0] of ID 0.
[INFO] [Qnn ExecuTorch]: Validating Input[1] of ID 0.
[INFO] [Qnn ExecuTorch]: Validating Params.
[INFO] [Qnn ExecuTorch]: Validating Outputs.
[INFO] [Qnn ExecuTorch]: Validating Output[0] of ID 0.
[INFO] [Qnn ExecuTorch]: QnnDsp <V> Validating Op Config aten_matmul_default_1.
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Input[1] has incorrect Value 69, expected >= 73.
[INFO] [Qnn ExecuTorch]: QnnDsp <V> validateNativeOps aten_matmul_default_1:qti.aisw:MatMul htp op validator (quantized) failed 3110
[INFO] [Qnn ExecuTorch]: QnnDsp <V> Validating Op Config aten_matmul_default_1.
[INFO] [Qnn ExecuTorch]: QnnDsp <V> Input[0] has incorrect Datatype 0x416.
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> validateNativeOps aten_matmul_default_1:qti.aisw:MatMul op validator (quantized and FP16) failed 3110 and 3110
[INFO] [Qnn ExecuTorch]: QnnDsp <V> registered validator failed => 3110
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> QnnBackend_validateOpConfig failed 3110
[INFO] [Qnn ExecuTorch]: QnnDsp <V> Wake up free backend 1 thread(s)
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to validate op aten_matmul_default_1 with error 0xc26
[WARNING] [Qnn ExecuTorch]: Qnn Backend op validation failed with error: 3110
Do any of these logs look suspicious?
