Mutable buffer fails to lower to QNN backend

Open leigao97 opened this issue 1 year ago • 11 comments

When I lower this toy example to the QNN backend, the linear operator causes an error.

    class MutableStateModule(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.register_buffer("state", torch.tensor(torch.zeros(1,1)))

        def forward(self, x):
            self.state.add_(torch.ones(1,1))
            return x @ self.state.T

Here is the error message:

Traceback (most recent call last):
  File "/home/lei/llama_mobile/python/android/test2.py", line 89, in <module>
    build_executorch_binary(
  File "/home/lei/llama_mobile/python/android/utils.py", line 215, in build_executorch_binary
    edge_prog.exported_program = to_backend(edge_prog.exported_program, qnn_partitioner)
  File "/home/lei/miniconda3/envs/executorch/lib/python3.10/functools.py", line 878, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/home/lei/llama_mobile/executorch/exir/backend/backend_api.py", line 363, in _
    partitioner_result = partitioner_instance(fake_edge_program)
  File "/home/lei/llama_mobile/executorch/exir/backend/partitioner.py", line 64, in __call__
    return self.partition(exported_program)
  File "/home/lei/llama_mobile/executorch/backends/qualcomm/partition/qnn_partitioner.py", line 138, in partition
    partitions = self.generate_partitions(edge_program)
  File "/home/lei/llama_mobile/executorch/backends/qualcomm/partition/qnn_partitioner.py", line 124, in generate_partitions
    return generate_partitions_from_list_of_nodes(
  File "/home/lei/llama_mobile/executorch/exir/backend/canonical_partitioners/pattern_op_partitioner.py", line 50, in generate_partitions_from_list_of_nodes
    partition_list = capability_partitioner.propose_partitions()
  File "/home/lei/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/partitioner.py", line 201, in propose_partitions
    if self.__is_node_supported(node) and node not in assignment:
  File "/home/lei/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/fx/passes/infra/partitioner.py", line 79, in __is_node_supported
    self.operator_support.is_node_supported(dict(self.graph_module.named_modules()), node)
  File "/home/lei/llama_mobile/executorch/backends/qualcomm/partition/qnn_partitioner.py", line 79, in is_node_supported
    op_wrapper = self.node_visitors[node.target.__name__].define_node(
  File "/home/lei/llama_mobile/executorch/backends/qualcomm/builders/op_linear.py", line 52, in define_node
    weight_tensor_wrapper = self.define_tensor(
  File "/home/lei/llama_mobile/executorch/backends/qualcomm/builders/node_visitor.py", line 301, in define_tensor
    dims = [1] if len(tensor.size()) == 0 else tensor.size()
AttributeError: 'NoneType' object has no attribute 'size'

If I replace the linear operation x @ self.state.T with the addition operation x + self.state.T, it works.
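
For reference, here is a minimal sketch of the variant that lowers without the error (the class name is illustrative; only the final op differs from the failing example above):

    import torch

    class MutableStateAddModule(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.register_buffer("state", torch.zeros(1, 1))

        def forward(self, x):
            # Same in-place buffer update as the failing example.
            self.state.add_(torch.ones(1, 1))
            # Elementwise add partitions fine; x @ self.state.T hits the error above.
            return x + self.state.T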

leigao97 avatar Jun 27 '24 04:06 leigao97

@cccclai For QNN Lowering

Jack-Khuu avatar Jun 27 '24 05:06 Jack-Khuu

what does the graph look like for this toy model?

cccclai avatar Jun 27 '24 06:06 cccclai

The graph looks like this:

def forward(self, b_state, x):
    aten_full_default = executorch_exir_dialects_edge__ops_aten_full_default([1, 1], 1, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False)
    aten_add_tensor = executorch_exir_dialects_edge__ops_aten_add_Tensor(b_state, aten_full_default);  b_state = aten_full_default = None
    aten_linear_default = executorch_exir_dialects_edge__ops_aten_linear_default(x, aten_add_tensor);  x = None
    return (aten_add_tensor, aten_linear_default)

leigao97 avatar Jun 27 '24 07:06 leigao97

Is it after torch.export or to_edge? Mind sharing the repro script?

cccclai avatar Jun 27 '24 16:06 cccclai

It looks like the pre-autograd ATen dialect graph. Here is my script:

import torch

from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.qualcomm.utils.utils import (
    capture_program,
    generate_htp_compiler_spec,
    generate_qnn_executorch_compiler_spec,
)
from executorch.backends.qualcomm.serialization.qnn_compile_spec_schema import (
    QcomChipset,
)
from executorch.exir.backend.backend_api import to_backend


class MutableStateModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("state", torch.tensor(torch.zeros(1,1)))

    def forward(self, x):
        self.state.add_(torch.ones(1,1))
        return x @ self.state.T
    
model = MutableStateModule()

inputs = (torch.zeros(1,1),)

edge_prog = capture_program(model, inputs)

qnn_partitioner = QnnPartitioner(
    generate_qnn_executorch_compiler_spec(
        soc_model=QcomChipset.SM8650,
        backend_options=generate_htp_compiler_spec(use_fp16=True),
        debug=False,
        saver=False,
        shared_buffer=False,
    ),
)
edge_prog.exported_program = to_backend(edge_prog.exported_program, qnn_partitioner)
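
Continuing the script above, the graph posted earlier can be inspected right after capture_program (before the failing to_backend call) with something like the following; print_readable() is a standard torch.fx GraphModule method, and this is just one way to dump it:

    # Dump the edge-dialect graph produced by capture_program for inspection.
    edge_prog.exported_program.graph_module.print_readable()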

On my side, I need to set environment variables such as $EXECUTORCH_ROOT and PYTHONPATH before running the script. Here is the reference: https://pytorch.org/executorch/stable/build-run-qualcomm-ai-engine-direct-backend.html#setting-up-your-developer-environment

leigao97 avatar Jun 27 '24 18:06 leigao97

Are there any updates on this issue? Thanks!

leigao97 avatar Jun 28 '24 20:06 leigao97

Not yet - I think it's similar to this issue: https://github.com/pytorch/executorch/issues/4042

cccclai avatar Jun 29 '24 00:06 cccclai

The graph looks like this:

def forward(self, b_state, x):
    aten_full_default = executorch_exir_dialects_edge__ops_aten_full_default([1, 1], 1, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False)
    aten_add_tensor = executorch_exir_dialects_edge__ops_aten_add_Tensor(b_state, aten_full_default);  b_state = aten_full_default = None
    aten_linear_default = executorch_exir_dialects_edge__ops_aten_linear_default(x, aten_add_tensor);  x = None
    return (aten_add_tensor, aten_linear_default)

As shown in the graph, the linear op takes the output of the add op as its weight tensor. In the code below, weight_node is aten_add_tensor, and the tensor retrieved for it via get_parameter is None. https://github.com/pytorch/executorch/blob/5584b9e3c865edca239ec5df6346f1d1aabb0276/backends/qualcomm/builders/op_linear.py#L51 The weight_tensor_wrapper then needs the value of that weight tensor, which causes the error.
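
A rough sketch of the distinction at play (simplified for illustration; this is not the actual executorch source, and the helper name is made up): a lookup like get_parameter can only resolve placeholder nodes that map to named parameters or buffers in the exported program, so a value produced by another op, such as aten_add_tensor here, has nothing to resolve and comes back as None:

    from torch.export import ExportedProgram
    from torch.fx import Node

    def lookup_parameter(node: Node, prog: ExportedProgram):
        """Illustrative only: return the parameter/buffer tensor backing a
        placeholder node, or None when the value is produced by another op."""
        sig = prog.graph_signature
        name_map = {**sig.inputs_to_parameters, **sig.inputs_to_buffers}
        if node.op == "placeholder" and node.name in name_map:
            return prog.state_dict[name_map[node.name]]
        return None  # e.g. aten_add_tensor -> tensor.size() later raises AttributeError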

I replaced the get_parameter function with the get_tensor function and it seems to work.

weight_tensor = self.get_tensor(weight_node, node)

Is this an acceptable workaround? Thank you.

leigao97 avatar Jul 08 '24 04:07 leigao97

Hmm, does it work at runtime? I sort of doubt it...

cccclai avatar Jul 08 '24 21:07 cccclai

@leigao97 Hey, just following up on this: are you still blocked on this issue?

cccclai avatar Jul 18 '24 00:07 cccclai

Yes, the modification above was not correct. The reason I ran into this issue is that I would like to run an int8 weight-only quantized model on the QNN backend, following this procedure:

https://github.com/pytorch/executorch/blob/b448254a88edc6c30d8926abcebf5a1871d675cf/examples/models/llama2/source_transformation/quantize.py#L355

I found the root cause: if we perform any operation on the buffer, an operator node is inserted, and the output of that operator has no parameter value, so the linear op fails. In the quantized linear forward function above, the weight buffer is cast to floating point, which can also trigger this issue.
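
For context, here is a simplified sketch of the pattern being described (abridged and renamed for illustration; not the exact code from the linked quantize.py): the int8 weight lives in a buffer and is cast to the activation dtype inside forward, so the linear op's weight input becomes the output of a cast op rather than a plain parameter, hitting the same failure as the mutable-state example:

    import torch
    import torch.nn.functional as F

    class Int8WeightOnlyLinearSketch(torch.nn.Module):
        """Illustrative sketch of a weight-only int8 linear layer."""

        def __init__(self, in_features: int, out_features: int):
            super().__init__()
            self.register_buffer(
                "weight", torch.zeros(out_features, in_features, dtype=torch.int8)
            )
            self.register_buffer("scales", torch.ones(out_features))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The .to(...) cast inserts an op between the buffer and the linear,
            # so the QNN linear builder no longer sees a parameter as its weight.
            return F.linear(x, self.weight.to(dtype=x.dtype)) * self.scales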

For now, I am using the XNNPACK backend instead.

leigao97 avatar Jul 18 '24 00:07 leigao97

Is my understanding correct that the QNN backend assumes the linear op's weight input will always be a parameter, and not a "variable" coming from the graph itself? Is that the issue?

kimishpatel avatar Feb 05 '25 18:02 kimishpatel