
[Bug] [OpenCL] [Mali]: Models with too many nodes failed to profile/run with CL_OUT_OF_HOST_MEMORY error


  • If a graph executor model has too many nodes, tvmc run with the --profile option hangs; tracing its OpenCL calls shows a CL_OUT_OF_HOST_MEMORY error on the clFlush call that follows a long run of clEnqueueNDRangeKernel / clSetKernelArg calls. Running the same model normally, without profiling, works.

  • If the model has even more nodes, running it normally fails with the same error as well.

  • The cause is likely the same as in https://github.com/apache/tvm/issues/16276

  • The amount of available system RAM does not affect the issue, so it is probably an artificial limit inside the ARM vendor GPU driver?

  • Current workaround: manually split the model graph into smaller subgraphs (a splitting sketch follows the reproduction steps below).

  • Could this be solved by limiting the kernel launch queue size? (A sketch of the idea follows this list.)
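
To illustrate that last point, here is a minimal sketch of bounding the number of in-flight kernel launches. It uses pyopencl purely for illustration (TVM's OpenCL runtime is C++); the kernel, buffer size, and flush interval below are all made-up assumptions:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
__kernel void add_one(__global float *buf) {
    int i = get_global_id(0);
    buf[i] += 1.0f;
}
""").build()

host = np.zeros(1024, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=host)

FLUSH_EVERY = 256  # hypothetical threshold; would need tuning per driver
for i in range(5000):
    prg.add_one(queue, host.shape, None, buf)
    # Instead of letting thousands of launches pile up on the host side
    # (which appears to be what exhausts the Mali driver's host memory),
    # hand the accumulated batch to the device every FLUSH_EVERY launches.
    if (i + 1) % FLUSH_EVERY == 0:
        queue.flush()

queue.finish()
cl.enqueue_copy(queue, host, buf)

Whether a flush alone is enough, or a full finish is needed to actually reclaim driver-side memory, would have to be tested against the Mali driver.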

Expected behavior

Models run normally.

Actual behavior

Models get stuck when profiled, or fail at run time with a CL_OUT_OF_HOST_MEMORY error:

2024-02-07 02:39:55.381 INFO load_module /tmp/tmp86s5qxl8/mod.so
arm_release_ver: g13p0-01eac0, rk_so_ver: 6
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/firefly/tvm/python/tvm/driver/tvmc/__main__.py", line 24, in <module>
    tvmc.main.main()
  File "/home/firefly/tvm/python/tvm/driver/tvmc/main.py", line 118, in main
    sys.exit(_main(sys.argv[1:]))
  File "/home/firefly/tvm/python/tvm/driver/tvmc/main.py", line 106, in _main
    return args.func(args)
  File "/home/firefly/tvm/python/tvm/driver/tvmc/runner.py", line 282, in drive_run
    result = run_module(
  File "/home/firefly/tvm/python/tvm/driver/tvmc/runner.py", line 706, in run_module
    times = module.benchmark(dev, number=number, repeat=repeat, end_to_end=end_to_end)
  File "/home/firefly/tvm/python/tvm/contrib/graph_executor.py", line 505, in benchmark
    return self.module.time_evaluator(
  File "/home/firefly/tvm/python/tvm/runtime/module.py", line 397, in evaluator
    blob = feval(*args)
  File "/home/firefly/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/home/firefly/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/home/firefly/tvm/src/runtime/rpc/rpc_module.cc", line 291, in tvm::runtime::RPCWrappedFunc::WrapRemoteReturnToValue(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
    *rv = PackedFunc([wf](TVMArgs args, TVMRetValue* rv) { return wf->operator()(args, rv); });
  File "/home/firefly/tvm/src/runtime/rpc/rpc_module.cc", line 129, in tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
    sess_->CallFunc(handle_, values.data(), type_codes.data(), args.size(), set_return);
  File "/home/firefly/tvm/src/runtime/rpc/rpc_local_session.cc", line 91, in tvm::runtime::LocalSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
    pf->CallPacked(TVMArgs(arg_values, arg_type_codes, num_args), &rv);
  File "/home/firefly/tvm/src/runtime/profiling.cc", line 888, in tvm::runtime::profiling::WrapTimeEvaluator(tvm::runtime::PackedFunc, DLDevice, int, int, int, int, int, int, int, tvm::runtime::PackedFunc)::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
    DeviceAPI::Get(dev)->StreamSync(dev, nullptr);
  File "/home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc", line 387, in tvm::runtime::cl::OpenCLWorkspace::StreamSync(DLDevice, void*)
    OPENCL_CALL(clFinish(this->GetQueue(dev)));
tvm.error.InternalError: Traceback (most recent call last):
  4: tvm::runtime::RPCWrappedFunc::WrapRemoteReturnToValue(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /home/firefly/tvm/src/runtime/rpc/rpc_module.cc:291
  3: tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /home/firefly/tvm/src/runtime/rpc/rpc_module.cc:129
  2: tvm::runtime::LocalSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
        at /home/firefly/tvm/src/runtime/rpc/rpc_local_session.cc:91
  1: tvm::runtime::profiling::WrapTimeEvaluator(tvm::runtime::PackedFunc, DLDevice, int, int, int, int, int, int, int, tvm::runtime::PackedFunc)::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
        at /home/firefly/tvm/src/runtime/profiling.cc:888
  0: tvm::runtime::cl::OpenCLWorkspace::StreamSync(DLDevice, void*)
        at /home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc:387
  File "/home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc", line 387
InternalError: Check failed: (e == CL_SUCCESS) is false: OpenCL Error, code=-6: CL_OUT_OF_HOST_MEMORY

Environment

  • RK3588 SoC with Mali-G610 MP4 GPU (ARM vendor GPU driver, OpenCL 3.0)
  • Debian 11
  • TVM master branch

Steps to reproduce

  • Create a model with many nodes (I do not know how to use the TVM Relay graph API yet, and sadly ChatGPT does not either, so ONNX is used instead):
import onnx
from onnx import helper, TensorProto, numpy_helper
import numpy

cnt = 250

# Create input tensor
input_shape = [1, 1024]
input_name = "input"
input_tensor = helper.make_tensor_value_info(input_name, TensorProto.FLOAT, input_shape)

# Create constant tensor
constant_shape = [1, 1024]
# Use numpy_helper so the constant payload is a properly flattened float32 array
# (passing a 2-D float64 array straight to helper.make_tensor does not work)
initializer = numpy_helper.from_array(
    numpy.ones(constant_shape, dtype=numpy.float32), name="constant"
)

# Create vector add nodes
nodes = []
for i in range(cnt):
    node_name = f"add_{i}"
    node = helper.make_node("Add", inputs=[input_name, "constant"], outputs=[node_name], name=node_name)
    input_name = node_name
    nodes.append(node)
    # add an Identity node after each Add to avoid the OOM
    node_name = f"identity_{i}"
    node = helper.make_node("Identity", inputs=[input_name], outputs=[node_name], name=node_name)
    input_name = node_name

    nodes.append(node)

# Create output tensor
output_name = nodes[-1].output[0]
output_tensor = helper.make_tensor_value_info(output_name, TensorProto.FLOAT, input_shape)

# Set opset version to 16
opset_version = 16
opset_imports = [helper.make_opsetid("", opset_version)]

# Create ONNX model
model = helper.make_model(
    helper.make_graph(
        nodes,
        "vector_add_model",
        [input_tensor],
        [output_tensor],
        initializer=[initializer]
    ),
    producer_name="vector_add_model",
    opset_imports=opset_imports
)

# Save ONNX model to a file
onnx.save(model, "vector_add_model(" + str(cnt*2) + " nodes).onnx")

(or just download vector_add_model.zip)

  • Compile the model with 500 nodes, then profile it:
    tvmc compile --target "opencl -device=mali" --output test500.tar -O 0 --dump-code relay,tir "vector_add_model(500 nodes).onnx"
    python -m tvm.driver.tvmc run --print-time --device cl --repeat 4 --profile ./test500.tar
    (it works)
  • Try again with 2000 nodes: profiling with --profile gets stuck, but a normal run works.
  • Try again with 5000 nodes: profiling gets stuck and a normal run fails.
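
For reference, the subgraph-splitting workaround mentioned earlier can be sketched with onnx.utils.extract_model. The tensor names below (input, identity_<i>) come from the generator script above; the split point and output file names are arbitrary choices:

import onnx.utils

full = "vector_add_model(5000 nodes).onnx"  # generated with cnt = 2500

# First half: from the graph input up to an intermediate Identity output.
onnx.utils.extract_model(full, "part1.onnx",
                         input_names=["input"],
                         output_names=["identity_1249"])

# Second half: from that intermediate tensor to the final output.
onnx.utils.extract_model(full, "part2.onnx",
                         input_names=["identity_1249"],
                         output_names=["identity_2499"])

Each part can then be compiled and run with tvmc separately, feeding part1's output tensor in as part2's input.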

Triage

  • backend: opencl
  • flow: graph

cc @echuraev @elvin-n
