
[Bug] [OpenCL] [Mali]: Models with too many nodes failed to profile/run with CL_OUT_OF_HOST_MEMORY error


  • If a graph executor model has too many nodes, tvmc run with the --profile option hangs; tracing its OpenCL calls shows a CL_OUT_OF_HOST_MEMORY error on the clFlush call that follows a long run of clEnqueueNDRangeKernel / clSetKernelArg calls. Running the same model normally, without profiling, works.

  • If the model has even more nodes, running it normally fails with the same error as well.

  • The cause is likely the same as in https://github.com/apache/tvm/issues/16276

  • The amount of available system RAM does not affect the issue, so it is probably an artificial limit inside the ARM vendor GPU driver?

  • Current workaround: manually split the model graph into smaller subgraphs (a splitting sketch follows the reproduction steps below).

  • Could this be solved by limiting the kernel launch queue size? (A sketch of the idea follows this list.)
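
To illustrate that last point, here is a minimal sketch of bounding the number of in-flight kernel launches. It uses pyopencl purely for illustration (TVM's OpenCL runtime is C++); the kernel, buffer size, and flush interval below are all made-up assumptions:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
__kernel void add_one(__global float *buf) {
    int i = get_global_id(0);
    buf[i] += 1.0f;
}
""").build()

host = np.zeros(1024, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=host)

FLUSH_EVERY = 256  # hypothetical threshold; would need tuning per driver
for i in range(5000):
    prg.add_one(queue, host.shape, None, buf)
    # Instead of letting thousands of launches pile up on the host side
    # (which appears to be what exhausts the Mali driver's host memory),
    # hand the accumulated batch to the device every FLUSH_EVERY launches.
    if (i + 1) % FLUSH_EVERY == 0:
        queue.flush()

queue.finish()
cl.enqueue_copy(queue, host, buf)

Whether a flush alone is enough, or a full finish is needed to actually reclaim driver-side memory, would have to be tested against the Mali driver.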

Expected behavior

Models run normally.

Actual behavior

Models get stuck when profiled, or fail at run time with a CL_OUT_OF_HOST_MEMORY error:

2024-02-07 02:39:55.381 INFO load_module /tmp/tmp86s5qxl8/mod.so
arm_release_ver: g13p0-01eac0, rk_so_ver: 6
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/firefly/tvm/python/tvm/driver/tvmc/__main__.py", line 24, in <module>
    tvmc.main.main()
  File "/home/firefly/tvm/python/tvm/driver/tvmc/main.py", line 118, in main
    sys.exit(_main(sys.argv[1:]))
  File "/home/firefly/tvm/python/tvm/driver/tvmc/main.py", line 106, in _main
    return args.func(args)
  File "/home/firefly/tvm/python/tvm/driver/tvmc/runner.py", line 282, in drive_run
    result = run_module(
  File "/home/firefly/tvm/python/tvm/driver/tvmc/runner.py", line 706, in run_module
    times = module.benchmark(dev, number=number, repeat=repeat, end_to_end=end_to_end)
  File "/home/firefly/tvm/python/tvm/contrib/graph_executor.py", line 505, in benchmark
    return self.module.time_evaluator(
  File "/home/firefly/tvm/python/tvm/runtime/module.py", line 397, in evaluator
    blob = feval(*args)
  File "/home/firefly/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/home/firefly/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/home/firefly/tvm/src/runtime/rpc/rpc_module.cc", line 291, in tvm::runtime::RPCWrappedFunc::WrapRemoteReturnToValue(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
    *rv = PackedFunc([wf](TVMArgs args, TVMRetValue* rv) { return wf->operator()(args, rv); });
  File "/home/firefly/tvm/src/runtime/rpc/rpc_module.cc", line 129, in tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
    sess_->CallFunc(handle_, values.data(), type_codes.data(), args.size(), set_return);
  File "/home/firefly/tvm/src/runtime/rpc/rpc_local_session.cc", line 91, in tvm::runtime::LocalSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
    pf->CallPacked(TVMArgs(arg_values, arg_type_codes, num_args), &rv);
  File "/home/firefly/tvm/src/runtime/profiling.cc", line 888, in tvm::runtime::profiling::WrapTimeEvaluator(tvm::runtime::PackedFunc, DLDevice, int, int, int, int, int, int, int, tvm::runtime::PackedFunc)::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
    DeviceAPI::Get(dev)->StreamSync(dev, nullptr);
  File "/home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc", line 387, in tvm::runtime::cl::OpenCLWorkspace::StreamSync(DLDevice, void*)
    OPENCL_CALL(clFinish(this->GetQueue(dev)));
tvm.error.InternalError: Traceback (most recent call last):
  4: tvm::runtime::RPCWrappedFunc::WrapRemoteReturnToValue(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /home/firefly/tvm/src/runtime/rpc/rpc_module.cc:291
  3: tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
        at /home/firefly/tvm/src/runtime/rpc/rpc_module.cc:129
  2: tvm::runtime::LocalSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
        at /home/firefly/tvm/src/runtime/rpc/rpc_local_session.cc:91
  1: tvm::runtime::profiling::WrapTimeEvaluator(tvm::runtime::PackedFunc, DLDevice, int, int, int, int, int, int, int, tvm::runtime::PackedFunc)::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
        at /home/firefly/tvm/src/runtime/profiling.cc:888
  0: tvm::runtime::cl::OpenCLWorkspace::StreamSync(DLDevice, void*)
        at /home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc:387
  File "/home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc", line 387
InternalError: Check failed: (e == CL_SUCCESS) is false: OpenCL Error, code=-6: CL_OUT_OF_HOST_MEMORY

Environment

  • RK3588 SoC with Mali-G610 MP4 GPU (ARM vendor GPU driver, OpenCL 3.0)
  • Debian 11
  • TVM master branch

Steps to reproduce

  • Create a model with many nodes (I do not know how to use the TVM Relay graph API yet, and sadly ChatGPT does not either, so ONNX is used instead):
import onnx
from onnx import helper, TensorProto, numpy_helper
import numpy

cnt = 250

# Create input tensor
input_shape = [1, 1024]
input_name = "input"
input_tensor = helper.make_tensor_value_info(input_name, TensorProto.FLOAT, input_shape)

# Create constant tensor
constant_shape = [1, 1024]
# Use numpy_helper so the constant payload is a properly flattened float32 array
# (passing a 2-D float64 array straight to helper.make_tensor does not work)
initializer = numpy_helper.from_array(
    numpy.ones(constant_shape, dtype=numpy.float32), name="constant"
)

# Create vector add nodes
nodes = []
for i in range(cnt):
    node_name = f"add_{i}"
    node = helper.make_node("Add", inputs=[input_name, "constant"], outputs=[node_name], name=node_name)
    input_name = node_name
    nodes.append(node)
    # add an Identity node after each Add to avoid the OOM
    node_name = f"identity_{i}"
    node = helper.make_node("Identity", inputs=[input_name], outputs=[node_name], name=node_name)
    input_name = node_name

    nodes.append(node)

# Create output tensor
output_name = nodes[-1].output[0]
output_tensor = helper.make_tensor_value_info(output_name, TensorProto.FLOAT, input_shape)

# Set opset version to 16
opset_version = 16
opset_imports = [helper.make_opsetid("", opset_version)]

# Create ONNX model
model = helper.make_model(
    helper.make_graph(
        nodes,
        "vector_add_model",
        [input_tensor],
        [output_tensor],
        initializer=[initializer]
    ),
    producer_name="vector_add_model",
    opset_imports=opset_imports
)

# Save ONNX model to a file
onnx.save(model, "vector_add_model(" + str(cnt*2) + " nodes).onnx")

(or just download vector_add_model.zip)

  • Compile the model with 500 nodes, then profile it:
    tvmc compile --target "opencl -device=mali" --output test500.tar -O 0 --dump-code relay,tir "vector_add_model(500 nodes).onnx"
    python -m tvm.driver.tvmc run --print-time --device cl --repeat 4 --profile ./test500.tar
    (it works)
  • Try again with 2000 nodes: profiling with --profile gets stuck, but a normal run works.
  • Try again with 5000 nodes: profiling gets stuck and a normal run fails.
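
For reference, the subgraph-splitting workaround mentioned earlier can be sketched with onnx.utils.extract_model. The tensor names below (input, identity_<i>) come from the generator script above; the split point and output file names are arbitrary choices:

import onnx.utils

full = "vector_add_model(5000 nodes).onnx"  # generated with cnt = 2500

# First half: from the graph input up to an intermediate Identity output.
onnx.utils.extract_model(full, "part1.onnx",
                         input_names=["input"],
                         output_names=["identity_1249"])

# Second half: from that intermediate tensor to the final output.
onnx.utils.extract_model(full, "part2.onnx",
                         input_names=["identity_1249"],
                         output_names=["identity_2499"])

Each part can then be compiled and run with tvmc separately, feeding part1's output tensor in as part2's input.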

Triage

  • backend: opencl
  • flow: graph

cc @echuraev @elvin-n
