[Bug] [OpenCL] [Mali]: Models with too many nodes fail to profile/run with CL_OUT_OF_HOST_MEMORY error
- If a Graph executor model has too many nodes, tvmc run with the --profile option hangs; tracing its OpenCL calls shows a CL_OUT_OF_HOST_MEMORY error on the clFlush call that follows a long run of clEnqueueNDRangeKernel/clSetKernelArg calls. Running the same model normally, without profiling, still works.
- If the model has even more nodes, running it normally fails with the same error as well.
- The cause should be generally the same as https://github.com/apache/tvm/issues/16276
- The available system RAM size does not affect the issue, so it is probably an artificial limit inside the ARM vendor GPU driver?
- Current workaround: manually split the model graph into smaller subgraphs (see the sketch after this list).
- Could this be solved by limiting the kernel launch queue size?
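As a concrete example of the workaround, the 500-node chain produced by the reproduction script below can be cut at an intermediate tensor with onnx.utils.extract_model and the two halves compiled and run separately. This is only a sketch: the tensor names ("input", "identity_124", "identity_249") are the ones generated by that script, and a real model would need its own cut points.

import onnx.utils

SRC = "vector_add_model(500 nodes).onnx"

# First half: from the graph input up to the output of the 125th Add/Identity pair
onnx.utils.extract_model(SRC, "first_half.onnx",
                         input_names=["input"], output_names=["identity_124"])

# Second half: from that intermediate tensor to the original graph output
onnx.utils.extract_model(SRC, "second_half.onnx",
                         input_names=["identity_124"], output_names=["identity_249"])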
Expected behavior
Models run normally.
Actual behavior
Models get stuck when profiling, or fail to run with a CL_OUT_OF_HOST_MEMORY error:
2024-02-07 02:39:55.381 INFO load_module /tmp/tmp86s5qxl8/mod.so
arm_release_ver: g13p0-01eac0, rk_so_ver: 6
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/firefly/tvm/python/tvm/driver/tvmc/__main__.py", line 24, in <module>
tvmc.main.main()
File "/home/firefly/tvm/python/tvm/driver/tvmc/main.py", line 118, in main
sys.exit(_main(sys.argv[1:]))
File "/home/firefly/tvm/python/tvm/driver/tvmc/main.py", line 106, in _main
return args.func(args)
File "/home/firefly/tvm/python/tvm/driver/tvmc/runner.py", line 282, in drive_run
result = run_module(
File "/home/firefly/tvm/python/tvm/driver/tvmc/runner.py", line 706, in run_module
times = module.benchmark(dev, number=number, repeat=repeat, end_to_end=end_to_end)
File "/home/firefly/tvm/python/tvm/contrib/graph_executor.py", line 505, in benchmark
return self.module.time_evaluator(
File "/home/firefly/tvm/python/tvm/runtime/module.py", line 397, in evaluator
blob = feval(*args)
File "/home/firefly/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "/home/firefly/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "/home/firefly/tvm/src/runtime/rpc/rpc_module.cc", line 291, in tvm::runtime::RPCWrappedFunc::WrapRemoteReturnToValue(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
*rv = PackedFunc([wf](TVMArgs args, TVMRetValue* rv) { return wf->operator()(args, rv); });
File "/home/firefly/tvm/src/runtime/rpc/rpc_module.cc", line 129, in tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
sess_->CallFunc(handle_, values.data(), type_codes.data(), args.size(), set_return);
File "/home/firefly/tvm/src/runtime/rpc/rpc_local_session.cc", line 91, in tvm::runtime::LocalSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
pf->CallPacked(TVMArgs(arg_values, arg_type_codes, num_args), &rv);
File "/home/firefly/tvm/src/runtime/profiling.cc", line 888, in tvm::runtime::profiling::WrapTimeEvaluator(tvm::runtime::PackedFunc, DLDevice, int, int, int, int, int, int, int, tvm::runtime::PackedFunc)::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
DeviceAPI::Get(dev)->StreamSync(dev, nullptr);
File "/home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc", line 387, in tvm::runtime::cl::OpenCLWorkspace::StreamSync(DLDevice, void*)
OPENCL_CALL(clFinish(this->GetQueue(dev)));
tvm.error.InternalError: Traceback (most recent call last):
4: tvm::runtime::RPCWrappedFunc::WrapRemoteReturnToValue(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
at /home/firefly/tvm/src/runtime/rpc/rpc_module.cc:291
3: tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const
at /home/firefly/tvm/src/runtime/rpc/rpc_module.cc:129
2: tvm::runtime::LocalSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)
at /home/firefly/tvm/src/runtime/rpc/rpc_local_session.cc:91
1: tvm::runtime::profiling::WrapTimeEvaluator(tvm::runtime::PackedFunc, DLDevice, int, int, int, int, int, int, int, tvm::runtime::PackedFunc)::$_0::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
at /home/firefly/tvm/src/runtime/profiling.cc:888
0: tvm::runtime::cl::OpenCLWorkspace::StreamSync(DLDevice, void*)
at /home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc:387
File "/home/firefly/tvm/src/runtime/opencl/opencl_device_api.cc", line 387
InternalError: Check failed: (e == CL_SUCCESS) is false: OpenCL Error, code=-6: CL_OUT_OF_HOST_MEMORY
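For reference, the suspected failure pattern (a very long run of clEnqueueNDRangeKernel/clSetKernelArg calls with no intervening flush, followed by a single clFlush/clFinish) can be sketched outside TVM with pyopencl. This is only a hypothetical illustration: the kernel, launch count, and flush interval are made up, and it has not been verified to trigger the driver limit on the RK3588. It also shows the mitigation suggested above, bounding the number of outstanding launches by flushing every N kernels.

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
__kernel void add_one(__global float *buf) {
    buf[get_global_id(0)] += 1.0f;
}
""").build()

buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=np.zeros(1024, dtype=np.float32))

LAUNCHES = 100000   # hypothetical; large enough to mimic a graph with many nodes
FLUSH_EVERY = 0     # 0 = never flush before the final finish (the pattern seen in the trace)

for i in range(LAUNCHES):
    prg.add_one(queue, (1024,), None, buf)
    # Suggested mitigation: bound the launch queue by flushing every FLUSH_EVERY kernels
    if FLUSH_EVERY and (i + 1) % FLUSH_EVERY == 0:
        queue.flush()

queue.finish()  # with FLUSH_EVERY == 0, this is where the trace above showed CL_OUT_OF_HOST_MEMORY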
Environment
- RK3588 SoC with Mali-G610 MP4 GPU
- ARM vendor GPU driver, OpenCL 3.0
- Debian 11
- TVM master branch
Steps to reproduce
- Create a model with many nodes (I do not know how to use the TVM Relay graph API yet, and sadly ChatGPT does not either, so ONNX is used instead):
import onnx
from onnx import helper, TensorProto, numpy_helper
import numpy

cnt = 250

# Create input tensor
input_shape = [1, 1024]
input_name = "input"
input_tensor = helper.make_tensor_value_info(input_name, TensorProto.FLOAT, input_shape)

# Create constant tensor
constant_shape = [1, 1024]
initializer = helper.make_tensor("constant", TensorProto.FLOAT, constant_shape, numpy.ones(constant_shape))

# Create vector add nodes
nodes = []
for i in range(cnt):
    node_name = f"add_{i}"
    node = helper.make_node("Add", inputs=[input_name, "constant"], outputs=[node_name], name=node_name)
    input_name = node_name
    nodes.append(node)

    # Add an identity node to avoid the OOM
    node_name = f"identity_{i}"
    node = helper.make_node("Identity", inputs=[input_name], outputs=[node_name], name=node_name)
    input_name = node_name
    nodes.append(node)

# Create output tensor
output_name = nodes[-1].output[0]
output_tensor = helper.make_tensor_value_info(output_name, TensorProto.FLOAT, input_shape)

# Set opset version to 16
opset_version = 16
opset_imports = [helper.make_opsetid("", opset_version)]

# Create ONNX model
model = helper.make_model(
    helper.make_graph(
        nodes,
        "vector_add_model",
        [input_tensor],
        [output_tensor],
        initializer=[initializer],
    ),
    producer_name="vector_add_model",
    opset_imports=opset_imports,
)

# Save ONNX model to a file
onnx.save(model, "vector_add_model(" + str(cnt*2) + " nodes).onnx")
(or just download vector_add_model.zip)
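To confirm the generated file really has the expected number of nodes, it can be inspected with the onnx Python API (the file name is the one produced by the script above):

import onnx

model = onnx.load("vector_add_model(500 nodes).onnx")
onnx.checker.check_model(model)
print(len(model.graph.node))  # expect 500 (250 Add + 250 Identity)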
- Compile and profile-run the model with 500 nodes (this works); a Python-API sketch of the same steps follows this list:

tvmc compile --target "opencl -device=mali" --output test500.tar -O 0 --dump-code relay,tir "vector_add_model(500 nodes).onnx"
python -m tvm.driver.tvmc run --print-time --device cl --repeat 4 --profile ./test500.tar

- Try again with 2000 nodes: stuck with --profile, but runs okay without it.
- Try again with 5000 nodes: stuck with --profile, and the normal run fails too.
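The same steps can also be driven from Python via the tvmc API. This is a rough sketch of the equivalent calls for a local run on the board (no RPC); the exact keyword arguments may need adjusting for a particular setup:

from tvm.driver import tvmc

# Load and compile the generated model for the Mali GPU (opt_level=0 matches -O 0 above)
model = tvmc.load("vector_add_model(500 nodes).onnx")
package = tvmc.compile(model, target="opencl -device=mali",
                       opt_level=0, package_path="test500.tar")

# Run on the OpenCL device; profile=True corresponds to the --profile flag
result = tvmc.run(package, device="cl", repeat=4, profile=True)
print(result)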
Triage
- backend: opencl
- flow: graph
cc @echuraev @elvin-n