TensorRT
Engine build failure "cuda misaligned address" in TensorRT 10.1 when running FP16 group normalization with particular values of `num_groups`
Description
I believe this is a regression in TensorRT 10.1's normalization layer. When building a network that contains group normalization in FP16 mode with particular values of num_groups (apparently those that are not a multiple of 8), the build fails with "Cuda Runtime (misaligned address)":
[06/20/2024-17:52:33] [TRT] [E] [builderUtils.cpp::operator()::969] Error Code 1: Cuda Runtime (misaligned address)
[06/20/2024-17:52:33] [TRT] [E] [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (misaligned address)
[06/20/2024-17:52:33] [TRT] [E] [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:432] Error 716 destroying stream '0x559119d1b360'.)
[06/20/2024-17:52:33] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception misaligned address
[06/20/2024-17:52:33] [TRT] [W] Unable to determine GPU memory usage: misaligned address
[06/20/2024-17:52:33] [TRT] [W] Unable to determine GPU memory usage: misaligned address
[06/20/2024-17:52:33] [TRT] [W] Unable to determine GPU memory usage: misaligned address
[06/20/2024-17:52:33] [TRT] [W] Unable to determine GPU memory usage: misaligned address
[06/20/2024-17:52:33] [TRT] [E] Error Code: 9: Skipping tactic 0x0000000000000000 due to exception [autotuner.cpp:autotuner_t:460] In the autotuner, CUDA error 716 from 'cuMemHostAlloc(&reinterpret_cast<void*&>(wait_kerne_host_mem_ptr_), sizeof(*wait_kerne_host_mem_ptr_), CU_MEMHOSTALLOC_DEVICEMAP)': misaligned address.
[06/20/2024-17:52:33] [TRT] [E] [virtualMemoryBuffer.cpp::getFreeMemSize::183] Error Code 1: Cuda Runtime (misaligned address)
[06/20/2024-17:52:33] [TRT] [W] Requested amount of GPU memory (15728640 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[06/20/2024-17:52:33] [TRT] [W] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 15728640 detected for tactic 0x00000000000003e8.
[06/20/2024-17:52:33] [TRT] [E] [virtualMemoryBuffer.cpp::getFreeMemSize::183] Error Code 1: Cuda Runtime (misaligned address)
[06/20/2024-17:52:33] [TRT] [W] Requested amount of GPU memory (15728640 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[06/20/2024-17:52:33] [TRT] [W] UNSUPPORTED_STATE: Skipping tactic 1 due to insufficient memory on requested size of 15728640 detected for tactic 0x00000000000003ea.
[06/20/2024-17:52:33] [TRT] [E] [virtualMemoryBuffer.cpp::getFreeMemSize::183] Error Code 1: Cuda Runtime (misaligned address)
[06/20/2024-17:52:33] [TRT] [W] Requested amount of GPU memory (15728640 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[06/20/2024-17:52:33] [TRT] [W] UNSUPPORTED_STATE: Skipping tactic 2 due to insufficient memory on requested size of 15728640 detected for tactic 0x0000000000000000.
[06/20/2024-17:52:33] [TRT] [E] Error Code: 2: Assertion !cost.empty() failed. Impossible to reformat.
[06/20/2024-17:52:33] [TRT] [E] [scopedCudaResources.cpp::~ScopedCudaStream::43] Error Code 1: Cuda Runtime (misaligned address)
[06/20/2024-17:52:33] [TRT] [E] [optimizer.cpp::computeCosts::4519] Error Code 2: Internal Error (Assertion !cost.empty() failed. Impossible to reformat.)
Environment
TensorRT Version: 10.1
NVIDIA GPU: RTX 3070
NVIDIA Driver Version: 535.129.03
CUDA Version: 12.1
CUDNN Version: NA
Operating System: Ubuntu 22.04
Python Version (if applicable): 3.10
Baremetal or Container (if so, version): Baremetal
Relevant Files
Steps To Reproduce
import tensorrt as trt
import numpy as np
import operator
from functools import reduce
def affine_group_norm(network, x, num_groups, scale, bias, epsilon):
    ranks = len(x.shape)
    # Unit scale / zero bias so the normalization layer only normalizes;
    # the real per-channel affine transform is applied afterwards via a scale layer.
    _shape = [1] * ranks
    _shape[1] = num_groups
    dummy_w = network.add_constant(_shape, np.ones(_shape, dtype=np.float16)).get_output(0)
    dummy_b = network.add_constant(_shape, np.zeros(_shape, dtype=np.float16)).get_output(0)
    # Normalize over the spatial axes (2 .. rank-1).
    axesMask = reduce(operator.or_, (1 << i for i in range(2, ranks)))
    norm_layer = network.add_normalization(x, dummy_w, dummy_b, axesMask=axesMask)
    norm_layer.num_groups = num_groups
    norm_layer.epsilon = epsilon
    output = norm_layer.get_output(0)
    # Per-channel affine: y = (scale * x + shift) ** power, with power = 1.
    power = np.ones_like(scale)
    scale_layer = network.add_scale(output,
                                    trt.ScaleMode.CHANNEL,
                                    shift=trt.Weights(bias),
                                    scale=trt.Weights(scale),
                                    power=trt.Weights(power))
    scale_layer.channel_axis = 1
    output = scale_layer.get_output(0)
    return output


def test_group_norm(num_groups, fp16):
    builder = trt.Builder(trt.Logger())
    config = builder.create_builder_config()
    if fp16:
        config.flags |= 1 << int(trt.BuilderFlag.FP16)
        config.flags |= 1 << int(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
    shape = (2, 320, 64, 64)
    network = builder.create_network(0)
    x = network.add_input('x', trt.float16, shape)
    scale = np.random.randn(shape[1]).astype(np.float16)
    bias = np.random.randn(shape[1]).astype(np.float16)
    y = affine_group_norm(network, x, num_groups, scale, bias, 1e-5)
    network.mark_output(y)
    engine = builder.build_serialized_network(network, config)
    assert engine is not None
    print("pass")


test_group_norm(32, fp16=False)
test_group_norm(10, fp16=False)
test_group_norm(32, fp16=True)
test_group_norm(10, fp16=True)  # fails
Usually uint8_t/int8_t and int16_t arrays are not aligned on 4-byte boundaries.
You can try forcing the appropriate alignment, e.g. change 'np.float16' to 'np.float32' in your affine_group_norm() and test_group_norm().
@lix19937 Here's a repro showing that even when np.float32 is used for dummy_w, dummy_b, scale, and bias, it still fails with Error Code 1: Cuda Runtime (misaligned address).
import tensorrt as trt
import numpy as np
import operator
from functools import reduce
def affine_group_norm(network, x, num_groups, scale, bias, epsilon):
    ranks = len(x.shape)
    _shape = [1] * ranks
    _shape[1] = num_groups
    dummy_w = network.add_constant(_shape, np.ones(_shape, dtype=np.float32)).get_output(0)
    dummy_b = network.add_constant(_shape, np.zeros(_shape, dtype=np.float32)).get_output(0)
    axesMask = reduce(operator.or_, (1 << i for i in range(2, ranks)))
    norm_layer = network.add_normalization(x, dummy_w, dummy_b, axesMask=axesMask)
    norm_layer.num_groups = num_groups
    norm_layer.epsilon = epsilon
    output = norm_layer.get_output(0)
    power = np.ones_like(scale)
    scale_layer = network.add_scale(output,
                                    trt.ScaleMode.CHANNEL,
                                    shift=trt.Weights(bias),
                                    scale=trt.Weights(scale),
                                    power=trt.Weights(power))
    scale_layer.channel_axis = 1
    output = scale_layer.get_output(0)
    return output


def test_group_norm(num_groups, fp16):
    builder = trt.Builder(trt.Logger())
    config = builder.create_builder_config()
    if fp16:
        config.flags |= 1 << int(trt.BuilderFlag.FP16)
        config.flags |= 1 << int(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
    shape = (2, 320, 64, 64)
    network = builder.create_network(0)
    x = network.add_input('x', trt.float16, shape)
    scale = np.random.randn(shape[1]).astype(np.float32)
    bias = np.random.randn(shape[1]).astype(np.float32)
    y = affine_group_norm(network, x, num_groups, scale, bias, 1e-5)
    network.mark_output(y)
    engine = builder.build_serialized_network(network, config)
    assert engine is not None
    print("pass")


test_group_norm(32, fp16=False)
test_group_norm(10, fp16=False)
test_group_norm(32, fp16=True)
test_group_norm(10, fp16=True)  # fails
Can you use the torch API to export an ONNX model of affine_group_norm, then convert it with trtexec? @haijieg
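For reference, a minimal sketch of that route, assuming torch is available (the module construction, file name, opset version, and trtexec flags below are illustrative, not from the original thread):

import torch
import torch.nn as nn

# Same shape and num_groups as the failing TensorRT repro (320 channels, 10 groups).
model = nn.GroupNorm(num_groups=10, num_channels=320, eps=1e-5, affine=True).eval()
x = torch.randn(2, 320, 64, 64)

# Export in FP32 and let trtexec handle the FP16 precision during engine build.
torch.onnx.export(model, x, "affine_group_norm.onnx", opset_version=17)

and then build with something like: trtexec --onnx=affine_group_norm.onnx --fp16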
@lix19937 No, I want to directly control how the network is built using the TRT network/builder API instead of going through ONNX. This is a minimal repro without torch/onnx that shows unexpected behavior of the TRT public API surface. I should not need to go through onnx/torch to build a network correctly without crashing with a CUDA misaligned address error.
Using TRT 10.8 I'm unable to repro the issue; all test cases pass.
@haijieg can you confirm that 10.8 works for you?
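For anyone re-testing, a quick way (illustrative, not from the original thread) to confirm which TensorRT build is actually being picked up:

import tensorrt as trt
print(trt.__version__)  # should report 10.8.x when verifying against the newer release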
Closing due to inactivity. Please feel free to reopen!