🐛 [Bug] Segmentation Fault When Trying to Quantize ResNet50 model
Bug Description
I'm using torch_tensorrt to try to quantize a pretrained ResNet50 model (roughly following the steps here), but I am getting a segmentation fault. I've tried running the code on two different machines using the latest docker image here but get the same segmentation fault on both. Also, when I try to compile the model to TensorRT with fp16 instead of quantizing, it works fine.
To Reproduce
Reduced example code:
main.py:
import torch
import torch.utils.data as data
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch_tensorrt
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from cifar10_models.resnet import resnet50
from utils import calibrate_model

testing_dataset = datasets.CIFAR10(root='./data',
                                   train=False,
                                   download=True,
                                   transform=transforms.Compose([
                                       transforms.ToTensor(),
                                       transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
                                   ]))

testing_dataloader = torch.utils.data.DataLoader(testing_dataset,
                                                 batch_size=16,
                                                 shuffle=False,
                                                 num_workers=2)

# Initialize quantization and calibrate the model
quant_modules.initialize()
model_quantized = resnet50(pretrained=True).cuda().eval()
calibrate_model(model_quantized, testing_dataloader, num_calib_batch=32, calibrator="max")

# Export the model to TorchScript
quant_nn.TensorQuantizer.use_fb_fake_quant = True
with torch.no_grad():
    data = iter(testing_dataloader)
    images, _ = next(data)
    jit_model = torch.jit.trace(model_quantized, images.to("cuda"))

# Build the TensorRT module
compile_spec = {"inputs": [torch_tensorrt.Input([16, 3, 32, 32])],
                "enabled_precisions": torch.int8}
model_tensorrt = torch_tensorrt.compile(jit_model, **compile_spec)
with utils.py containing:
import torch
from tqdm import tqdm
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import calib


def collect_stats(model, data_loader, num_batches):
    """Feed data to the network and collect statistics."""
    # Enable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.disable_quant()
                module.enable_calib()
            else:
                module.disable()

    for i, (image, _) in tqdm(enumerate(data_loader), total=num_batches):
        model(image.cuda())
        if i >= num_batches:
            break

    # Disable calibrators
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                module.enable_quant()
                module.disable_calib()
            else:
                module.enable()


def compute_amax(model, **kwargs):
    # Load calibration results
    for name, module in model.named_modules():
        if isinstance(module, quant_nn.TensorQuantizer):
            if module._calibrator is not None:
                if isinstance(module._calibrator, calib.MaxCalibrator):
                    module.load_calib_amax()
                else:
                    module.load_calib_amax(**kwargs)
            print(F"{name:40}: {module}")
    model.cuda()


def calibrate_model(model, data_loader, num_calib_batch, calibrator, hist_percentile=None):
    if num_calib_batch > 0:
        print("Calibrating model")
        with torch.no_grad():
            collect_stats(model, data_loader, num_calib_batch)
        if calibrator == "percentile":
            compute_amax(model, method=calibrator, percentile=hist_percentile)
        else:
            compute_amax(model, method=calibrator)
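(Side note on calibrate_model: the "percentile" branch is only meaningful if the quantizers were configured with histogram calibrators rather than the default max calibrator. A hypothetical call, not used in this repro, would look like:

# Hypothetical percentile calibration -- requires histogram calibrators,
# e.g. QuantDescriptor(calib_method="histogram") set as the default input
# quant descriptor before the model is created.
calibrate_model(model_quantized, testing_dataloader,
                num_calib_batch=32,
                calibrator="percentile",
                hist_percentile=99.99)
)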
When I run main.py, it gets to the point of compiling the model and then I get:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
If I remove quant_modules.initialize() and calibrate_model(...) and instead change the enabled precision to torch.float16, the model compiles without any error.
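In other words, the following fp16 variant compiles cleanly (a minimal sketch of the working path; the random dummy batch is only for illustration, the real script traces with CIFAR-10 images):

import torch
import torch_tensorrt
from cifar10_models.resnet import resnet50

# No quant_modules.initialize() and no calibration here
model = resnet50(pretrained=True).cuda().eval()
dummy = torch.randn(16, 3, 32, 32, device="cuda")  # placeholder batch for tracing
with torch.no_grad():
    jit_model = torch.jit.trace(model, dummy)

# fp16 instead of int8 -- this path compiles without a segfault
model_tensorrt = torch_tensorrt.compile(
    jit_model,
    inputs=[torch_tensorrt.Input([16, 3, 32, 32])],
    enabled_precisions={torch.float16},
)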
Expected behavior
I would expect the int8-quantized model to compile without issue.
Environment
Using the NVIDIA docker image (22.02-py3). Tested on an RTX 3090 GPU and a GTX 1650.
Additional context
@peri044 Can you take a look at this?
Might be related to #898? @henrycharlesworth Can you try building and running Torch-TensorRT with the TensorRT NGC containers?
I was able to bypass my issue using nvcr.io/nvidia/tensorrt:22.02-py3, PyTorch 1.10, and Torch-TensorRT commit 11bcb98d on master.
I'm likewise getting a segmentation fault in torch_tensorrt.compile when trying to convert a model to int8. The issue does not occur with float16 or float32. I haven't tried building from source yet with debugging symbols, but gdb tracked it to libtorchtrt.so and torch_tensorrt::core::MapInputsAndDetermineDTypes(). I'm on Torch 1.11.0+cu115, Torch-TensorRT 1.1.0.
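As a lighter-weight check than attaching gdb, Python's built-in faulthandler at least confirms which Python call the crash happens under (just a debugging aid, not a fix):

import faulthandler
faulthandler.enable()  # on SIGSEGV, dump the Python-level traceback to stderr

# ... build jit_model and compile_spec exactly as in main.py above ...
# The dump then points at the compile call as the faulting frame.
model_tensorrt = torch_tensorrt.compile(jit_model, **compile_spec)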
I traced the segfault in my case to line 314 here: https://github.com/pytorch/TensorRT/blob/40f8b44d95e1bf0912757377eb6acba666963e9d/core/compiler.cpp#L311-L316
As far as I can tell, first_use_type_map lacks the key in, so calling ->second on the result of .find(in) (which is the end iterator) is invalid. The code appears to be trying to check for this case, but by that point it is too late.
I have a very hasty patch that gets me past this point (I'll do a PR if anyone wants, but I don't know if I'm actually solving much), but it then just leads me to https://github.com/pytorch/TensorRT/issues/922.
diff --git a/core/compiler.cpp b/core/compiler.cpp
index b684b808..0d82bf11 100644
--- a/core/compiler.cpp
+++ b/core/compiler.cpp
@@ -311,8 +311,9 @@ void MapInputsAndDetermineDTypes(
for (auto& in : g->inputs()) {
if (static_params.find(in) == static_params.end()) {
ir::Input& spec = cfg.convert_info.inputs.find(in)->second;
- auto est_type_opt = first_use_type_map.find(in)->second;
- if (est_type_opt && !spec.dtype_is_user_defined) {
+ auto count = first_use_type_map.count(in);
+ if (count && !spec.dtype_is_user_defined) {
+ auto est_type_opt = first_use_type_map.find(in)->second;
// If we can calculate the type from the graph and the type was not defined by the user then use the calculated
// type
LOG_INFO(
@@ -320,17 +321,18 @@ void MapInputsAndDetermineDTypes(
<< in->debugName() << " has type " << est_type_opt.value()
<< ". If this is incorrect explicitly set dtype for input and file a bug");
spec.dtype = util::ScalarTypeToTRTDataType(est_type_opt.value());
- } else if (!est_type_opt && !spec.dtype_is_user_defined) {
+ } else if (!count && !spec.dtype_is_user_defined) {
// If we cannot calculate the type and the user did not define the type, then default to FP32
LOG_WARNING(
"Cannot infer input type from calcuations in graph for input "
<< in->debugName() << ". Assuming it is Float32. If not, specify input type explicity");
spec.dtype = nvinfer1::DataType::kFLOAT;
} else if (spec.dtype_is_user_defined && cfg.partition_info.enabled) {
- if (!est_type_opt) {
+ if (!count) {
LOG_INFO("Cannot infer input tensor dtype in graph. Using user provided input dtype settings");
first_use_type_map[in] = {util::TRTDataTypeToScalarType(cfg.convert_info.inputs.find(in)->second.dtype)};
} else {
+ auto est_type_opt = first_use_type_map.find(in)->second;
if (util::TRTDataTypeToScalarType(cfg.convert_info.inputs.find(in)->second.dtype) != est_type_opt.value()) {
std::stringstream ss;
ss << "For input " << in->debugName() << ", found user specified input dtype as ";
Hi,
We ran into the int8 bug too, in the official docker image (version 22.05). The initial segfault is solved by @Hodapp87's patch, but in our case it leads to a different error, not #922 as reported by @Hodapp87.
This is the exception traceback:
Traceback (most recent call last):
File "./main.py", line 19, in <module>
trt_ts_module = torch_tensorrt.compile(
File "/usr/local/lib/python3.8/dist-packages/torch_tensorrt/_compile.py", line 109, in compile
return torch_tensorrt.ts.compile(ts_mod, inputs=inputs, enabled_precisions=enabled_precisions, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch_tensorrt/ts/_compiler.py", line 113, in compile
compiled_cpp_mod = _C.compile_graph(module._c, _parse_compile_spec(spec))
RuntimeError: [Error thrown at core/conversion/var/Var.cpp:132] Expected isITensor() to be true but got false
Requested ITensor from Var, however Var type is c10::IValue
Does anyone know how to solve this?
This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
@peri044 Any news on this issue? We still cannot use our models with int8 precision because of this bug.
@peri044 can we please confirm the PTQ notebook is working properly, then go after this bug? P1
This issue has not seen activity for 90 days. Remove the stale label or comment, or this will be closed in 10 days.
We think this is fixed. Dheeraj to check.
Thanks, I will check as soon as possible.
@ivan94fi have you been able to check? We would like to close this out.
Hi, I can confirm that our model is now correctly converted when using int8 precision with version 1.3.0 of Torch-TensorRT. Thank you!
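For anyone verifying the fix on their own model, a quick sanity check along these lines works (a minimal sketch, not the exact script used here; jit_model and testing_dataloader are the objects from the repro above):

import torch
import torch_tensorrt

# Same int8 spec as in main.py, now against Torch-TensorRT 1.3.0
trt_model = torch_tensorrt.compile(
    jit_model,
    inputs=[torch_tensorrt.Input([16, 3, 32, 32])],
    enabled_precisions={torch.int8},
)

# int8 is lossy, so compare top-1 predictions rather than raw logits
images, _ = next(iter(testing_dataloader))
images = images.to("cuda")
with torch.no_grad():
    ref = jit_model(images)
    out = trt_model(images)
agreement = (ref.argmax(dim=1) == out.argmax(dim=1)).float().mean()
print(f"top-1 agreement (TorchScript vs TensorRT int8): {agreement:.2%}")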