RuntimeError: r INTERNAL ASSERT FAILED at "../aten/src/ATen/core/jit_type_base.h":545, please report a bug to PyTorch.
🐛 Describe the bug
Hello there, I am trying to convert the MAXIM model to ONNX with dynamic input shapes. I first script my model using torch.jit.script and then try to convert the scripted model to ONNX. During the conversion I get the error shown in the title.
To convert my model I use the following code:
import logging
import os

import onnx
import onnxruntime as ort
import torch
from torch import onnx as torch_onnx
from torch.onnx import TrainingMode

from maxim import MAXIM  # model definition from my codebase; import path abbreviated here

logger = logging.getLogger(__name__)


class Compile:
    def __init__(self, configs):
        self.configs = configs
        onnx_folder = os.path.join(self.configs.paths.result_path, self.configs.paths.onnx_folder)
        if not os.path.exists(onnx_folder):
            os.makedirs(onnx_folder)
        # self.mode is set elsewhere in my code (omitted here for brevity)
        self.onnx_path = os.path.join(onnx_folder, f"{self.mode}_{self.configs.paths.onnx_path}")
        self.__load_model()
        self.convert()
        self.__create_session()

    def __load_model(self):
        self.device = 'cuda'
        self.original_model = MAXIM(num_stages=2, num_supervision_scales=1).to(self.device)
        with torch.no_grad():
            self.original_model.eval()
            self.scripted_model = torch.jit.script(self.original_model)
        logger.info("model was loaded successfully")

    def __check_onnx(self):
        model = onnx.load(self.onnx_path)
        try:
            onnx.checker.check_model(model)
        except Exception as e:
            logger.error(f"{e}")
        else:
            logger.info("passed successfully")

    def convert(self):
        logger.info("converting to onnx ...")
        input_names = ["input"]
        output_names = ["output"]
        self.dummy_input = torch.rand(1, 3, self.configs.compile.w_input, self.configs.compile.h_input)
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        with torch.no_grad():
            self.scripted_model.eval()
            dynamic_axes = {"input": {0: 'batch', 1: 'channels', 2: 'width', 3: 'height'},
                            "output": {0: 'batch', 1: 'channels', 2: 'width', 3: 'height'}}  # adding names for better debugging
            torch_onnx.export(self.scripted_model, self.dummy_input.to("cuda"), self.onnx_path, verbose=True,
                              input_names=input_names,
                              output_names=output_names,
                              dynamic_axes=dynamic_axes,
                              do_constant_folding=True,
                              opset_version=13,
                              training=TrainingMode.EVAL)
        self.__check_onnx()
        logger.info("______________module was converted to onnx successfully___________")

    def __create_session(self):
        session_options = ort.SessionOptions()
        session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        provider_options = None
        providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
        self.session = ort.InferenceSession(self.onnx_path, sess_options=session_options,
                                            provider_options=provider_options, providers=providers)
Versions
PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.35

Python version: 3.8.15 (default, Nov 24 2022, 15:19:38) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.55
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 510.39.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.4.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu113
[pip3] torchmetrics==0.9.3
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.13.1+cu113
[conda] mxnet-mkl 1.6.0 pypi_0 pypi
[conda] numpy 1.23.1 pypi_0 pypi
[conda] pytorch-quantization 2.1.2 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 1.12.1+cu113 pypi_0 pypi
[conda] torchaudio 0.12.1+cu113 pypi_0 pypi
[conda] torchmetrics 0.9.3 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.13.1+cu113 pypi_0 pypi
cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel
I also get the same error using PyTorch 1.13. The environment is identical to the one above except for:

PyTorch version: 1.13.1+cu116
CUDA used to build PyTorch: 11.6
CUDA_MODULE_LOADING set to: LAZY
[pip3] torch==1.13.1+cu116
[pip3] torchaudio==0.13.1+cu116
[pip3] torchvision==0.14.1+cu116
[conda] torch 1.13.1+cu116 pypi_0 pypi
[conda] torchaudio 0.13.1+cu116 pypi_0 pypi
[conda] torchvision 0.14.1+cu116 pypi_0 pypi
I fixed this error by converting all the // divisions in my code to /, but now I get the following error:
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./onnx/results/onnx_models/maxim_model.onnx failed:This is an invalid model. Type Error: Type 'tensor(float)' of input parameter (/unetencoderblock00/residualsplitheadmultiaxisgmlpLayer/Div_1_output_0) of operator (SplitToSequence) in node (/unetencoderblock00/residualsplitheadmultiaxisgmlpLayer/SplitToSequence) is invalid.
It seems the export still forces a cast to float in some places. Although I cast to int in many parts of my code, the ONNX graph shows the value cast to float. The same thing happens around Mul nodes: I multiply two float numbers and then cast the result to int, but that cast either does not happen or is forced back to float.
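One way to see where the float type sneaks in is to walk the exported graph and print the inputs of the suspicious node types. A small diagnostic sketch using the onnx package (the model path is the one from the error above):

import onnx

model = onnx.load("./onnx/results/onnx_models/maxim_model.onnx")
for node in model.graph.node:
    # print the node kinds involved in the type error above
    if node.op_type in ("SplitToSequence", "Cast", "Mul"):
        print(node.op_type, node.name, list(node.input))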
I found what the problem was. Apparently, when a value is cast to int with int(value), it shows up as a float in the ONNX graph. I removed all the int() casts around / divisions and used // instead, and now my problem is solved.
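In code, the problematic pattern looks roughly like this (a hypothetical toy module for illustration, not the actual MAXIM code, so it may not reproduce the exact SplitToSequence error):

import torch

class CastRepro(torch.nn.Module):
    def forward(self, x: torch.Tensor):
        # problematic: a / b yields a float, and the int() cast reportedly
        # does not survive scripted ONNX export, so the split size can reach
        # a node like SplitToSequence typed as tensor(float)
        half = int(x.shape[2] / 2)
        # workaround: floor division keeps the value an int end to end
        # half = x.shape[2] // 2
        return torch.split(x, half, dim=2)

scripted = torch.jit.script(CastRepro().eval())
torch.onnx.export(scripted, torch.rand(1, 3, 8, 8), "cast_repro.onnx",
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {2: "height"}},
                  opset_version=13)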
Could you provide a full repro? The provided snippet is not enough to copy-paste and run :) In fact, there is no model in the bug report to debug.
Thank you very much for your answer. I found the issue. It seems that whenever I used int() to cast a variable to an integer (for example int(a/b)), the value is cast to float in the ONNX graph. When I omit int() and use // instead of /, my problem is solved.
Great that you figured it out for your model. We are looking for an opportunity to fix this on the PyTorch side to prevent users from having to change their code, so if you could share a repro, we could try to get this fixed.
Closing due to lack of repro