
RuntimeError: r INTERNAL ASSERT FAILED at "../aten/src/ATen/core/jit_type_base.h":545, please report a bug to PyTorch.

Open fatemebafghi opened this issue 2 years ago • 3 comments

🐛 Describe the bug

Hello there, I am trying to convert the MAXIM model to ONNX with dynamic input shapes. I first script the model using torch.jit.script and then try to export the scripted model to ONNX. During conversion I get the following error:

[Screenshot from 2023-01-08 showing the RuntimeError traceback from the issue title]

To convert my model I use the following code:

import logging
import os

import onnx
import onnxruntime as ort
import torch
import torch.onnx as torch_onnx
from torch.onnx import TrainingMode

# MAXIM is the author's model definition; it is not included in the report.
# from maxim import MAXIM

# Stand-in logger; the original snippet may use a different logging setup.
logger = logging.getLogger(__name__)


class Compile:
    def __init__(self, configs, mode):
        self.configs = configs
        self.mode = mode  # prefix for the exported file name (read but never set in the original snippet)

        onnx_folder = os.path.join(self.configs.paths.result_path, self.configs.paths.onnx_folder)
        os.makedirs(onnx_folder, exist_ok=True)

        self.onnx_path = os.path.join(onnx_folder, f"{self.mode}_{self.configs.paths.onnx_path}")

        self.__load_model()
        self.convert()
        self.__create_session()

    def __load_model(self):
        self.device = 'cuda'
        self.original_model = MAXIM(num_stages=2, num_supervision_scales=1).to(self.device)

        with torch.no_grad():
            self.original_model.eval()
            self.scripted_model = torch.jit.script(self.original_model)
            logger.info("model was loaded successfully")

    def __check_onnx(self):
        model = onnx.load(self.onnx_path)
        try:
            onnx.checker.check_model(model)
        except Exception as e:
            logger.error(f"{e}")
        else:
            logger.info("passed successfully")

    def convert(self):
        logger.info("converting to onnx ...")
        input_names = ["input"]
        output_names = ["output"]
        self.dummy_input = torch.rand(1, 3, self.configs.compile.w_input, self.configs.compile.h_input)
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

        with torch.no_grad():
            self.scripted_model.eval()
            dynamic_axes = {"input": {0: 'batch', 1: 'channels', 2: 'width', 3: 'height'},
                            "output": {0: 'batch', 1: 'channels', 2: 'width', 3: 'height'}}  # adding names for better debugging

            torch_onnx.export(self.scripted_model, self.dummy_input.to("cuda"), self.onnx_path, verbose=True,
                              input_names=input_names,
                              output_names=output_names,
                              dynamic_axes=dynamic_axes,
                              do_constant_folding=True,
                              opset_version=13,
                              training=TrainingMode.EVAL
                              )
            self.__check_onnx()
        logger.info("______________module was converted to onnx successfully___________ ")


    def __create_session(self):
        session_options = ort.SessionOptions()
        session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        provider_options = None
        providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
        self.session = ort.InferenceSession(self.onnx_path, sess_options=session_options,
                                            provider_options=provider_options, providers=providers)
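
For context, the class is invoked roughly like this; a hypothetical sketch mirroring only the attributes the code reads (all values are placeholders, not the real config):

from types import SimpleNamespace

# Placeholder config exposing only the attributes Compile actually reads.
configs = SimpleNamespace(
    paths=SimpleNamespace(
        result_path="./onnx/results",  # output root
        onnx_folder="onnx_models",     # subfolder for .onnx files
        onnx_path="maxim_model.onnx",  # base file name
    ),
    compile=SimpleNamespace(w_input=256, h_input=256),  # dummy-input spatial size
)

Compile(configs, mode="maxim")  # exports the model and opens an ORT session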


Versions

PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.35

Python version: 3.8.15 (default, Nov 24 2022, 15:19:38) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.55
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 510.39.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.4.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu113
[pip3] torchmetrics==0.9.3
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.13.1+cu113
[conda] mxnet-mkl 1.6.0 pypi_0 pypi
[conda] numpy 1.23.1 pypi_0 pypi
[conda] pytorch-quantization 2.1.2 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 1.12.1+cu113 pypi_0 pypi
[conda] torchaudio 0.12.1+cu113 pypi_0 pypi
[conda] torchmetrics 0.9.3 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.13.1+cu113 pypi_0 pypi

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel

fatemebafghi avatar Jan 08 '23 09:01 fatemebafghi

I also get the same error using PyTorch 1.13:

PyTorch version: 1.13.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.1 LTS (x86_64)
GCC version: (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.35

Python version: 3.8.15 (default, Nov 24 2022, 15:19:38) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.55
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 510.39.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.4.1
/usr/local/cuda-11.6/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.4.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.1
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==2.0.0+0d7e753227
[pip3] torch==1.13.1+cu116
[pip3] torchaudio==0.13.1+cu116
[pip3] torchmetrics==0.9.3
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.14.1+cu116
[conda] mxnet-mkl 1.6.0 pypi_0 pypi
[conda] numpy 1.23.1 pypi_0 pypi
[conda] pytorch-quantization 2.1.2 pypi_0 pypi
[conda] pytorch-triton 2.0.0+0d7e753227 pypi_0 pypi
[conda] torch 1.13.1+cu116 pypi_0 pypi
[conda] torchaudio 0.13.1+cu116 pypi_0 pypi
[conda] torchmetrics 0.9.3 pypi_0 pypi
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.14.1+cu116 pypi_0 pypi

fatemebafghi avatar Jan 08 '23 09:01 fatemebafghi

I fixed this error by converting all the // divisions in my code to /, but now I get the following error:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./onnx/results/onnx_models/maxim_model.onnx failed:This is an invalid model. Type Error: Type 'tensor(float)' of input parameter (/unetencoderblock00/residualsplitheadmultiaxisgmlpLayer/Div_1_output_0) of operator (SplitToSequence) in node (/unetencoderblock00/residualsplitheadmultiaxisgmlpLayer/SplitToSequence) is invalid.

fatemebafghi avatar Jan 08 '23 09:01 fatemebafghi

I think it still force-casts to float at some points. Although I cast to int in many parts of my code, I can still see in the ONNX graph that values have been cast to float. It also happens in Mul nodes, where I multiply two float numbers and then cast the result to int; it seems that this cast either does not happen or is forced back to float.
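
A toy illustration of the kind of pattern I mean (purely illustrative, not a confirmed repro):

import torch

class SplitHalf(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # shape[1] is an int, "/ 2" promotes it to float, and int() casts it
        # back; the report is that the exported graph nevertheless shows a
        # tensor(float) feeding the split.
        half = int(x.shape[1] / 2)
        return torch.cat(x.split(half, dim=1), dim=0)

scripted = torch.jit.script(SplitHalf())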

fatemebafghi avatar Jan 08 '23 11:01 fatemebafghi

I found what the problem was. Apparently, when we cast a value to int using int(value), it turns into a float in the ONNX graph. I removed all the int() casts around / divisions and used // instead, and now my problem is solved.
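
For reference, a minimal sketch of the change (names are illustrative):

# before: float division plus a Python int() cast; ends up as float in the graph
half = int(x.shape[1] / 2)

# after: floor division stays integral through scripting and export
half = x.shape[1] // 2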

fatemebafghi avatar Jan 22 '23 07:01 fatemebafghi

Could you provide a full repro? The provided snippet is not enough to copy-paste-run :) In fact, there is no model in the bug report to debug.

thiagocrepaldi avatar Jan 23 '23 22:01 thiagocrepaldi

Thank you very much for your answer. I found the issue. It seems that whenever I used int() to cast a variable to an integer (for example int(a/b)), it is cast to float in the ONNX graph. When I omit int() and use // instead of /, my problem is solved.

fatemebafghi avatar Jan 24 '23 06:01 fatemebafghi

> I found the issue. It seems that whenever I used int() to cast a variable to an integer (for example int(a/b)), it is cast to float in the ONNX graph. When I omit int() and use // instead of /, my problem is solved.

Great that you figured it out for your model. We are looking for an opportunity to fix this on the PyTorch side so that users do not have to change their code, so if you could share a repro, we could try to get this fixed.

thiagocrepaldi avatar Jan 24 '23 19:01 thiagocrepaldi

Closing due to lack of repro

thiagocrepaldi avatar Feb 15 '23 19:02 thiagocrepaldi