[Bug]: Pytorch model converted to openvino returns zeros on Intel GPU

Open DavitGrigoryan132 opened this issue 1 year ago • 2 comments

OpenVINO Version

2024.0.0-14473-3238290df5e

Operating System

Other (Please specify in description)

Device used for inference

GPU

Framework

PyTorch

Model used

No response

Issue description

I created a custom model to reproduce the bug. The model takes a tensor of shape (1, 4096, 4096), reshapes it to (1, 64, 64, 64, 64), runs some other operations on that tensor, and then reshapes it back to (1, 4096, 4096). With these reshapes in place, the model returns zeros as an output on GPU. If I comment out the reshaping lines, everything works well.

class CustomModel(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def buggy_method(self, mask):
        mask = mask.view(mask.shape[0], 64, 64, 64, 64)

        # running some calculations

        mask = mask.view(mask.shape[0], 64 * 64, 64 * 64)

        return mask

    # x has shape (1, 4096, 4096)
    def forward(self, x):
        mask = (x > 0.5).float()

        # If I comment out this call, everything works well
        mask = self.buggy_method(mask)

        mask = mask \
               * (x == x.max(dim=2, keepdim=True)[0]) \
               * (x == x.max(dim=1, keepdim=True)[0])

        a, _ = mask.max(dim=2)
        b = a[0] * torch.arange(0, a.shape[1])

        return a, b

This is my custom model. Here a returns valid values, but b returns all zeros on GPU, while both are correct on CPU. If I comment out the line that calls self.buggy_method, everything works well.
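For reference, the expected behavior can be sketched in plain NumPy, independently of OpenVINO. This is only an illustration: `reference_forward`, its `side` and `use_reshape` parameters, and the reduced (1, 16, 16) random input are assumptions made here, not part of the original report. Since nothing modifies the data between the two reshapes, the round trip should leave both outputs unchanged.

```python
import numpy as np


def reference_forward(x, side=64, use_reshape=True):
    """NumPy re-implementation of CustomModel.forward (illustrative, not OpenVINO code)."""
    mask = (x > 0.5).astype(np.float32)
    if use_reshape:
        n = mask.shape[0]
        # Same round trip as buggy_method: the data is not modified in between.
        mask = mask.reshape(n, side, side, side, side)
        mask = mask.reshape(n, side * side, side * side)
    mask = (
        mask
        * (x == x.max(axis=2, keepdims=True))
        * (x == x.max(axis=1, keepdims=True))
    )
    a = mask.max(axis=2)
    b = a[0] * np.arange(a.shape[1])
    return a, b


# Hypothetical small input: 16 = 4 * 4 stands in for 4096 = 64 * 64.
rng = np.random.default_rng(0)
x = rng.random((1, 16, 16), dtype=np.float32)

a1, b1 = reference_forward(x, side=4, use_reshape=True)
a2, b2 = reference_forward(x, side=4, use_reshape=False)
print(np.array_equal(a1, a2), np.array_equal(b1, b2))  # prints: True True
```

Any backend that implements the reshape correctly should agree with this: the outputs with and without the round trip are identical.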

Step-by-step reproduction

This is my script for reproducing the bug. The script and the input tensor are also available in this Google Drive

import torch
import openvino as ov
import numpy as np


class CustomModel(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def buggy_method(self, mask):
        mask = mask.view(mask.shape[0], 64, 64, 64, 64)

        # running some calculations

        mask = mask.view(mask.shape[0], 64 * 64, 64 * 64)

        return mask

    # x has shape (1, 4096, 4096)
    def forward(self, x):
        mask = (x > 0.5).float()

        # If I comment out this call, everything works well
        mask = self.buggy_method(mask)

        mask = mask \
               * (x == x.max(dim=2, keepdim=True)[0]) \
               * (x == x.max(dim=1, keepdim=True)[0])

        a, _ = mask.max(dim=2)
        b = a[0] * torch.arange(0, a.shape[1])

        return a, b

if __name__ == "__main__":
    model = CustomModel()
    model = model.to("cpu")
    model = model.eval()

    dummy_input = torch.load("conf_matrix.pt")

    with torch.no_grad():
        _ = model(dummy_input)


    core = ov.Core()
    ov_model = ov.convert_model(model, example_input=dummy_input)

    compiled_model_cpu = core.compile_model(model=ov_model, device_name="CPU", config={"INFERENCE_PRECISION_HINT": ov.Type.f32})
    compiled_model_gpu = core.compile_model(model=ov_model, device_name="GPU.0", config={"INFERENCE_PRECISION_HINT": ov.Type.f32})

    openvino_dummy_input = dummy_input.numpy()

    output_cpu = compiled_model_cpu(openvino_dummy_input)
    output_gpu = compiled_model_gpu(openvino_dummy_input)

    print(output_cpu[0])
    print(output_gpu[0])
    print(np.all(np.abs(output_cpu[0] - output_gpu[0]) < 1e-5))

    print(output_cpu[1])
    print(output_gpu[1])
    print(output_gpu[1].max())
    print(np.all(np.abs(output_cpu[1] - output_gpu[1]) < 1e-5))

When I run this code, I get the following output:

[[1. 1. 1. ... 1. 1. 1.]]
[[1. 1. 1. ... 1. 1. 1.]]
True
[0.000e+00 1.000e+00 2.000e+00 ... 4.093e+03 4.094e+03 4.095e+03]
[0. 0. 0. ... 0. 0. 0.]
0.0
False

Here the values returned for a on CPU and GPU are the same, but the values returned for b on GPU are all zeros. If I comment out the line that calls self.buggy_method, I get this output:

[[1. 1. 1. ... 1. 1. 1.]]
[[1. 1. 1. ... 1. 1. 1.]]
True
[0.000e+00 1.000e+00 2.000e+00 ... 4.093e+03 4.094e+03 4.095e+03]
[0.000e+00 1.000e+00 2.000e+00 ... 4.093e+03 4.094e+03 4.095e+03]
4095.0
True

And everything works well
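When comparing device outputs like these, a check on the absolute difference is safer than a signed one, because a signed difference passes whenever the second array holds the larger values. The helper below is a minimal sketch of such a comparison. The name `compare_outputs` and the stand-in arrays are hypothetical: the arrays only mimic the shape of the reported b outputs and are not real device results.

```python
import numpy as np


def compare_outputs(cpu_out, gpu_out, atol=1e-5):
    """Return (match, max_abs_diff) for two arrays of the same shape."""
    cpu_out = np.asarray(cpu_out, dtype=np.float64)
    gpu_out = np.asarray(gpu_out, dtype=np.float64)
    max_abs_diff = float(np.max(np.abs(cpu_out - gpu_out)))
    return max_abs_diff <= atol, max_abs_diff


# Stand-in arrays mimicking the reported `b` outputs (not real device results).
cpu_b = np.arange(4096, dtype=np.float32)
gpu_b = np.zeros(4096, dtype=np.float32)

print(compare_outputs(cpu_b, gpu_b))  # prints: (False, 4095.0)
print(compare_outputs(gpu_b, cpu_b))  # prints: (False, 4095.0)

# A signed check with the arguments in this order would not flag the mismatch.
print(bool(np.all(gpu_b - cpu_b < 1e-5)))  # prints: True
```

Reporting the maximum absolute difference alongside the boolean also shows how far apart the two devices are, which a plain True/False does not.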

Relevant log output

[[1. 1. 1. ... 1. 1. 1.]]
[[1. 1. 1. ... 1. 1. 1.]]
True
[0.000e+00 1.000e+00 2.000e+00 ... 4.093e+03 4.094e+03 4.095e+03]
[0. 0. 0. ... 0. 0. 0.]
0.0
False

Issue submission checklist

  • [X] I'm reporting an issue. It's not a question.
  • [X] I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • [X] There is reproducer code and related data files such as images, videos, models, etc.

DavitGrigoryan132, Feb 17 '24 10:02

I ran the main.py script with and without Line 24 (the self.buggy_method call) and encountered the same issue as you did.

Line 24 commented out: [screenshot: comment line24]

Line 24 uncommented: [screenshot: uncomment line24]

We'll investigate the issue and update you as soon as possible.

Wan-Intel, Feb 25 '24 03:02

Ref. 140654

avitial, May 07 '24 19:05