
ov.compile_model(quantized_model, device_name="GPU") is not working

Open paguilomanas opened this issue 1 year ago • 3 comments

πŸ› Describe the bug

Hi all,

My objective is to run inference on Intel GPUs with an INT8 model quantized using nncf, taking an OpenVINO model as the input for quantization.

I have been trying to apply post-training quantization to a custom vision model (a pretrained VGG16 PyTorch model) which I had already finetuned using "xpu" (Intel GPU Max Series). I saved the resulting weights from this finetuning in "pt_training_xpu_none.pt" (I cannot attach the file as it is too big).

First, I was able to convert this model to OpenVINO IR format like this:

import torch
import openvino as ov

from modules import VGG16Custom

model_path = "./pt_training_xpu_none.pt"

model = VGG16Custom()
model.load_state_dict(torch.load(model_path))
model.eval()

ov_model = ov.convert_model(model)

Afterwards I quantized this ov_model using nncf.quantize(), which worked but gave me several error outputs:

Quantization code:

from modules import img_generator_pytorch
import nncf

val_dataset_path = "./dataset/val"
quantized_model_path = "./pt_inference_xpu_ov.xml"

data_loader = img_generator_pytorch(val_dataset_path, batch_size=1, shuffle=True, drop_remainder=True)
calibration_dataset = nncf.Dataset(data_loader, transform_func=lambda x: x[0])

quantized_model = nncf.quantize(ov_model, calibration_dataset)
print("Model type: ", type(quantized_model))

ov.save_model(quantized_model, quantized_model_path)

Output from quantization code:

WARNING:nncf:NNCF provides best results with torch==2.3.*, while current torch version is 2.1.0.post2+cxx11.abi. If you encounter issues, consider switching to torch==2.3.*
Statistics collection ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 300/300 • 0:00:09 • 0:00:00
Applying Fast Bias correction ━━━━━━━━━━━━━━━━━━━ 100% 16/16 • 0:00:00 • 0:00:00
error: Kernel compiled with required subgroup size 8, which is unsupported on this platform
in kernel: 'fully_connected_gpu_imad_15034862163631771970_0_0__sa'
error: backend compiler failed build.

error: Kernel compiled with required subgroup size 8, which is unsupported on this platform
in kernel: 'fully_connected_gpu_imad_17660984341792800651_0_0__sa'
error: backend compiler failed build.

Model type:  <class 'openvino.runtime.ie_api.Model'>

Finally, for the compilation part, I load the quantized model (which should be in OpenVINO IR format) and try to compile it with core.compile_model() and device_name="GPU":

Compilation code:

core = ov.Core()
model = core.compile_model(model=quantized_model_path, device_name="GPU")

And I get the following output trace with errors:

2024-07-01 13:49:50,443 - root - INFO - Loading model from trained_models/pt_training_xpu_none.pt
2024-07-01 13:49:51,195 - root - INFO - 1730 images belonging to classes: ['Dead_Knot', 'knot_with_crack', 'resin']
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/cic/intel_sustainable_AI_phase2/modules/inference_pytorch.py", line 240, in inference_pt_nncf
    model = core.compile_model(
  File "/opt/anaconda3/envs/gpu-pt/lib/python3.9/site-packages/openvino/runtime/ie_api.py", line 543, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
RuntimeError: Exception from src/inference/src/cpp/core.cpp:121:
Exception from src/inference/src/dev/plugin.cpp:59:
Check 'false' failed at src/plugins/intel_gpu/src/plugin/program_builder.cpp:185:

I also ask myself how I can specify the device on which the quantization is carried out, in this case the Intel GPU. Is it already specified somewhere, or detected automatically? I have seen some config examples, but I'm not fully sure how to define and apply such a configuration.
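
For context, this is roughly what I had in mind. As far as I understand, nncf.quantize() only takes a target_device hint that selects the quantization scheme for the target hardware, not the device the calibration actually runs on, so this may not be the right knob at all (please correct me if that is wrong):

import nncf

# Hedged sketch: target_device hints which hardware the quantization parameters are
# tuned for; it does not move the calibration computation to the GPU
quantized_model = nncf.quantize(
    ov_model,
    calibration_dataset,
    target_device=nncf.TargetDevice.GPU,
)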

Environment

Python libraries versions:

nncf: 2.11.0
openvino: 2024.1.0
torch: 2.1.0.post2
torchaudio: 2.1.0.post2
torchvision: 0.16.0.post2

Device Info: x2 Intel(R) Data Center GPU Max 1100

Minimal Reproducible Example

To reproduce the issue, I think you can replace my custom VGG16 class with an instance of the PyTorch VGG16 model with the default weights (model = models.vgg16(weights="VGG16_Weights.DEFAULT")). For the data_loader used in the calibration part, this is my custom function (a minimal end-to-end sketch follows below):

from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder


def img_generator_pytorch(dataset_path: str, batch_size: int, shuffle: bool, drop_remainder: bool = False):

    # Standard ImageNet-style preprocessing for VGG16
    data_transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])

    dataset = ImageFolder(dataset_path, transform=data_transform)
    data_loader = DataLoader(dataset,
                             batch_size,
                             shuffle=shuffle,
                             drop_last=drop_remainder,
                             pin_memory=True)

    return data_loader
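
Putting it together, a minimal end-to-end sketch of the reproduction I have in mind (the paths and the torchvision VGG16 are placeholders standing in for my custom setup):

import openvino as ov
import nncf
from torchvision import models

val_dataset_path = "./dataset/val"                  # placeholder ImageFolder-style dataset
quantized_model_path = "./pt_inference_xpu_ov.xml"

# Stand-in for my custom VGG16 class
model = models.vgg16(weights="VGG16_Weights.DEFAULT")
model.eval()

# Convert to OpenVINO IR (dynamic shapes, as in my original code)
ov_model = ov.convert_model(model)

# Calibration dataset built from the generator above
data_loader = img_generator_pytorch(val_dataset_path, batch_size=1, shuffle=True, drop_remainder=True)
calibration_dataset = nncf.Dataset(data_loader, transform_func=lambda x: x[0])

quantized_model = nncf.quantize(ov_model, calibration_dataset)
ov.save_model(quantized_model, quantized_model_path)

# Compiling the quantized model for GPU is where the exception is raised
core = ov.Core()
compiled_model = core.compile_model(model=quantized_model_path, device_name="GPU")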

Are you going to submit a PR?

  • [ ] Yes I'd like to help by submitting a PR!

paguilomanas avatar Jul 01 '24 14:07 paguilomanas

@paguilomanas, please note that the quantization actually failed, but I think it's not due to NNCF but to some issue with your code or settings. Can you please try to compile and infer the converted OV model (not the quantized one) and check whether it works before quantization? Can you also specify CPU when you quantize the model (note that quantization can only be performed on CPU)? cc @alexsu52
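
For example, something like this rough sketch (the 1x3x224x224 input shape is an assumption based on your preprocessing):

import numpy as np
import openvino as ov

core = ov.Core()

# Compile the converted (not quantized) model on GPU and run one dummy inference
compiled_fp32 = core.compile_model(model=ov_model, device_name="GPU")
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled_fp32(dummy_input)
print(result)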

MaximProshin avatar Jul 02 '24 05:07 MaximProshin

@vladimir-paramuzov do you have any idea what the reason for this error could be?

error: Kernel compiled with required subgroup size 8, which is unsupported on this platform
in kernel: 'fully_connected_gpu_imad_15034862163631771970_0_0__sa'
error: backend compiler failed build.

error: Kernel compiled with required subgroup size 8, which is unsupported on this platform
in kernel: 'fully_connected_gpu_imad_17660984341792800651_0_0__sa'
error: backend compiler failed build.

Could these errors be due to a compilation error of the quantized model?

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/cic/intel_sustainable_AI_phase2/modules/inference_pytorch.py", line 240, in inference_pt_nncf
    model = core.compile_model(
  File "/opt/anaconda3/envs/gpu-pt/lib/python3.9/site-packages/openvino/runtime/ie_api.py", line 543, in compile_model
    super().compile_model(model, device_name, {} if config is None else config),
RuntimeError: Exception from src/inference/src/cpp/core.cpp:121:
Exception from src/inference/src/dev/plugin.cpp:59:
Check 'false' failed at src/plugins/intel_gpu/src/plugin/program_builder.cpp:185:

alexsu52 avatar Jul 02 '24 07:07 alexsu52

@paguilomanas @alexsu52 Looks like the issue was fixed by https://github.com/openvinotoolkit/openvino/pull/24691. The patch is not available in either the 2024.1 or the 2024.2 packages, so please try installing the nightly build:

pip install openvino-nightly

Also, you could try reshaping the model to static shapes; that will likely lead to oneDNN backend usage, so this error will not occur.
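
For instance, a rough sketch assuming a 1x3x224x224 input (adjust to your actual shape):

import openvino as ov
import nncf

# Pin the converted model to a static input shape before quantizing and compiling;
# with static shapes the GPU plugin can select oneDNN kernels instead of the failing one
ov_model.reshape([1, 3, 224, 224])
quantized_model = nncf.quantize(ov_model, calibration_dataset)
compiled = ov.Core().compile_model(quantized_model, device_name="GPU")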

vladimir-paramuzov avatar Jul 02 '24 09:07 vladimir-paramuzov

Thank you both!

@MaximProshin the OpenVINO model worked well when converted and then compiled (without quantizing). I still don't see where I could specify the device.

However, what finally solved the issue was reshaping the model to a static shape before quantizing, as @vladimir-paramuzov suggested. I did this by setting the input argument of ov.convert_model(model, input=[bs, 3, IMG_SIZE, IMG_SIZE]). The only catch is that this forces you to convert the model to a different shape every time you want to change the inference batch size.
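
In case it is useful to others, a possible way around that catch (only a sketch, I have not tested it thoroughly): convert once with dynamic shapes and call reshape() on the OpenVINO model for each batch size you need, instead of re-running ov.convert_model:

import openvino as ov

IMG_SIZE = 224  # same size as in my preprocessing transforms

# Convert once (dynamic shapes), then pin the static shape you need per run
ov_model = ov.convert_model(model)
ov_model.reshape([8, 3, IMG_SIZE, IMG_SIZE])  # hypothetical batch size of 8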

paguilomanas avatar Jul 05 '24 07:07 paguilomanas

@paguilomanas, thanks for your feedback! It means that there is some issue in the GPU plugin of OpenVINO; if you want to follow it up, feel free to open an issue at https://github.com/openvinotoolkit/openvino/issues

MaximProshin avatar Jul 05 '24 08:07 MaximProshin