intel-extension-for-pytorch Invalid output and errors using model = ipex.optimize(model): split master weight unsupported, Conv BatchNorm folding failed, Linear BatchNorm folding failed

Hi, trying to run inference with a pretrained OFA (OFA-huge) model according to these instructions:

https://github.com/OFA-Sys/OFA/blob/feature/add_transformers/transformers.md

This runs fine on both CPU and CUDA, but on XPU the output is gibberish. I also get several warnings, which go away when model = ipex.optimize(model) is commented out. Even then, with the .to('xpu') calls being essentially the only change from the CPU/CUDA version, the model still outputs gibberish.

Warnings from model = ipex.optimize(model):

  warnings.warn(
./OFA-huge
<super: <class 'OFATokenizer'>, <OFATokenizer object>>
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/workspace/pytorch/aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:447: UserWarning: For XPU device, the split master weight is unsupported for now, so temp to disable it
  warnings.warn("For XPU device, the split master weight is unsupported for now, so temp to disable it")
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:457: UserWarning: For XPU device to save valuable device memory, temp to do optimization on inplaced model, so                     make inplace to be true
  warnings.warn(
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:464: UserWarning: For XPU, the weight prepack and sample input are disabled. The onednn layout                     is automatically chosen to use
  warnings.warn(
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:486: UserWarning: Conv BatchNorm folding failed during the optimize process.
  warnings.warn("Conv BatchNorm folding failed during the optimize process.")
/home/mediamatik/.virtualenvs/keplermatik_whisper_api/lib/python3.10/site-packages/intel_extension_for_pytorch/frontend.py:491: UserWarning: Linear BatchNorm folding failed during the optimize process.
  warnings.warn("Linear BatchNorm folding failed during the optimize process.")

With XPU (gibberish output): [' this is the ch ch chaval all the is is the word for the band that is']

With CPU/CUDA (correct output): [' a black and white photo of a wolf walking through the woods at night.']
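
To narrow down where the two devices start to disagree, it can help to compare intermediate values (for example logits) numerically rather than only the decoded text. The following is a minimal sketch, not part of the original report: compare_outputs is a hypothetical helper that works on plain lists of floats, which are assumed to come from something like tensor.flatten().tolist() after moving the XPU result back to the CPU.

```python
import math


def compare_outputs(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Return the indices where candidate differs from reference beyond tolerance.

    Both arguments are flat sequences of floats. NaN values never compare
    as close, so they are always reported as mismatches.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    return [
        i
        for i, (ref, cand) in enumerate(zip(reference, candidate))
        if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Hypothetical values standing in for CPU and XPU results.
cpu_values = [1.0, 2.0, 3.0]
xpu_values = [1.0, 2.0005, 9.0]
print(compare_outputs(cpu_values, xpu_values))
```

A small number of mismatches near the tolerance usually points at ordinary precision differences, while large differences across most indices would suggest a faulty kernel or layout conversion. The tolerances here are assumptions and should be tuned for the data type in use.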

I'm running Ubuntu 22.04 with 1.13.10+xpu, code is below:

import warnings
from PIL import Image
from torchvision import transforms
from transformers import OFATokenizer, OFAModel
import intel_extension_for_pytorch as ipex

chkpt_dir = "./OFA-huge"
path_to_image = "image.jpg"
mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
resolution = 256
patch_resize_transform = transforms.Compose([
        lambda image: image.convert("RGB"),
        transforms.Resize((resolution, resolution), interpolation=Image.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize(mean=mean, std=std)
    ])


tokenizer = OFATokenizer.from_pretrained(chkpt_dir)

txt = " what does the image describe?"
inputs = tokenizer([txt], return_tensors="pt").input_ids
img = Image.open(path_to_image)
patch_img = patch_resize_transform(img).unsqueeze(0)

model = OFAModel.from_pretrained(chkpt_dir, use_cache=False)
model = model.to("xpu")
patch_img = patch_img.to("xpu")
inputs = inputs.to("xpu")
model = ipex.optimize(model)

gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)

print(tokenizer.batch_decode(gen, skip_special_tokens=True))

Image:

Thanks!

Feb 19 '23 01:02 nathanodle

Which GPU did you run on?

Feb 19 '23 20:02 jingxu10

We will look into this issue.

Feb 19 '23 21:02 jingxu10

Which GPU did you run on?

Sorry, I should have mentioned that. Arc 770, latest drivers on Ubuntu.

Thank you very much for looking into this, I really appreciate it!

Feb 20 '23 03:02 nathanodle

Is there an ETA for someone to look at this? Just curious, as I have a project I'm trying to validate on Arc. Thanks!

Feb 27 '23 00:02 nathanodle

We are looking into this issue and will update later. It seems we have found some issues.

Feb 27 '23 22:02 jingxu10

similar issue while trying to run openai-whisper on A770

     from . import load_model
+    import intel_extension_for_pytorch as ipex

     model = load_model(model_name, device=device, download_root=model_dir)
+    model.eval()
+    model = model.to('xpu')
+    ipex.optimize(model)

whisper --model tiny --language en --task transcribe --device xpu ...

results in

intel_extension_for_pytorch/frontend.py:264: UserWarning: Conv BatchNorm folding failed during the optimize process.
intel_extension_for_pytorch/frontend.py:277: UserWarning: pending the optimization for LSTM

Whisper then fails to decode the tokens.

torch                       1.10.0a0+git3d5f2d4
intel-extension-for-pytorch 1.10.200+gpu

. /opt/intel/oneapi/tbb/2021.8.0/env/vars.sh
. /opt/intel/oneapi/compiler/2022.2.0/env/vars.sh
. /opt/intel/oneapi/mkl/2022.2.0/env/vars.sh

> sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2022.14.7.0.30_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 9 5900X 12-Core Processor             3.0 [2022.14.7.0.30_160000]
[opencl:gpu:2] Intel(R) OpenCL HD Graphics, Intel(R) Graphics [0x56a0] 3.0 [22.49.25018.23]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x56a0] 1.3 [1.3.25018]
[host:host:0] SYCL host platform, SYCL host device 1.2 [1.2]

> uname -a
Linux 5.17.0-1020-oem #21-Ubuntu SMP PREEMPT Fri Oct 14 09:33:24 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Mar 23 '23 07:03 leuc

Update: I have also tried this with an Intel i9-11900K CPU and an A770, with the same result. The first attempt was on an AMD Threadripper. The code does not work on either platform. Is there a timeline for this issue? Thanks so much!

Apr 02 '23 05:04 nathanodle

This issue will be fixed in the next release soon.

Apr 02 '23 20:04 jingxu10

Just a note, I have gotten bad results with every single model I've tried to use with XPU, it's not limited to this model. From my perspective, ARC has been unusable for almost 2 months now. I bought 6 Arc A770s for a project and this has been a waste so far.

I understand that I'm just one user and your team has their own plan. Can you give me anything to help me use these cards though? Is there a branch I can try or at least can you provide a release date so I know if I should continue trying with this hardware? Thanks very much!

Apr 16 '23 04:04 nathanodle

This incorrect output issue has been fixed in the latest code base. The next release is still pending, but in the meantime you can try compiling from source with https://github.com/intel/intel-extension-for-pytorch/blob/xpu-master/scripts/compile_bundle.sh. You need oneAPI Base Toolkit 2023.1 and driver release 602: https://dgpu-docs.intel.com/releases/stable_602_20230323.html
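
After a source build it is easy to end up with an older wheel still on the path, so it may be worth confirming which versions the interpreter actually sees. This is a minimal sketch using only the standard library. The distribution names are taken from the package listings posted in this thread, and installed_versions is a hypothetical helper name, not part of IPEX.

```python
from importlib import metadata

# Distribution names as they appear in the package listings in this thread.
PACKAGES = ("torch", "intel-extension-for-pytorch", "torchvision", "torchaudio")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:30s} {version or 'not installed'}")
```

A source build normally carries a git hash suffix in its version string, which makes it easy to tell apart from a released wheel.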

Apr 20 '23 07:04 jingxu10

Hi @leuc, for now please try compiling the latest code from source. Please refer to the comment above.

Apr 20 '23 07:04 jingxu10

compilation took hours and multiple attempts, but whisper is working with the xpu-master branch and even loads the large model into the 16GB VRAM.

$ whisper --language en --model large --device xpu some.mp3
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:484: UserWarning: Split Master Weight feature is not supported on XPU for now, disabled.
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:494: UserWarning: To reduce device memory usage on XPU, optimization are done inplace, setting the inplace argument to True.
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:500: UserWarning: Weight Prepack and Sample Input are both disabled on XPU. The Onednn Layout is automatically applied.
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:506: UserWarning: For XPU, the optimize_lstm(replace lstm with ipex_lstm) is unsupported, so disable it
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:526: UserWarning: Conv BatchNorm folding failed during the optimize process.
python3.10/site-packages/intel_extension_for_pytorch/frontend.py:531: UserWarning: Linear BatchNorm folding failed during the optimize process.

speed looks ok'ish, but given the warnings probably room for improvement.

intel_gpu_top shows 52% Render, 75% Blitter, 24% unknown.

whisper patch

diff --git a/whisper/transcribe.py b/whisper/transcribe.py
index ed6d820..0d9e3c8 100644
--- a/whisper/transcribe.py
+++ b/whisper/transcribe.py
@@ -429,8 +429,13 @@ def cli():
         torch.set_num_threads(threads)

     from . import load_model
+    import intel_extension_for_pytorch as ipex

     model = load_model(model_name, device=device, download_root=model_dir)
+    model.eval()
+    model = model.to(device)
+    if device == 'xpu':
+        ipex.optimize(model)

     writer = get_writer(output_format, output_dir)
     for audio_path in args.pop("audio"):

python modules

openai-whisper              20230314
intel-extension-for-pytorch 1.13.120+git5fdf9e6
torch                       1.13.0a0+git49444c3
torchaudio                  0.13.1+b90d798
torchvision                 0.14.1a0+5e8e2f1

> sycl-ls
[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.05.25593.18]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.25593]

apt packages

intel-i915-dkms 1.23.3.19.230122.18.5.17.0.1020+i38-1
intel-dpcpp-cpp-compiler-2023.1.0 2023.1.0-46347
intel-oneapi-mkl-2023.1.0         2023.1.0-46342
intel-oneapi-mkl-devel-2023.1.0   2023.1.0-46342
kernel 5.17.0-1020-oem

Apr 20 '23 20:04 leuc

above warnings go away when ipex.optimize(model) is omitted

found a metric to display GPU memory usage using lsgpu

normal usage

> lsgpu -p | grep ^lmem_
lmem_avail_bytes                : 16260284416
lmem_total_bytes                : 17079205888

openai whisper large model loaded

lmem_avail_bytes                : 4605845504
lmem_total_bytes                : 17079205888
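
The difference between the two lmem_avail_bytes readings gives a rough estimate of how much device memory the loaded model takes. A minimal sketch, assuming the `lsgpu -p` output keeps the "name : value" layout shown above; parse_lmem is a hypothetical helper, and the figure also includes any other allocations made by the process.

```python
import re


def parse_lmem(text):
    """Extract lmem_* counters from `lsgpu -p` output into a dict of ints."""
    pattern = re.compile(r"^(lmem_\w+)\s*:\s*(\d+)\s*$", re.MULTILINE)
    return {key: int(value) for key, value in pattern.findall(text)}


# The two readings quoted above: before and after loading the large model.
idle = parse_lmem("""\
lmem_avail_bytes                : 16260284416
lmem_total_bytes                : 17079205888
""")
loaded = parse_lmem("""\
lmem_avail_bytes                : 4605845504
lmem_total_bytes                : 17079205888
""")

# Drop in available memory between the idle and loaded readings.
model_bytes = idle["lmem_avail_bytes"] - loaded["lmem_avail_bytes"]
print(f"model footprint: {model_bytes / 2**30:.2f} GiB")
```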

Apr 21 '23 00:04 leuc

took hours to build, so uploaded unofficial wheels of xpu-master here: https://github.com/leuc/intel-extension-for-pytorch/releases/tag/v1.13.120%2Bgit5fdf9e6

Apr 21 '23 16:04 leuc

@leuc How much RAM does your computer have? It builds in around 20-25 min on my workstation, using slightly under 20GB of memory. However, when I tried building with a GitHub Actions workflow I made (per the GitHub docs, the VM has 7GB of memory) or on a self-hosted runner on a laptop with 8GB of RAM, the build never finished.

@jingxu10 Having something akin to a nightly beta build from Intel could be really useful here.

Apr 22 '23 05:04 fredlarochelle

@fredlarochelle it wasn't a resource issue, but the script doesn't build well without conda. I may work on a PR for better portability, with aim for CI/CD and containers.

Apr 22 '23 07:04 leuc

@leuc Yeah, I know about conda + the GCC 11 requirement, however I had no luck with GCC 11, not consistent at all, got it working way better with GCC 9. We should probably have a look into the compiler flags used too.

Apr 22 '23 13:04 fredlarochelle

What are the error messages? I would recommend doing the compilation in a Docker container.

Apr 22 '23 20:04 jingxu10

addressed some build issues with PR https://github.com/intel/intel-extension-for-pytorch/pull/334

Apr 23 '23 00:04 leuc

I'm using a tiny test network that is just one linear layer. Using the updated build I still get:

/usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
  warn(f"Failed to load image Python extension: {e}")
/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/frontend.py:484: UserWarning: Split Master Weight feature is not supported on XPU for now, disabled.
  warnings.warn("Split Master Weight feature is not supported on XPU for now, disabled.")
/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/frontend.py:500: UserWarning: Weight Prepack and Sample Input are both disabled on XPU. The Onednn Layout is automatically applied.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/frontend.py:506: UserWarning: For XPU, the optimize_lstm(replace lstm with ipex_lstm) is unsupported, so disable it

I don't know how this is possible because there's no LSTM at all!


import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
import math
import os
import glob
import random
import librosa
import soundfile as sf
import numpy as np


import intel_extension_for_pytorch as ipex

default_device = torch.device("xpu")



class DummyLayer(nn.Module):
    def __init__(self):
        super(DummyLayer, self).__init__()
        self.layer = nn.Linear(1, 1)
    
    def forward(self, src):
        src = src.unsqueeze(-1)
        src = self.layer(src)
        src = src.squeeze(-1)
        return src


model = DummyLayer()
model.to(default_device)
criterion = nn.MSELoss()
lr_factor = 0.1
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16, inplace=True)


target_sample_rate=8000
def load_file(path):
    data, sample_rate = librosa.load(path, sr=target_sample_rate)
    data = torch.from_numpy(data)
    data = data.unsqueeze(0)
    data = torch.mean(data.to(default_device), dim=0).unsqueeze(0)

    return data

train = load_file("testrecording_8k.wav")
target = load_file("testrecording_target_8k.wav")

# Training loop
num_epochs = 150000
for epoch in range(num_epochs):

    print("running")

    # batch = batch.to(memory_format=torch.channels_last)
    # target = target.to(memory_format=torch.channels_last)
    train = train.bfloat16()
    target = target.bfloat16()
    
    optimizer.zero_grad()
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        output = model(train)
        
    loss = criterion(output, target)
    
    print(f'Epoch: {epoch+1}/{num_epochs}, Step: {epoch+1}, Loss: {loss.item()}')

    print("output", output.cpu())
    print("target", target.cpu())
    loss.backward()
    optimizer.step()

    print(f'Epoch: {epoch+1}/{num_epochs}, Step: {epoch+1}, Loss: {loss.item()}')

    # every few steps save the output
    if (epoch+1) % 50 == 0:
        # Save the output to file
        output = torch.flatten(output, start_dim=0)
        
        print(output.size())
        sf.write("samples2/testrecording_8k_progress2_" + str(epoch) + ".wav", output.float().cpu().detach().numpy(), target_sample_rate)

Apr 24 '23 17:04 ghost

@zejun-chen Is this a known issue we already fixed?

May 10 '23 15:05 gujinghui

Hi @turbobuilt, thank you for using IPEX. The warning messages are emitted by model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16, inplace=True). This interface applies most of the IPEX optimizations to the model. It has an argument named level, which defaults to O1. At O1, most optimizations are attempted even if the model has no matching layers. On XPU, some of these optimizations are disabled (on CPU they are enabled), for example split master weight (we will support it soon), weight prepack, and LSTM optimization, so the warnings only report that these optimizations were skipped on XPU.

@gujinghui This is caused by our warning messages from ipex.optimize.
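
Since these messages are informational on XPU, one option is to filter them with Python's standard warnings module while leaving other warnings visible. This is a minimal sketch, not an IPEX feature: the message prefixes are copied from the warnings quoted in this thread and may change between releases, and silence_ipex_info_warnings is a hypothetical helper name.

```python
import re
import warnings

# Message prefixes copied from the warnings quoted in this thread.
# They are an assumption about the wording and may differ in other releases.
IPEX_INFO_PREFIXES = (
    "Split Master Weight feature is not supported on XPU",
    "Weight Prepack and Sample Input are both disabled on XPU",
    "For XPU, the optimize_lstm",
)


def silence_ipex_info_warnings():
    """Ignore the informational XPU warnings; every other warning still shows."""
    for prefix in IPEX_INFO_PREFIXES:
        # `message` is a regular expression matched against the start of the text.
        warnings.filterwarnings(
            "ignore", message=re.escape(prefix), category=UserWarning
        )
```

Calling silence_ipex_info_warnings() before ipex.optimize(...) would hide the three messages above. The BatchNorm folding warnings are deliberately not filtered in this sketch, so they would still be shown.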

May 11 '23 01:05 zejun-chen

Is this included in the latest drivers, or do I still need to compile from source? I just ordered two A770s to fine-tune and run Whisper and some other models.

Jan 03 '25 21:01 gurjaapsingh

I have found something interesting. I use IPEX for Llama CPU fine-tuning, and I get worse performance with the 'train' optimization. In my case it works better to do this:

model = get_peft_model(model, peft_config)
model.eval()
model = ipex.optimize(model)
model.train()

instead of:

model = get_peft_model(model, peft_config)
model, optimizer = ipex.optimize(model, optimizer=Lion(...))

Or maybe the cause of that is an unsupported optimizer?

P.S. I am also struggling with long-context forward passes due to >4GB allocations. I already looked through other issues and didn't find proper fixes there. (Arc A770 16GB)

Jan 07 '25 05:01 paNikitin
