
FP64 Emulation Support is Broken; Cannot Run Own Scripts

Open tedliosu opened this issue 3 years ago • 11 comments

Original script (bench_mixed-precision_vs_single-precision_pytorch_ipex.py) I am running using IPEX:

import torch
import intel_extension_for_pytorch as ipex
import matplotlib.pyplot as plt
import time
import sys
torch.set_default_tensor_type(torch.FloatTensor)

# Comment out the following two lines if running this script in a ROCm environment
# import torch.backends.cudnn as cudnn
# cudnn.benchmark = True

def grid(width, height):
  hrange = torch.arange(width).unsqueeze(0).repeat([height, 1]).div(width)
  vrange = torch.arange(height).unsqueeze(1).repeat([1, width]).div(height)
  output = torch.stack([hrange, vrange], 0).float()
  return output


def checker(width, height, freq):
  hrange = torch.arange(width).reshape([1, width]).mul(freq / width / 2.0).fmod(1.0).gt(0.5)
  vrange = torch.arange(height).reshape([height, 1]).mul(freq / height / 2.0).fmod(1.0).gt(0.5)
  output = hrange.logical_xor(vrange).float()
  return output

if len(sys.argv) > 1 and sys.argv[1] != "bench_mixed_precision":
    print("\nUsage:", sys.argv[0], "[bench_mixed_precision]\n")
    quit()

# Note the inputs are grid coordinates and the target is a checkerboard
inputs = grid(384, 384).unsqueeze(0).to("xpu")
targets = checker(384, 384, 8).unsqueeze(0).unsqueeze(1).to("xpu")

class Net(torch.jit.ScriptModule):
  def __init__(self):
    super().__init__()
    self.net = torch.nn.Sequential(
      torch.nn.Conv2d(2, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 1, 1))

  @torch.jit.script_method
  def forward(self, x):
    return self.net(x)

net = Net().to("xpu")
loss_fn = torch.nn.MSELoss().to("xpu")
opt = torch.optim.Adam(net.parameters(), 0.001)
net, opt = ipex.optimize(net, optimizer=opt, dtype=torch.float32)

print("Starting training loop, please be patient...")

start_time = time.time()

# for i in range(400):
for i in range(3):
  opt.zero_grad()
  if len(sys.argv) > 1 and sys.argv[1] == "bench_mixed_precision":
      with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
        outputs = net(inputs)
        loss = loss_fn(outputs, targets)
  else:
      outputs = net(inputs)
      loss = loss_fn(outputs, targets)
  loss.backward()
  opt.step()
#  if (i + 1) % 50 == 0:
  if (i + 1) % 1 == 0:
      print(loss)
      print("Completed iteration %d/%d" % (i + 1, 400))

torch.xpu.synchronize()
print(f"Training completed in {time.time() - start_time} seconds :)")
print(loss)

Running the script without first doing export OverrideDefaultFP64Settings=1 && export IGC_EnableDPEmulation=1 yields the following output (the same messages repeat many times; truncated here):

root@d8b5bd7be0b9:/workspace# python3 bench_mixed-precision_vs_single-precision_pytorch_ipex.py | tee ipex_bench_output1.txt
[CRITICAL ERROR] Kernel '_ZTSZZN2at15AtenIpexTypeXPU17dpcppMemoryScale1IffEEvPT_PKT0_mdENKUlRN2cl4sycl7handlerEE_clESA_EUlNS8_4itemILi1ELb1EEEE_' removed due to usage of FP64 instructions unsupported by the targeted hardware. Running this kernel may result in unexpected results.
[CRITICAL ERROR] Kernel '_ZTSZZN2at15AtenIpexTypeXPU17dpcppMemoryScale1IffEEvPT_PKT0_mdENKUlRN2cl4sycl7handlerEE_clESA_EUlNS8_4itemILi1ELb1EEEE_' removed due to usage of FP64 instructions unsupported by the targeted hardware. Running this kernel may result in unexpected results.
...
[CRITICAL ERROR] Kernel '_ZTSZZN2at15AtenIpexTypeXPUL24launch_vectorized_kernelINS0_13BUnaryFunctorIddbZZZNS0_4impl15gt_kernel_dpcppERNS_14TensorIteratorEENKUlvE_clEvENKUlvE1_clEvEUlddE_EEN3xpu5dpcpp5ArrayIPcLi2EEE23TrivialOffsetCalculatorILi1EjEEEvlRKT_T0_T1_iENKUlRN2cl4sycl7handlerEE0_clESP_EUlNSN_7nd_itemILi1EEEE_' removed due to usage of FP64 instructions unsupported by the targeted hardware. Running this kernel may result in unexpected results.
...

And if I enable FP64 emulation by doing export OverrideDefaultFP64Settings=1 && export IGC_EnableDPEmulation=1 I get the following output instead:

root@d8b5bd7be0b9:/workspace# python3 bench_mixed-precision_vs_single-precision_pytorch_ipex.py 2>&1 | tee ipex_bench_output2.txt
Starting training loop, please be patient...
Traceback (most recent call last):
  File "/workspace/bench_mixed-precision_vs_single-precision_pytorch_ipex.py", line 75, in <module>
    print(loss)
  File "/opt/intel/oneapi/intelpython/latest/lib/python3.9/site-packages/torch/_tensor.py", line 249, in __repr__
    return torch._tensor_str._str(self)
  File "/opt/intel/oneapi/intelpython/latest/lib/python3.9/site-packages/torch/_tensor_str.py", line 415, in _str
    return _str_intern(self)
  File "/opt/intel/oneapi/intelpython/latest/lib/python3.9/site-packages/torch/_tensor_str.py", line 390, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/opt/intel/oneapi/intelpython/latest/lib/python3.9/site-packages/torch/_tensor_str.py", line 251, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/opt/intel/oneapi/intelpython/latest/lib/python3.9/site-packages/torch/_tensor_str.py", line 102, in __init__
    if value != torch.ceil(value):
RuntimeError: Native API failed. Native API returns: -1 (CL_DEVICE_NOT_FOUND) -1 (CL_DEVICE_NOT_FOUND)

As you can see, the script I am running (and probably many more scripts I may run in the future, for that matter) requires FP64 instructions to be supported, but because support for such instructions is broken right now as shown above, I absolutely cannot run the workloads I'd like to run using the IPEX binary wheel that you guys have distributed (and the same issue occurs when I try building my own wheels from source). Could you please fix this so that I can properly take advantage of FP64 emulation in IPEX, or at least give me advice on how I can modify my script so that I can run it without needing to resort to enabling FP64 emulation?

I'll be more than happy to provide additional info about my system if needed to help solve this issue. :smile:

tedliosu avatar Nov 11 '22 08:11 tedliosu

Hi, we don't suggest using export OverrideDefaultFP64Settings=1 and export IGC_EnableDPEmulation=1. That error message is expected because double is not supported on ATS-M, and the double dtype will not be supported at the software level either. Would you mind changing your code to run with FP32?
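
For example, something along these lines would keep every tensor in your script explicitly in FP32 (just a rough sketch based on your posted script, not a verified fix for the removed kernels):

# Rough FP32-only sketch; grid(), checker() and Net are the definitions from the script above.
import torch
import intel_extension_for_pytorch as ipex

torch.set_default_dtype(torch.float32)   # Python floats now create float32 tensors

inputs = grid(384, 384).unsqueeze(0).to("xpu", dtype=torch.float32)
targets = checker(384, 384, 8).unsqueeze(0).unsqueeze(1).to("xpu", dtype=torch.float32)

net = Net().float().to("xpu")            # cast all parameters and buffers to float32
loss_fn = torch.nn.MSELoss().to("xpu")
opt = torch.optim.Adam(net.parameters(), lr=0.001)
net, opt = ipex.optimize(net, optimizer=opt, dtype=torch.float32)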

jingxu10 avatar Nov 11 '22 09:11 jingxu10

Hi, we don't suggest using export OverrideDefaultFP64Settings=1 and export IGC_EnableDPEmulation=1. That error message is expected because double is not supported on ATS-M, and the double dtype will not be supported at the software level either. Would you mind changing your code to run with FP32?

@jingxu10 Sorry, I forgot to clarify that I am actually running my code on an i5-11400H's Tiger Lake integrated graphics, which, like ATS-M, does not support FP64 instructions. But how would I go about changing my code to run with FP32? I don't see any apparent FP64 usage anywhere in the code that I have posted. :confused:
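
For what it's worth, here's the kind of quick dtype audit I've been running to look for any float64 tensors sneaking in (a minimal sketch; it only covers the tensors my script creates directly, not anything IPEX allocates internally):

# Minimal dtype audit; inputs, targets and net are the objects from my script above.
for name, t in [("inputs", inputs), ("targets", targets)]:
    print(name, t.dtype)                 # expect torch.float32 for both

for pname, p in net.named_parameters():
    assert p.dtype == torch.float32, f"{pname} is {p.dtype}"

for bname, b in net.named_buffers():
    print(bname, b.dtype)                # BatchNorm running stats are float32, num_batches_tracked is int64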

tedliosu avatar Nov 11 '22 12:11 tedliosu

Hi, we don't suggest using export OverrideDefaultFP64Settings=1 and export IGC_EnableDPEmulation=1. That error message is expected because double is not supported on ATS-M, and the double dtype will not be supported at the software level either. Would you mind changing your code to run with FP32?

@jingxu10 UPDATE - unfortunately I was NOT able to identify which instructions in my Python script rely on FP64 emulation, BUT I did figure out a way to work around the error that pops up whenever I do enable FP64 emulation (the workaround is marked by ########## BEGIN WORKAROUND ########## and ########## END WORKAROUND ########## within the copy of the script as copy-pasted from my editor below):

import torch
import intel_extension_for_pytorch as ipex
# Uncomment following line if NOT running in a container like podman or docker
# import matplotlib.pyplot as plt
import time
import sys
torch.set_default_tensor_type(torch.FloatTensor)

# Comment out the following two lines if running this script in a ROCm environment
# import torch.backends.cudnn as cudnn
# cudnn.benchmark = True

def grid(width, height):
  hrange = torch.arange(width).unsqueeze(0).repeat([height, 1]).div(width)
  vrange = torch.arange(height).unsqueeze(1).repeat([1, width]).div(height)
  output = torch.stack([hrange, vrange], 0).float()
  return output


def checker(width, height, freq):
  hrange = torch.arange(width).reshape([1, width]).mul(freq / width / 2.0).fmod(1.0).gt(0.5)
  vrange = torch.arange(height).reshape([height, 1]).mul(freq / height / 2.0).fmod(1.0).gt(0.5)
  output = hrange.logical_xor(vrange).float()
  return output

if len(sys.argv) > 1 and sys.argv[1] != "bench_mixed_precision":
    print("\nUsage:", sys.argv[0], "[bench_mixed_precision]\n")
    quit()

# Note the inputs are grid coordinates and the target is a checkerboard
inputs = grid(384, 384).unsqueeze(0).to("xpu")
targets = checker(384, 384, 8).unsqueeze(0).unsqueeze(1).to("xpu")

class Net(torch.jit.ScriptModule):
  def __init__(self):
    super().__init__()
    self.net = torch.nn.Sequential(
      torch.nn.Conv2d(2, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 1, 1))

  @torch.jit.script_method
  def forward(self, x):
    return self.net(x)

net = Net()
net.train()
loss_fn = torch.nn.MSELoss().to("xpu")
opt = torch.optim.Adam(net.parameters(), 0.001)
net = net.to("xpu")
net, opt = ipex.optimize(net, optimizer=opt, dtype=torch.float32)

print("Starting training loop, please be patient...")

start_time = time.time()

# for i in range(400):
for i in range(3):
  opt.zero_grad()
  if len(sys.argv) > 1 and sys.argv[1] == "bench_mixed_precision":
      with torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
        outputs = net(inputs)
        loss = loss_fn(outputs, targets)
  else:
      outputs = net(inputs)
      loss = loss_fn(outputs, targets)
  loss.backward()
  opt.step()
#  if (i + 1) % 50 == 0:
  if (i + 1) % 1 == 0:
########## BEGIN WORKAROUND ##########
      temp_tensor_vals_list = []
      temp_tensor_vals_list.append(loss.tolist())
      print('tensor(' + ','.join(['{:.4e}'.format(var) for var in temp_tensor_vals_list]) +
            ', device=\'' + str(loss.device) + '\', grad_fn=' +
            str(loss.grad_fn).split(" ")[0] + '>)')
########## END WORKAROUND ##########
      print("Completed iteration %d/%d" % (i + 1, 400))

torch.xpu.synchronize()
print(f"Training completed in {time.time() - start_time} seconds :)")
########## BEGIN WORKAROUND ##########
temp_tensor_vals_list = []
temp_tensor_vals_list.append(loss.tolist())
print('tensor(' + ','.join(['{:.4e}'.format(var) for var in temp_tensor_vals_list]) +
      ', device=\'' + str(loss.device) + '\', grad_fn=' +
      str(loss.grad_fn).split(" ")[0] + '>)')
########## END WORKAROUND ##########
# print(loss)

Essentially, the workaround prints out the same info as if I'd simply written print(loss), EXCEPT that it also works when FP64 emulation is enabled. :sunglasses: And here's some sample output of the above script with the workaround, from my terminal running a Docker instance of IPEX, for your reference (and for others who may stumble upon this same issue):

root@7850376c0f2e:/workspace# export OverrideDefaultFP64Settings=1 && export IGC_EnableDPEmulation=1
root@7850376c0f2e:/workspace# python3 bench_mixed-precision_vs_single-precision_pytorch_ipex.py
Starting training loop, please be patient...
tensor(9.5251e-01, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 1/400
tensor(1.2738e+00, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 2/400
tensor(5.2136e-01, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 3/400
Training completed in 528.7019538879395 seconds :)
tensor(5.2136e-01, device='xpu:0', grad_fn=<MseLossBackward0>)
root@7850376c0f2e:/workspace# python3 bench_mixed-precision_vs_single-precision_pytorch_ipex.py
Starting training loop, please be patient...
tensor(7.1851e-01, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 1/400
tensor(1.2455e+00, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 2/400
tensor(4.5935e-01, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 3/400
Training completed in 552.707841873169 seconds :)
tensor(4.5935e-01, device='xpu:0', grad_fn=<MseLossBackward0>)
root@7850376c0f2e:/workspace#
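
In case anyone else wants to reuse the workaround, the same formatting logic can be pulled into a small helper instead of being copy-pasted twice (just a sketch; describe_tensor is only a name I made up for illustration):

def describe_tensor(t):
    # Mimic what print(t) shows for a 0-dim loss tensor, but only go through .tolist(),
    # avoiding the torch._tensor_str code path that raised CL_DEVICE_NOT_FOUND above.
    value_str = '{:.4e}'.format(t.tolist())
    grad_fn_str = str(t.grad_fn).split(" ")[0] + '>' if t.grad_fn is not None else 'None'
    return "tensor(" + value_str + ", device='" + str(t.device) + "', grad_fn=" + grad_fn_str + ")"

# Usage inside the training loop and at the end of the script:
#   print(describe_tensor(loss))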

Now, for the love of all that is good about tech, could you PLEASE fix whatever in IPEX (or in any underlying software that IPEX runs on top of) is causing this pesky CL_DEVICE_NOT_FOUND error when FP64 emulation is enabled for my script, so that people like me do NOT have to resort to such janky workarounds in the future???

tedliosu avatar Nov 13 '22 10:11 tedliosu

@tedliosu

Thanks for the post.

As @jingxu10 mentioned, FP64 is not supported by this kind of HW. Therefore, with the current solution, we do not support any ops that have FP64 instructions in the kernel body.

To enable FP64 emulation on this HW, you have to set the values below: export OverrideDefaultFP64Settings=1 and export IGC_EnableDPEmulation=1

and also remove the build options in this line. https://github.com/intel/intel-extension-for-pytorch/blob/9b770aaab9bbf2fe6e21f129387c83e425784111/cmake/DPCPP.cmake#L122

gujinghui avatar Nov 17 '22 07:11 gujinghui

@tedliosu

Thanks for the post.

As @jingxu10 mentioned, FP64 is not supported by this kind of HW. Therefore, with the current solution, we do not support any ops that have FP64 instructions in the kernel body.

To enable FP64 emulation on this HW, you have to set the values below: export OverrideDefaultFP64Settings=1 and export IGC_EnableDPEmulation=1

and also remove the build options in this line.

https://github.com/intel/intel-extension-for-pytorch/blob/9b770aaab9bbf2fe6e21f129387c83e425784111/cmake/DPCPP.cmake#L122

Hello @gujinghui,

YES, I've already done that for my own builds, and YET I still ran into the CL_DEVICE_NOT_FOUND errors as mentioned in my original comment. Could you PLEASE fix your code so that I don't run into the CL_DEVICE_NOT_FOUND errors anymore in the future? Are there any questions that you may have about my request?

Sincerely, Ted

tedliosu avatar Nov 17 '22 08:11 tedliosu

Hello @tedliosu,

You're trying your scripts on an integrated GPU?

With this release, integrated GPUs are not supported. As the doc mentions, we verified on the Intel® Data Center GPU Flex Series 170 card. https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html

Sorry for the inconvenience.

gujinghui avatar Nov 17 '22 12:11 gujinghui

Hello @tedliosu,

You're trying your scripts on an integrated GPU?

With this release, integrated GPUs are not supported. As the doc mentions, we verified on the Intel® Data Center GPU Flex Series 170 card. https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/installation.html

Sorry for the inconvenience.

@gujinghui what about Arc Alchemist discrete graphics? Is that at least "unofficially" supported? I don't have the money to pay for nor the server setup to accommodate a Flex series 170. 😔

tedliosu avatar Nov 17 '22 14:11 tedliosu

@tedliosu This is our first release, mainly for data center GPUs. We are taking client dGPUs into account. Please give us more time.

The Arc Alchemist dGPU has a similar architecture, so XPU should work on it as well. But without full verification, I cannot ask you to risk wasting your money. :(

gujinghui avatar Nov 17 '22 15:11 gujinghui

I also get these messages printed out despite having absolutely no FP64 code. The message does not appear if you disable profiling.

Reproducer:

import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

if __name__ == '__main__':
    x = nn.Sequential(nn.Conv2d(3, 64, (3, 3)),
                      nn.BatchNorm2d(64)) 
    inp = torch.randn(1,3,64,64)
    traced = torch.jit.trace(x, inp)
    traced.to('xpu')
    inp = inp.to('xpu')
    # Run twice
    out = traced(inp)
    out = traced(inp)

Output (100 or so of these):

[CRITICAL ERROR] Kernel '_ZTSZZN2at15AtenIpexTypeXPU17dpcppMemoryScale2IffEEvPT_PKT0_mfdENKUlRN2cl4sycl7handlerEE_clESA_EUlNS8_4itemILi1ELb1EEEE_' removed due to usage of FP64 instructions unsupported by the targeted hardware. Running this kernel may result in unexpected results.

xsacha avatar Nov 23 '22 06:11 xsacha

Looks like I'll have to find some time and test this: https://github.com/intel/intel-extension-for-pytorch/issues/268#issuecomment-1422230392

I'll close this issue if this turns out to be fixed; it's just that I'll have to compile IPEX from scratch to test it on my Intel iGPU, because I don't have the budget for an Arc card right now and I'm also getting really busy with school, so please forgive me if it's a while before I report back. :pray:

tedliosu avatar Sep 16 '23 04:09 tedliosu

So today I finally got around to attempting to reproduce the original issue I had encountered with the script posted at the top of this issue, and thankfully I think it's safe to say that the issue has been fixed as of IPEX v2.0.110+xpu; I didn't even have to use the workaround mentioned here! :smile:

Example output as of today (10/21/23) from when I ran the original script that prompted this issue:

(ipex_env) minbuntu@minbuntu:~/intel_pytorch_workspace/final_bench_scripts$ python3 bench_mixed-precision_vs_single-precision_pytorch_ipex_github-issue.py
No CUDA runtime is found, using CUDA_HOME='/usr'
Starting training loop, please be patient...
tensor(0.5982, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 1/400
tensor(0.9649, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 2/400
tensor(0.5115, device='xpu:0', grad_fn=<MseLossBackward0>)
Completed iteration 3/400
Training completed in 11.21572470664978 seconds :)
tensor(0.5115, device='xpu:0', grad_fn=<MseLossBackward0>)

List of pip packages I installed in order to attempt to reproduce this issue:

(ipex_env) minbuntu@minbuntu:~/intel_pytorch_workspace/final_bench_scripts$ pip list
Package                     Version
--------------------------- ------------------
astunparse                  1.6.3
attrs                       23.1.0
certifi                     2023.7.22
cffi                        1.16.0
charset-normalizer          3.3.0
cmake                       3.27.7
contourpy                   1.1.1
cycler                      0.12.1
dataclasses                 0.6
exceptiongroup              1.1.3
expecttest                  0.1.6
filelock                    3.12.4
fonttools                   4.43.1
future                      0.18.3
hypothesis                  6.88.1
idna                        3.4
iniconfig                   2.0.0
intel-extension-for-pytorch 2.0.110+git509a378
intel-openmp                2023.2.0
Jinja2                      3.1.2
kaldi-io                    0.9.8
kiwisolver                  1.4.5
MarkupSafe                  2.1.3
matplotlib                  3.8.0
mpmath                      1.3.0
networkx                    3.2
ninja                       1.11.1.1
numpy                       1.26.1
packaging                   23.2
Pillow                      10.1.0
pip                         23.3
pip-autoremove              0.10.0
pip-review                  1.3.0
pluggy                      1.3.0
psutil                      5.9.6
pycparser                   2.21
pyparsing                   3.1.1
pytest                      7.4.2
python-dateutil             2.8.2
PyYAML                      6.0.1
requests                    2.31.0
scipy                       1.11.3
setuptools                  68.2.2
six                         1.16.0
sortedcontainers            2.4.0
soundfile                   0.12.1
sympy                       1.12
tbb                         2021.10.0
tomli                       2.0.1
torch                       2.0.1a0+gite9ebda2
torchaudio                  2.0.2+31de77d
torchvision                 0.15.2a0+fa99a53
types-dataclasses           0.6.6
typing_extensions           4.8.0
urllib3                     2.0.7
wheel                       0.41.2

Since my device is a Tiger Lake iGPU, which isn't officially supported by the binaries directly provided by Intel, I had to build everything from scratch. Also, instead of running compile_bundle.sh and letting the entire build process happen automatically, I installed clang-13 and libclang-cpp13-dev from the Ubuntu repositories and ran each step from compile_bundle.sh manually, taking care to skip the steps that build LLVM-13 from scratch, and setting USE_LLVM to /usr/lib/llvm-13, LLVM_DIR to $USE_LLVM/lib/cmake/llvm, and USE_AOT_DEVLIST to tgllp,xe. Another thing I did differently from compile_bundle.sh was to just run source /opt/intel/oneapi/setvars.sh directly instead of doing whatever steps were originally done around these lines. Finally, the only major issue I ran into was with torchaudio, which was fixed by a workaround as detailed here.

I'd advise anyone who needs to compile everything from scratch like I did to create a virtualenv to build and install into, so as not to mix system pip packages up with the IPEX and related pip packages.

HOWEVER, attempting to run @xsacha's reproducer as detailed here now gives me this error:

(ipex_env) minbuntu@minbuntu:~/intel_pytorch_workspace$ python3 xsacha_error_reproducer.py
No CUDA runtime is found, using CUDA_HOME='/usr'
Traceback (most recent call last):
  File "/home/minbuntu/intel_pytorch_workspace/xsacha_error_reproducer.py", line 13, in <module>
    out = traced(inp)
  File "/home/minbuntu/ipex_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
NotImplementedError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: Could not run 'ipex_prepack::convolution_prepack' with arguments from the 'XPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'ipex_prepack::convolution_prepack' is only available for these backends: [CPU, BackendSelect, Python, FuncTorchDynamicLayerBackMode, Functionalize, Named, Conjugate, Negative, ZeroTensor, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradMPS, AutogradXPU, AutogradHPU, AutogradLazy, AutogradMeta, Tracer, AutocastCPU, AutocastXPU, AutocastCUDA, FuncTorchBatched, FuncTorchVmapMode, Batched, VmapMode, FuncTorchGradWrapper, PythonTLSSnapshot, FuncTorchDynamicLayerFrontMode, PythonDispatcher].

CPU: registered at /home/minbuntu/Documents/all_git/intel-extension-for-pytorch/csrc/cpu/jit/cpu/kernels/RegisterOpContextClass.cpp:133 [kernel]
BackendSelect: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:144 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/functorch/DynamicLayer.cpp:491 [backend fallback]
Functionalize: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/FunctionalizeFallbackKernel.cpp:280 [backend fallback]
Named: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/native/NegateFallback.cpp:19 [backend fallback]
ZeroTensor: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:63 [backend fallback]
AutogradOther: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:30 [backend fallback]
AutogradCPU: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:34 [backend fallback]
AutogradCUDA: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:42 [backend fallback]
AutogradXLA: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:46 [backend fallback]
AutogradMPS: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:54 [backend fallback]
AutogradXPU: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:38 [backend fallback]
AutogradHPU: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:67 [backend fallback]
AutogradLazy: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:50 [backend fallback]
AutogradMeta: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/VariableFallbackKernel.cpp:58 [backend fallback]
Tracer: registered at /home/minbuntu/Documents/all_git/pytorch/torch/csrc/autograd/TraceTypeManual.cpp:294 [backend fallback]
AutocastCPU: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/autocast_mode.cpp:487 [backend fallback]
AutocastXPU: fallthrough registered at /home/minbuntu/Documents/all_git/intel-extension-for-pytorch/csrc/gpu/aten/amp/autocast_mode.cpp:233 [backend fallback]
AutocastCUDA: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/autocast_mode.cpp:354 [backend fallback]
FuncTorchBatched: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/functorch/LegacyBatchingRegistrations.cpp:815 [backend fallback]
FuncTorchVmapMode: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/functorch/VmapModeRegistrations.cpp:28 [backend fallback]
Batched: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/LegacyBatchingRegistrations.cpp:1073 [backend fallback]
VmapMode: fallthrough registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]
FuncTorchGradWrapper: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/functorch/TensorWrapper.cpp:210 [backend fallback]
PythonTLSSnapshot: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:152 [backend fallback]
FuncTorchDynamicLayerFrontMode: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/functorch/DynamicLayer.cpp:487 [backend fallback]
PythonDispatcher: registered at /home/minbuntu/Documents/all_git/pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:148 [backend fallback]


(ipex_env) minbuntu@minbuntu:~/intel_pytorch_workspace$

While that error doesn't appear to be related to the original error(s) that prompted the creation of this issue, I'll leave this issue open just in case it somehow is. Otherwise, feel free to close this issue and mark it as resolved :slightly_smiling_face:.

tedliosu avatar Oct 21 '23 21:10 tedliosu