pytorch_dlprim Strange error in test: Diff too big

On the OpenCL CPU device. After updating the OpenCL runtime I see a different error, similar to the error in another test script:

Mean 1d
Accessing device #0:AMD EPYC 7542 32-Core Processor                 on Intel(R) CPU Runtime for OpenCL(TM) Applications
torch.Size([1, 3, 4])
torch.Size([1, 3, 4])
         y 0.000000
        x0 0.000000
Mean 2d
torch.Size([2, 1, 1])
torch.Size([2, 1, 1])
         y 0.000000
        x0 0.000000
Mean 1d squeeze
torch.Size([3, 4])
torch.Size([3, 4])
         y 0.000000
        x0 0.000000
Mean 2d squeeze
torch.Size([3])
torch.Size([3])
         y 0.000000
        x0 0.000000
Mean all squeeze
torch.Size([])
torch.Size([])
         y 0.000000
        x0 0.000000
Sum 1d
torch.Size([1, 3, 4])
torch.Size([1, 3, 4])
         y 0.000000
        x0 0.000000
Sum 2d
torch.Size([2, 1, 1])
torch.Size([2, 1, 1])
         y 0.000000
        x0 0.000000
Sum 1d squeeze
torch.Size([3, 4])
torch.Size([3, 4])
         y 0.000000
        x0 0.000000
Sum 2d squeeze
torch.Size([3])
torch.Size([3])
         y 0.000000
        x0 0.000000
LogSoftmax
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
LogSoftmax
torch.Size([4, 3, 5])
torch.Size([4, 3, 5])
         y 0.000000
        x0 0.000000
Softmax
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
NLLLoss
torch.Size([])
torch.Size([])
tensor(0.0413, grad_fn=<NllLossBackward0>)
tensor(0.0418)
         y 0.000469
        x0 0.000000
AAPool2d
torch.Size([4, 8, 1, 1])
torch.Size([4, 8, 1, 1])
         y 0.000000
        x0 0.000000
Abs
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Abs_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Hardtanh
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Hardtanh_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Sigmoid
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
Sigmoid_
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
Hardsigmoid
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Hardsigmoid_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
ReLU
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
ReLU_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
LReLu
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
LReLU_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Tanh
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Tanh_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
SiLU
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
SiLU_
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
GELU
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
GELU tanh
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
BCE Loss
torch.Size([])
torch.Size([])
        x0 0.000001
        x1 0.000000
         y 0.000000
BCE Loss no reduction
torch.Size([4, 3, 5])
torch.Size([4, 3, 5])
        x0 0.000001
         y 0.000000
        x1 0.000000
MSE Loss
torch.Size([])
torch.Size([])
         y 0.000000
        x0 0.000000
        x1 0.000000
MSE Loss no reduction
torch.Size([4, 3, 5])
torch.Size([4, 3, 5])
         y 0.000000
        x0 0.000000
        x1 0.000000
Min
Ok
Max
Ok
Dot
Ok
Clamp 1
Ok
Clamp 2
Ok
Clamp 3
Ok
Linear 2d
    p_bias 0.000000
         y 0.000000
        x0 0.000000
  p_weight 0.000000
Linear 3d
    p_bias 0.000000
         y 0.000000
        x0 0.000000
  p_weight 0.000000
Conv
Traceback (most recent call last):
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim_orig/tests/test_op.py", line 282, in <module>
    test_all(r.device)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim_orig/tests/test_op.py", line 254, in test_all
    test_fwd_bwd_op([([2,6,10,20],-1)],torch.nn.Conv2d(6,8,[3,5],stride=[1,2],padding=[1,2],dilation=1,groups=2),device)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim_orig/tests/test_op.py", line 74, in test_fwd_bwd_op
    y_cpu.backward(dy_cpu,retain_graph=True)
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: could not create a primitive descriptor iterator

Apr 03 '24 22:04 sukamenev

On AMD OpenCL (AMDAPPSDK-3.0) I get another error:

python tests/test_op.py --device privateuseone:2
Mean 1d
Accessing device #2:AMD EPYC 7542 32-Core Processor on AMD Accelerated Parallel Processing
torch.Size([1, 3, 4])
torch.Size([1, 3, 4])
tensor([[[-0.2863, -0.1444,  1.4827, -0.2142],
         [ 0.9526, -1.2787,  0.7404, -0.3989],
         [ 0.8163,  0.2142,  0.2852,  0.8597]]], grad_fn=<MeanBackward1>)
tensor([[[1.4019, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000]]])
         y 1.688240
        x0 0.000000
Traceback (most recent call last):
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim_orig/tests/test_op.py", line 282, in <module>
    test_all(r.device)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim_orig/tests/test_op.py", line 158, in test_all
    test_fwd_bwd([([2,3,4],-1)],lambda x:torch.mean(x,dim=0,keepdim=True),device)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim_orig/tests/test_op.py", line 153, in test_fwd_bwd
    raise Exception("Diff too big")
Exception: Diff too big

max_diff = 1.9810690879821777
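
For reference, the `y 1.688240` figure in the log above matches the largest element-wise absolute difference between the two printed tensors. Here is a minimal pure-Python sketch of that kind of check, using the printed (rounded) values. The helper name `max_abs_diff` and the `1e-3` tolerance are assumptions for illustration and are not taken from `tests/test_op.py`:

```python
# Values copied from the two tensors printed in the log above, with the
# leading dimension of size 1 dropped. They are assumed here to be the
# reference output and the OpenCL device output, in that order.
ref = [
    [-0.2863, -0.1444, 1.4827, -0.2142],
    [0.9526, -1.2787, 0.7404, -0.3989],
    [0.8163, 0.2142, 0.2852, 0.8597],
]
dev = [
    [1.4019, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]


def max_abs_diff(a, b):
    """Return the largest element-wise absolute difference of two 2-D lists."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


TOLERANCE = 1e-3  # hypothetical threshold, not the value used by the real test

diff = max_abs_diff(ref, dev)
exceeds_tolerance = diff > TOLERANCE
print(f"y {diff:.4f}")                  # largest difference, from element [0][0]
print("Diff too big" if exceeds_tolerance else "Ok")
```

Only the first element of the device tensor is non-zero, so the difference is dominated by the positions where the device returned 0 instead of the expected value.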

Apr 03 '24 22:04 sukamenev

On AMD OpenCL (from amdgpu-pro) there is also an error at the end of the test:

Mean 1d
Accessing device #3:Fiji on AMD Accelerated Parallel Processing
torch.Size([1, 3, 4])
torch.Size([1, 3, 4])
         y 0.000000
        x0 0.000000
Mean 2d
torch.Size([2, 1, 1])
torch.Size([2, 1, 1])
        x0 0.000000
         y 0.000000
Mean 1d squeeze
torch.Size([3, 4])
torch.Size([3, 4])
         y 0.000000
        x0 0.000000
Mean 2d squeeze
torch.Size([3])
torch.Size([3])
         y 0.000000
        x0 0.000000
Mean all squeeze
torch.Size([])
torch.Size([])
         y 0.000000
        x0 0.000000
Sum 1d
torch.Size([1, 3, 4])
torch.Size([1, 3, 4])
         y 0.000000
        x0 0.000000
Sum 2d
torch.Size([2, 1, 1])
torch.Size([2, 1, 1])
         y 0.000000
        x0 0.000000
Sum 1d squeeze
torch.Size([3, 4])
torch.Size([3, 4])
         y 0.000000
        x0 0.000000
Sum 2d squeeze
torch.Size([3])
torch.Size([3])
         y 0.000000
        x0 0.000000
LogSoftmax
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
LogSoftmax
torch.Size([4, 3, 5])
torch.Size([4, 3, 5])
        x0 0.000000
         y 0.000000
Softmax
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
NLLLoss
torch.Size([])
torch.Size([])
         y 0.000000
        x0 0.000000
AAPool2d
torch.Size([4, 8, 1, 1])
torch.Size([4, 8, 1, 1])
         y 0.000000
        x0 0.000000
Abs
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Abs_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Hardtanh
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Hardtanh_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Sigmoid
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Sigmoid_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Hardsigmoid
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Hardsigmoid_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
ReLU
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
ReLU_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
LReLu
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
LReLU_
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
Tanh
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
Tanh_
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
SiLU
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
SiLU_
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
GELU
torch.Size([4, 3])
torch.Size([4, 3])
         y 0.000000
        x0 0.000000
GELU tanh
torch.Size([4, 3])
torch.Size([4, 3])
        x0 0.000000
         y 0.000000
BCE Loss
torch.Size([])
torch.Size([])
        x0 0.000058
         y 0.000000
        x1 0.000000
BCE Loss no reduction
torch.Size([4, 3, 5])
torch.Size([4, 3, 5])
        x0 0.000008
         y 0.000000
        x1 0.000000
MSE Loss
torch.Size([])
torch.Size([])
         y 0.000000
        x0 0.000000
        x1 0.000000
MSE Loss no reduction
torch.Size([4, 3, 5])
torch.Size([4, 3, 5])
         y 0.000000
        x0 0.000000
        x1 0.000000
Min
Ok
Max
Ok
Dot
Ok
Clamp 1
Ok
Clamp 2
Ok
Clamp 3
Ok
Linear 2d
  p_weight 0.000000
    p_bias 0.000000
         y 0.000000
        x0 0.000000
Linear 3d
  p_weight 0.000002
    p_bias 0.000000
         y 0.000000
        x0 0.000000
Conv
Traceback (most recent call last):
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/test_op.py", line 282, in <module>
    test_all(r.device)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/test_op.py", line 254, in test_all
    test_fwd_bwd_op([([2,6,10,20],-1)],torch.nn.Conv2d(6,8,[3,5],stride=[1,2],padding=[1,2],dilation=1,groups=2),device)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/test_op.py", line 74, in test_fwd_bwd_op
    y_cpu.backward(dy_cpu,retain_graph=True)
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: could not create a primitive descriptor iterator

Apr 04 '24 14:04 sukamenev

Sorry for the late reply... For some reason I missed it.

Which PyTorch version and which GPU are you using?

Aug 08 '24 03:08 artyom-beilis

I'm using PyTorch version 1.13.1 and an AMD Fury.

Mean 1d
Accessing device #1:AMD Radeon R9 Fury Series (radeonsi, fiji, LLVM 17.0.6, DRM 3.57, 6.8.9-calculate) on rusticl
.......
Sum 2d squeeze
torch.Size([3])
torch.Size([3])
         y 0.000000
        x0 0.000000
LogSoftmax
LLVM ERROR: Cannot select: 0x7feb044c5610: f32 = and 0x7feb044c54c0, Constant:i32<2147483647>
  0x7feb044c54c0: f32 = bitcast 0x7feb040d5410
    0x7feb040d5410: i32,ch = CopyFromReg 0x5562cb846890, Register:i32 %14
      0x7feb044a2570: i32 = Register %14
  0x7feb044c3120: i32 = Constant<2147483647>
In function: main
Aborted
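
The constant in the failing node, 2147483647, is 0x7FFFFFFF, so the `and` applied to the bitcast value looks like the usual sign-bit mask for a 32-bit float absolute value. That reading is an interpretation of the dump, not something confirmed in this thread. A small Python sketch of the same bit operation (the function name `fabs_via_mask` is made up for illustration):

```python
import struct

SIGN_MASK = 2147483647  # 0x7FFFFFFF: every bit set except the float32 sign bit


def fabs_via_mask(x):
    """Clear the sign bit of x's float32 bit pattern and return the result."""
    # Reinterpret the float32 as an unsigned 32-bit integer (a "bitcast").
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Mask off the sign bit, then reinterpret the integer as a float32 again.
    return struct.unpack("<f", struct.pack("<I", bits & SIGN_MASK))[0]


print(hex(SIGN_MASK))
print(fabs_via_mask(-1.5), fabs_via_mask(2.25))
```

If that reading is right, the backend failed to lower an ordinary absolute-value pattern, which would point at the driver's compiler rather than at the test itself.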

Aug 15 '24 18:08 sukamenev

First, CPU is not supported.

on rusticl

In my experience rusticl is horribly buggy. It crashes for me on an RX 560. Try the AMD ROCm OpenCL driver or the Mesa driver.

Aug 16 '24 03:08 artyom-beilis