pytorch_dlprim Error in python tests/validate_network.py --device privateuseone:1

Tested on your original code

Testing  resnet18
Accessing device #1:AMD Radeon R9 Fury Series (radeonsi, fiji, LLVM 17.0.6, DRM 3.54, 6.6.12-calculate) on rusticl
LLVM ERROR: Cannot select: 0x7f3c70430b30: f32 = and 0x7f3c70424cc0, Constant:i32<2147483647>
  0x7f3c70424cc0: f32 = bitcast 0x7f3c7042ae70
    0x7f3c7042ae70: i32 = llvm.amdgcn.wwm TargetConstant:i64<2662>, 0x7f3c70424b00
      0x7f3c70430970: i64 = TargetConstant<2662>
      0x7f3c70424b00: i32 = llvm.amdgcn.readlane TargetConstant:i64<2528>, 0x7f3c7042bc00, Constant:i32<63>
        0x7f3c704254a0: i64 = TargetConstant<2528>
        0x7f3c7042bc00: i32,ch,glue = CopyFromReg # D:1 0x7f3c70425350, Register:i32 %367, 0x7f3c70425350:1
          0x7f3c70424da0: i32 = Register %367
          0x7f3c70425350: ch,glue = inlineasm # D:1 0x7f3c70424e10, TargetExternalSymbol:i64'; 4', MDNode:ch<null>, TargetConstant:i64<1>, TargetConstant:i32<1769482>, Register:i32 %367, TargetConstant:i32<-2147483639>, Register:i32 %368, 0x7f3c70424e10:1
            0x7f3c70424f60: i64 = TargetExternalSymbol'; 4'
            0x7f3c704303c0: i64 = TargetConstant<1>
            0x7f3c70424a20: i32 = TargetConstant<1769482>
            0x7f3c70424da0: i32 = Register %367
            0x7f3c704252e0: i32 = TargetConstant<-2147483639>
            0x7f3c7042b500: i32 = Register %368
            0x7f3c70424e10: ch,glue = CopyToReg # D:1 0x7f3c70430a50:1, Register:i32 %368, 0x7f3c7042b5e0
              0x7f3c7042b500: i32 = Register %368
              0x7f3c7042b5e0: i32 = bitcast # D:1 0x7f3c70424b70
                0x7f3c70424b70: f32 = fadd # D:1 0x7f3c704309e0, 0x7f3c7042b730
                  0x7f3c704309e0: f32 = fadd # D:1 0x7f3c70430200, 0x7f3c70425040


                  0x7f3c7042b730: f32 = bitcast # D:1 0x7f3c7042b0a0

        0x7f3c7042bb20: i32 = Constant<63>
  0x7f3c70430ac0: i32 = Constant<2147483647>
In function: main
Emergency stop

Apr 03 '24 21:04 sukamenev

Is it a 32- or 64-bit architecture? We need to track down which kernel fails.
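
One hint from the dump: the constant 2147483647 in the node that LLVM cannot select is 0x7FFFFFFF, the mask that clears the sign bit of a 32-bit float. So the failing node looks like a bitwise absolute value applied to the result of a `readlane`. That reading is an assumption based only on the dump above, and it does not yet name the kernel. A minimal sketch of what that mask does (the helper names are hypothetical, for illustration only):

```python
import struct

# Constant taken from the LLVM dump: 2147483647 == 0x7FFFFFFF.
ABS_MASK = 2147483647


def f32_to_bits(value):
    """Reinterpret a Python float as the 32-bit pattern of an IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_f32(bits):
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fabs_via_mask(value):
    """Clear the sign bit, which is what `and x, 0x7FFFFFFF` does to a float32."""
    return bits_to_f32(f32_to_bits(value) & ABS_MASK)
```

Calling `fabs_via_mask(-1.5)` clears only the top bit of the float32 pattern, so kernels that take an absolute value after a cross-lane reduction may be the first ones worth checking.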

Apr 04 '24 08:04 artyom-beilis

I also suggest trying the official AMD drivers, not only Mesa.

I recall that for the AMD 560 the closed-source drivers worked much better than the Mesa ones. Also check whether the ROCm drivers still work on Fiji; they are also better.

Apr 04 '24 09:04 artyom-beilis

Is it a 32- or 64-bit architecture? We need to track down which kernel fails.

My CPU has a 64-bit architecture. For GCN 3 (Fiji), I don't know what the bit width of the architecture is.

Quote from AMD docs:

Every instruction is described with either 32 bits or 64 bits of microcode.
• Vector Memory instructions are 64 bits.
• Exports are 64 bits.
• LDS and GDS are 64 bits.
• Scalar ALU instructions are 32 bits but can have an additional 32 bits of literal constant data.
• Vector ALU instructions can be 32 bits or 64 bits. The 32-bit versions can have an additional 32 bits of literal constant data.
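
The quote above is about instruction encoding width, which is a different thing from the pointer width the OpenCL device reports. If the question is about the device side, the `CL_DEVICE_ADDRESS_BITS` value can be read per device. A minimal sketch, assuming the third-party `pyopencl` package is installed (it is not part of this project, and the function name is hypothetical):

```python
def address_bits_by_device(platforms=None):
    """Return a list of (platform name, device name, address bits) tuples.

    If `platforms` is None, the platforms are queried through pyopencl,
    which is an assumption: it must be installed separately.
    """
    if platforms is None:
        import pyopencl as cl  # third-party; imported lazily on purpose
        platforms = cl.get_platforms()

    rows = []
    for platform in platforms:
        for device in platform.get_devices():
            # device.address_bits corresponds to CL_DEVICE_ADDRESS_BITS
            rows.append((platform.name.strip(), device.name.strip(), device.address_bits))
    return rows
```

Running `for row in address_bits_by_device(): print(row)` should list every device on both the rusticl and the AMD Accelerated Parallel Processing platforms, so the two drivers can be compared directly.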

Apr 04 '24 12:04 sukamenev

The AMD OpenCL driver from amdgpu-pro also gives an error:

python tests/validate_network.py --device privateuseone:3
Testing  resnet18
Accessing device #3:Fiji on AMD Accelerated Parallel Processing
Traceback (most recent call last):
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 280, in <module>
    main(r)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 221, in main
    train_on_images(m,batch,args.device,args.eval,iter_size = args.iter_size,opt_steps = args.opt,fwd=args.fwd)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 105, in train_on_images
    ref = step(model,data,labels,opt_steps,iter_size,fwd=fwd,test=test)
  File "/home/inetstar/Kamenev/programming/ZenDnn/pytorch_dlprim/tests/validate_network.py", line 85, in step
    loss.backward()
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/inetstar/Kamenev/programming/ZenDnn/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: could not create a primitive descriptor iterator

Apr 04 '24 14:04 sukamenev

I also suggest trying the official AMD drivers, not only Mesa.

I recall that for the AMD 560 the closed-source drivers worked much better than the Mesa ones. Also check whether the ROCm drivers still work on Fiji; they are also better.

Thank you! I got an 8-9% speed improvement with the amdgpu-pro OpenCL drivers.

Apr 04 '24 14:04 sukamenev