Int64 emulation for devices like the Raspberry Pi 5 (VideoCore VII GPU)
clvk seems to work well for running OpenCL workloads on the GPU of the Raspberry Pi 5, except that the VideoCore VII lacks Int64 support. This is a problem because Int64 is widely used in modern models, leading to errors like this one when trying to run an ONNX model with the tinygrad framework:

```
tinygrad.device.CompileError: OpenCL Compile Error

clvk-KBgEJn/source.cl:4:2: warning: no newline at end of file
    4 | }
      |  ^

Device does not support SPIR-V capability 11 (spv::CapabilityInt64).
```
Upon investigating the exact model I'm using, it seems it isn't actually taking advantage of the extra range of 64-bit integers; it just uses Int64 as a sort of default type for things like indices. This means the model should still run correctly, and quite efficiently, with 32-bit integers.
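To sanity-check that claim before downcasting, one can verify that every Int64 value in the model actually fits in the int32 range. A minimal stdlib-only sketch (the helper function is hypothetical, not part of tinygrad or onnx; `values` stands in for the flattened contents of an Int64 initializer):

```python
# int32 range bounds: a downcast is lossless only inside [INT32_MIN, INT32_MAX]
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def fits_in_int32(values):
    """Return True if every value would survive an Int64 -> Int32 downcast.

    Hypothetical helper: run this over each Int64 tensor in the model
    before rewriting its dtype.
    """
    return all(INT32_MIN <= v <= INT32_MAX for v in values)

# Index-like data is tiny compared to the int64 range:
print(fits_in_int32([0, 1, 4612, 5646]))   # -> True
# ...while a genuine 64-bit value would be caught:
print(fits_in_int32([2**40]))              # -> False
```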
To work around this I converted all the Int64 tensors in my model to Int32 and skipped the SPIR-V capability check (`CLVK_SKIP_SPIRV_CAPABILITY_CHECK=1`), which lets me progress somewhat, but I still hit errors like this for operations that tinygrad generates:
```
$ CLVK_SKIP_SPIRV_CAPABILITY_CHECK=1 DEBUG=4 DEVICE=OPENCL python tinygrad_opencl_compute.py
Using LLVM at 'libLLVM.so.19.1'
CLDevice: got 3 platforms and 1 devices
CLDevice: opening V3D 7.1.10.2 with version 3.0 CLVK on Vulkan v1.3.311 driver 104861705
CL: using CLCompiler
opened device CL from pid:1009
Using device: CL
scheduled 1 kernels in 3.25 ms
[... omitting many lines before the end ...]
*** CL 480 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 95.46us/ 29.77ms ( 0 GFLOPS 0|0 GB/s)
upcasting masked axis : 0
__kernel void E_3(__global long* data0_3, __global long* data1_4, __global int* data2_1) {
  int val0 = (*(data2_1+0));
  long val1 = (*(data1_4+0));
  long val2 = (*(data1_4+1));
  *(data0_3+0) = val1;
  *(data0_3+1) = val2;
  *(data0_3+2) = ((long)(val0));
}
*** CL 481 E_3 arg 3 mem 0.00 GB tm 114.17us/ 29.88ms ( 0 GFLOPS 0|0 GB/s) ['cat', 'tolist']
Cache miss for <Tensor <UOp CL (3,) long> on CL with grad None>
scheduled 3 kernels in 7.53 ms
*** DISK:/m 482 view 4 @ 4612 arg 2 mem 0.00 GB
*** CL 483 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 134.89us/ 30.02ms ( 0 GFLOPS 0|0 GB/s)
*** CL 484 E arg 2 mem 0.00 GB tm 113.74us/ 30.13ms ( 0 GFLOPS 0|0 GB/s) ['tolist']
Cache miss for <Tensor <UOp CL (1,) int> on CL with grad None>
scheduled 3 kernels in 7.32 ms
*** DISK:/m 485 view 4 @ 5646 arg 2 mem 0.00 GB
*** CL 486 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 148.11us/ 30.28ms ( 0 GFLOPS 0|0 GB/s)
*** CL 487 E arg 2 mem 0.00 GB tm 114.54us/ 30.39ms ( 0 GFLOPS 0|0 GB/s) ['tolist']
Cache miss for <Tensor <UOp CL (1,) int> on CL with grad None>
scheduled 2 kernels in 5.16 ms
*** CL 488 copy 8, CL <- PYTHON arg 2 mem 0.00 GB tm 247.46us/ 30.64ms ( 0 GFLOPS 0|0 GB/s)
__kernel void En3(__global long* data0_1, __global long* data1_1) {
  long val0 = (*(data1_1+0));
  *(data0_1+0) = val0;
}
*** CL 489 En3 arg 2 mem 0.00 GB tm 107.24us/ 30.75ms ( 0 GFLOPS 0|0 GB/s) ['tolist']
Cache miss for <Tensor <UOp CL (1,) long> on CL with grad None>
scheduled 8 kernels in 29.64 ms
*** DISK:/m 490 view 4 @ 6321 arg 2 mem 0.00 GB
*** CL 491 copy 24, CL <- PYTHON arg 2 mem 0.00 GB tm 175.24us/ 30.92ms ( 0 GFLOPS 0|0 GB/s)
*** DISK:/m 492 view 4 @ 5017 arg 2 mem 0.00 GB
*** DISK:/m 493 view 4 @ 5171 arg 2 mem 0.00 GB
*** CL 494 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 109.39us/ 31.03ms ( 0 GFLOPS 0|0 GB/s)
*** CL 495 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 82.56us/ 31.11ms ( 0 GFLOPS 0|0 GB/s)
*** CL 496 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 76.20us/ 31.19ms ( 0 GFLOPS 0|0 GB/s)
upcasting masked axis : 0
__kernel void E_3n1(__global long* data0_3, __global int* data1_1, __global long* data2_3, __global int* data3_1, __global int* data4_1) {
  int val0 = (*(data1_1+0));
  int val1 = (*(data3_1+0));
  int val2 = (*(data4_1+0));
  long val3 = (*(data2_3+0));
  long cast0 = ((long)(val0));
  long cast1 = ((long)(val1));
  long cast2 = ((long)(val2));
  long cast3 = ((long)(-val0));
  long alu0 = ((cast1!=cast3)?cast1:cast0);
  *(data0_3+1) = alu0;
  long alu2 = ((cast2!=cast3)?cast2:cast0);
  *(data0_3+2) = alu2;
  long alu4 = ((val3!=cast3)?val3:cast0);
  *(data0_3+0) = alu4;
}
unknown NIR ALU inst: 64 %17 = i2i64 %10
Aborted
```
Anyway, I'm not sure where it would be best to implement 64-bit emulation (or whether that's viable at all), or what other ways there might be around this problem, but I'm opening this issue as a starting point for discussion!
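For what it's worth, the instruction the driver aborts on, `i2i64` (sign-extension of a 32-bit int to 64 bits), is straightforward to emulate with 32-bit operations: a 64-bit value becomes a pair of 32-bit words. A pure-Python sketch of the word-pair lowering a compiler pass could emit (the function names are mine, not from clvk or Mesa):

```python
# Hypothetical model of Int64 emulation on 32-bit-only hardware:
# a 64-bit value is represented as a (lo, hi) pair of uint32 words.

MASK32 = 0xFFFFFFFF

def i2i64(x32):
    """Sign-extend a 32-bit int to an emulated 64-bit (lo, hi) pair.

    This is the operation the NIR error above chokes on: the high
    word is just the sign bit of the input replicated 32 times.
    """
    lo = x32 & MASK32
    hi = MASK32 if (x32 & 0x80000000) else 0
    return (lo, hi)

def add64(a, b):
    """64-bit add built from two 32-bit adds plus a carry."""
    s = a[0] + b[0]
    lo = s & MASK32
    carry = s >> 32                       # 0 or 1
    hi = (a[1] + b[1] + carry) & MASK32
    return (lo, hi)

def to_python_int(v):
    """Reassemble a signed Python int from the (lo, hi) pair."""
    n = (v[1] << 32) | v[0]
    return n - (1 << 64) if n >> 63 else n
```

With this representation, `to_python_int(add64(i2i64(-5), i2i64(7)))` yields `2`, and `0x7FFFFFFF + 1` correctly produces `0x80000000` instead of wrapping, which is exactly the behavior the `E_3n1` kernel above needs from its `(long)` casts and comparisons.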
Regards, Simen
How did you get it to run? Did you use extra flags or change the code? You said a few days ago that you were having problems with all workloads.