Int64 emulation for devices like the Raspberry Pi 5 (VideoCore VII GPU)
clvk seems to work well for running OpenCL workloads on the GPU of the Raspberry Pi 5, except that the VideoCore VII lacks Int64 support. This is a problem because Int64 is widely used in modern models, leading to errors like this one when trying to run an ONNX model with the tinygrad framework:

```
tinygrad.device.CompileError: OpenCL Compile Error

clvk-KBgEJn/source.cl:4:2: warning: no newline at end of file
    4 | }
      |  ^

Device does not support SPIR-V capability 11 (spv::CapabilityInt64).
```
Upon investigating the exact model I'm using, it seems it isn't actually taking advantage of the extra range of 64-bit integers; it just uses Int64 as a sort of default type for things like indices. This means the model should still run correctly, and quite efficiently, with 32-bit integers.
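To sanity-check that claim before downcasting, one can verify that every Int64 value in the model actually fits in the int32 range. A minimal stdlib-only sketch (the helper function is hypothetical, not part of tinygrad or onnx; `values` stands in for the flattened contents of an Int64 initializer):

```python
# int32 range bounds: a downcast is lossless only inside [INT32_MIN, INT32_MAX]
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def fits_in_int32(values):
    """Return True if every value would survive an Int64 -> Int32 downcast.

    Hypothetical helper: run this over each Int64 tensor in the model
    before rewriting its dtype.
    """
    return all(INT32_MIN <= v <= INT32_MAX for v in values)

# Index-like data is tiny compared to the int64 range:
print(fits_in_int32([0, 1, 4612, 5646]))   # -> True
# ...while a genuine 64-bit value would be caught:
print(fits_in_int32([2**40]))              # -> False
```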
To work around this I converted all the Int64 tensors in my model to Int32 and skipped the SPIR-V capability check (`CLVK_SKIP_SPIRV_CAPABILITY_CHECK=1`), which lets me progress somewhat, but I still hit errors like this for operations that tinygrad generates:
```
$ CLVK_SKIP_SPIRV_CAPABILITY_CHECK=1 DEBUG=4 DEVICE=OPENCL python tinygrad_opencl_compute.py
Using LLVM at 'libLLVM.so.19.1'
CLDevice: got 3 platforms and 1 devices
CLDevice: opening V3D 7.1.10.2 with version 3.0 CLVK on Vulkan v1.3.311 driver 104861705
CL: using CLCompiler
opened device CL from pid:1009
Using device: CL
scheduled 1 kernels in 3.25 ms
[... omitting many lines before the end ...]
*** CL 480 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 95.46us/ 29.77ms ( 0 GFLOPS 0|0 GB/s)
upcasting masked axis : 0
__kernel void E_3(__global long* data0_3, __global long* data1_4, __global int* data2_1) {
  int val0 = (*(data2_1+0));
  long val1 = (*(data1_4+0));
  long val2 = (*(data1_4+1));
  *(data0_3+0) = val1;
  *(data0_3+1) = val2;
  *(data0_3+2) = ((long)(val0));
}
*** CL 481 E_3 arg 3 mem 0.00 GB tm 114.17us/ 29.88ms ( 0 GFLOPS 0|0 GB/s) ['cat', 'tolist']
Cache miss for <Tensor <UOp CL (3,) long> on CL with grad None>
scheduled 3 kernels in 7.53 ms
*** DISK:/m 482 view 4 @ 4612 arg 2 mem 0.00 GB
*** CL 483 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 134.89us/ 30.02ms ( 0 GFLOPS 0|0 GB/s)
*** CL 484 E arg 2 mem 0.00 GB tm 113.74us/ 30.13ms ( 0 GFLOPS 0|0 GB/s) ['tolist']
Cache miss for <Tensor <UOp CL (1,) int> on CL with grad None>
scheduled 3 kernels in 7.32 ms
*** DISK:/m 485 view 4 @ 5646 arg 2 mem 0.00 GB
*** CL 486 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 148.11us/ 30.28ms ( 0 GFLOPS 0|0 GB/s)
*** CL 487 E arg 2 mem 0.00 GB tm 114.54us/ 30.39ms ( 0 GFLOPS 0|0 GB/s) ['tolist']
Cache miss for <Tensor <UOp CL (1,) int> on CL with grad None>
scheduled 2 kernels in 5.16 ms
*** CL 488 copy 8, CL <- PYTHON arg 2 mem 0.00 GB tm 247.46us/ 30.64ms ( 0 GFLOPS 0|0 GB/s)
__kernel void En3(__global long* data0_1, __global long* data1_1) {
  long val0 = (*(data1_1+0));
  *(data0_1+0) = val0;
}
*** CL 489 En3 arg 2 mem 0.00 GB tm 107.24us/ 30.75ms ( 0 GFLOPS 0|0 GB/s) ['tolist']
Cache miss for <Tensor <UOp CL (1,) long> on CL with grad None>
scheduled 8 kernels in 29.64 ms
*** DISK:/m 490 view 4 @ 6321 arg 2 mem 0.00 GB
*** CL 491 copy 24, CL <- PYTHON arg 2 mem 0.00 GB tm 175.24us/ 30.92ms ( 0 GFLOPS 0|0 GB/s)
*** DISK:/m 492 view 4 @ 5017 arg 2 mem 0.00 GB
*** DISK:/m 493 view 4 @ 5171 arg 2 mem 0.00 GB
*** CL 494 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 109.39us/ 31.03ms ( 0 GFLOPS 0|0 GB/s)
*** CL 495 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 82.56us/ 31.11ms ( 0 GFLOPS 0|0 GB/s)
*** CL 496 copy 4, CL <- DISK:/m arg 2 mem 0.00 GB tm 76.20us/ 31.19ms ( 0 GFLOPS 0|0 GB/s)
upcasting masked axis : 0
__kernel void E_3n1(__global long* data0_3, __global int* data1_1, __global long* data2_3, __global int* data3_1, __global int* data4_1) {
  int val0 = (*(data1_1+0));
  int val1 = (*(data3_1+0));
  int val2 = (*(data4_1+0));
  long val3 = (*(data2_3+0));
  long cast0 = ((long)(val0));
  long cast1 = ((long)(val1));
  long cast2 = ((long)(val2));
  long cast3 = ((long)(-val0));
  long alu0 = ((cast1!=cast3)?cast1:cast0);
  *(data0_3+1) = alu0;
  long alu2 = ((cast2!=cast3)?cast2:cast0);
  *(data0_3+2) = alu2;
  long alu4 = ((val3!=cast3)?val3:cast0);
  *(data0_3+0) = alu4;
}
unknown NIR ALU inst: 64 %17 = i2i64 %10
Aborted
```
Anyway, I'm not sure where it would be best to implement 64-bit emulation (or whether that's viable at all), or what other ways there might be around this problem, but I'm opening this issue as a starting point for discussion!
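For what it's worth, the instruction the driver aborts on, `i2i64` (sign-extension of a 32-bit int to 64 bits), is straightforward to emulate with 32-bit operations: a 64-bit value becomes a pair of 32-bit words. A pure-Python sketch of the word-pair lowering a compiler pass could emit (the function names are mine, not from clvk or Mesa):

```python
# Hypothetical model of Int64 emulation on 32-bit-only hardware:
# a 64-bit value is represented as a (lo, hi) pair of uint32 words.

MASK32 = 0xFFFFFFFF

def i2i64(x32):
    """Sign-extend a 32-bit int to an emulated 64-bit (lo, hi) pair.

    This is the operation the NIR error above chokes on: the high
    word is just the sign bit of the input replicated 32 times.
    """
    lo = x32 & MASK32
    hi = MASK32 if (x32 & 0x80000000) else 0
    return (lo, hi)

def add64(a, b):
    """64-bit add built from two 32-bit adds plus a carry."""
    s = a[0] + b[0]
    lo = s & MASK32
    carry = s >> 32                       # 0 or 1
    hi = (a[1] + b[1] + carry) & MASK32
    return (lo, hi)

def to_python_int(v):
    """Reassemble a signed Python int from the (lo, hi) pair."""
    n = (v[1] << 32) | v[0]
    return n - (1 << 64) if n >> 63 else n
```

With this representation, `to_python_int(add64(i2i64(-5), i2i64(7)))` yields `2`, and `0x7FFFFFFF + 1` correctly produces `0x80000000` instead of wrapping, which is exactly the behavior the `E_3n1` kernel above needs from its `(long)` casts and comparisons.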
Regards, Simen
How did you get it to run? Did you use extra flags or change the code? You said a few days ago that you were having problems with all workloads.