TornadoVM icon indicating copy to clipboard operation
TornadoVM copied to clipboard

FPGA result is not correct in JDK 22 branch, due to casting of the address type

Open stratika opened this issue 7 months ago • 0 comments

Describe the bug

I tried the experimental branch of TornadoVM working with JDK 22, and noticed that the result of the generated OpenCL kernels for FPGAs are not result.

How To Reproduce

I have built TornadoVM as follows:

git checkout feat/jdk22_2nd_iteration
./bin/tornadovm-installer --jdk sapmachine-jdk-22 --backend opencl && source setvars.sh
tornado --version

returns

version=1.0.7-dev
branch=feat/jdk22_2nd_iteration
commit=dec5a87

Backends installed: 
	 - opencl

and I have used the oneAPI base toolkit to see also the Intel FPGA in emulation mode:

tornado --devices
WARNING: Using incubator modules: jdk.incubator.vector

Number of Tornado drivers: 1
Driver: OpenCL
  Total number of OpenCL devices  : 4
  Tornado device=0:0  (DEFAULT)
	OPENCL --  [NVIDIA CUDA] -- NVIDIA RTX A2000 8GB Laptop GPU
		Global Memory Size: 7.8 GB
		Local Memory Size: 48.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [1024]
		Max WorkGroup Configuration: [1024, 1024, 64]
		Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:1
	OPENCL --  [Intel(R) OpenCL HD Graphics] -- Intel(R) Graphics [0x46a6]
		Global Memory Size: 24.8 GB
		Local Memory Size: 64.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [512]
		Max WorkGroup Configuration: [512, 512, 512]
		Device OpenCL C version: OpenCL C 1.2

  Tornado device=0:2
	OPENCL --  [Intel(R) OpenCL] -- 12th Gen Intel(R) Core(TM) i9-12900H
		Global Memory Size: 31.0 GB
		Local Memory Size: 32.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [8192]
		Max WorkGroup Configuration: [8192, 8192, 8192]
		Device OpenCL C version: OpenCL C 3.0

  Tornado device=0:3
	OPENCL --  [Intel(R) FPGA Emulation Platform for OpenCL(TM)] -- Intel(R) FPGA Emulation Device
		Global Memory Size: 31.0 GB
		Local Memory Size: 256.0 KB
		Workgroup Dimensions: 3
		Total Number of Block Threads: [67108864]
		Max WorkGroup Configuration: [67108864, 67108864, 67108864]
		Device OpenCL C version: OpenCL C 1.2

Then, I ran the DFTMT.java class from the dynamic examples (uk.ac.manchester.tornado.examples.dynamic) on the FPGA emulation platform, and got the output as follows:

tornado --threadInfo --printKernel --jvm="-Ds0.t0.device=0:3" -m tornado.examples/uk.ac.manchester.tornado.examples.dynamic.DFTMT --params="1024 default 1"
WARNING: Using incubator modules: jdk.incubator.vector
Initialization time:  2021238628 ns

Version running: default ! 
#pragma OPENCL EXTENSION cl_khr_fp64 : enable  
#pragma OPENCL EXTENSION cl_khr_fp16 : enable  
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void computeDFT(__global long *_kernel_context, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics, __global uchar *inreal, __global uchar *inimag, __global uchar *outreal, __global uchar *outimag, __global uchar *inputSize)
{
  int i_4, i_36, i_5, i_6, i_10, i_11, i_12, i_31, i_32, i_33; 
  ulong ul_13, ul_0, ul_15, ul_2, ul_34, ul_1, ul_3, ul_35; 
  float f_21, f_22, f_23, f_24, f_17, f_18, f_19, f_20, f_14, f_16, f_9, f_7, f_8, f_29, f_30, f_25, f_26, f_27, f_28; 

  // BLOCK 0
  ul_0  =  (ulong) inreal;
  ul_1  =  (ulong) inimag;
  ul_2  =  (ulong) outreal;
  ul_3  =  (ulong) outimag;
  i_4  =  get_global_size(0);
  i_5  =  get_global_id(0);
  // BLOCK 1 MERGES [0 5 ]
  i_6  =  i_5;
  // BLOCK 2
  f_7  =  (float) i_6;
  // BLOCK 3 MERGES [2 4 ]
  f_8  =  0.0F;
  f_9  =  0.0F;
  i_10  =  0;
  #pragma unroll 4
  for(;i_10 < 1024;)
  {
    // BLOCK 4
    i_11  =  i_10 + 6;
    i_12  =  i_11 << 2;
    ul_13  =  ul_0 + i_12;
    f_14  =  *((__global float *) ul_13);
    ul_15  =  ul_1 + i_12;
    f_16  =  *((__global float *) ul_15);
    f_17  =  *((__global float *) ul_13);
    f_18  =  *((__global float *) ul_15);
    f_19  =  (float) i_10;
    f_20  =  f_19 * 6.2831855F;
    f_21  =  f_20 * f_7;
    f_22  =  f_21 / 1024.0F;
    f_23  =  native_sin(f_22);
    f_24  =  native_cos(f_22);
    f_25  =  f_24 * f_18;
    f_26  =  fma(f_23, f_17, f_25);
    f_27  =  f_9 - f_26;
    f_28  =  f_16 * f_23;
    f_29  =  fma(f_24, f_14, f_28);
    f_30  =  f_8 + f_29;
    i_31  =  i_10 + 1;
    f_8  =  f_30;
    f_9  =  f_27;
    i_10  =  i_31;
  }  // B4
  // BLOCK 5
  i_32  =  i_6 + 6;
  i_33  =  i_32 << 2;
  ul_34  =  ul_2 + i_33;
  *((__global float *) ul_34)  =  f_8;
  ul_35  =  ul_3 + i_33;
  *((__global float *) ul_35)  =  f_9;
  i_36  =  i_4 + i_6;
  i_6  =  i_36;
  // BLOCK 6
  return;
}  //  kernel

Task info: s0.t0
	Backend           : OPENCL
	Device            : Intel(R) FPGA Emulation Device CL_DEVICE_TYPE_ACCELERATOR (available)
	Dims              : 1
	Global work offset: [0]
	Global work size  : [1024]
	Local  work size  : [64, 1, 1]
	Number of workgroups  : [16]

Total time:  441707758 ns 

-6.5101576 vs -1.5097469

Validation:  FAIL 

Expected behavior

I did a diff with the kernel for the same test as generated in the develop branch. The first observation is that there is a casting on the values used to calculate the address (ul_14) that is used for loading the data. See variable l_12, in the following screenshot.

image

I guess, the expected behavior would be to avoid this casting.

Computing system setup (please complete the following information):

  • OS: Ubuntu 23.10
  • OpenCL version: OpenCL 1.2
  • Driver: 2023.16.10.0.17_160000
  • TornadoVM commit id: dec5a87 from branch feat/jdk22_2nd_iteration

stratika avatar Jul 31 '24 13:07 stratika