
[BOUNTY $25] Kernel error for MonteCarlo on NVIDIA and AMD GPUs

Open freemo opened this issue 8 years ago • 7 comments

From @jjfumero on February 17, 2016 10:23

I am running a Monte Carlo simulation with Aparapi on JDK 1.8.0_65. For testing I use Intel OpenCL locally: there the kernel that Aparapi generates is correct, and the result matches the sequential code. However, if I use a GPU (NVIDIA or AMD), the generated kernel is not correct: the type of one variable declaration is missing.

My understanding is that Aparapi generates the OpenCL kernel independently of the underlying architecture (bytecode -> OpenCL C). Is that correct, or is there any communication during code generation?

Here are the details. This is the Aparapi kernel:

public static class MonteCarloKernel extends Kernel {

        private int size;
        private float[] result;

        public MonteCarloKernel(int size) {
            this.size = size;
            result = new float[size];
        }

        @Override
        public void run() {
            int idx = getGlobalId();
            int iter = 25000;

            long seed = idx;
            float sum = 0.0f;

            for (int j = 0; j < iter; ++j) {
                // generate a pseudo random number (you do need it twice)
                seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
                seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);

                // this generates a number between 0 and 1 (with an awful entropy)
                float x = ((float) (seed & 0x0FFFFFFF)) / 268435455f;

                // repeat for y
                seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
                seed = (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
                float y = ((float) (seed & 0x0FFFFFFF)) / 268435455f;

                float dist = (float) Math.sqrt(x * x + y * y);
                if (dist <= 1.0f)
                    sum += 1.0f;
            }
            sum *= 4;
            result[idx] = (float) sum / (float) iter;
        }

        public boolean checkResult(float[] seq) {
            for (int i = 0; i < seq.length; i++) {
                if (Math.abs( (float)(result[i] - seq[i])) > 0.001) {
                    return false;
                }
            }
            return true;
        }

        public float[] getResult() {
            return result;
        }

        public int getSize() {
            return size;
        }
    }
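
The sequential code I compare against is not shown here. As a rough sketch, assuming it simply mirrors `run()` one work-item at a time, it would look like the following (the class and method names `MonteCarloSequential`, `next`, `estimate` and `sequential` are made up for this example). Note that `268435455f` is not representable as a float and rounds to 268435456, which is why the generated kernels below print `2.68435456E8f`.

```java
public class MonteCarloSequential {

    // One step of the 48-bit LCG used in the kernel
    // (same multiplier, addend and mask as java.util.Random).
    static long next(long seed) {
        return (seed * 0x5DEECE66DL + 0xBL) & ((1L << 48) - 1);
    }

    // The value that the work-item with global id `idx` would write to result[idx].
    static float estimate(int idx, int iter) {
        long seed = idx;
        float sum = 0.0f;
        for (int j = 0; j < iter; ++j) {
            // advance the generator twice, as the kernel does
            seed = next(next(seed));
            float x = ((float) (seed & 0x0FFFFFFF)) / 268435455f;

            // repeat for y
            seed = next(next(seed));
            float y = ((float) (seed & 0x0FFFFFFF)) / 268435455f;

            float dist = (float) Math.sqrt(x * x + y * y);
            if (dist <= 1.0f) {
                sum += 1.0f;
            }
        }
        sum *= 4;
        return sum / (float) iter;
    }

    // Sequential stand-in for executing the kernel over `size` work-items.
    static float[] sequential(int size, int iter) {
        float[] out = new float[size];
        for (int i = 0; i < size; i++) {
            out[i] = estimate(i, iter);
        }
        return out;
    }

    public static void main(String[] args) {
        for (float v : sequential(4, 25000)) {
            System.out.println(v);
        }
    }
}
```

The array returned by `sequential(size, 25000)` is what would be passed to `checkResult` above.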


If I use Intel OpenCL:

NAME: Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz VENDOR: Intel(R) Corporation TYPE: CPU DRIVER: 1.2.0.57

This is the kernel Aparapi generates (the correct kernel):


#pragma OPENCL EXTENSION cl_khr_fp64 : enable

typedef struct This_s{
   __global float *result;
   int passid;
}This;
int get_pass_id(This *this){
   return this->passid;
}
__kernel void run(
   __global float *result, 
   int passid
){
   This thisStruct;
   This* this=&thisStruct;
   this->result = result;
   this->passid = passid;
   {
      int idx = get_global_id(0);
      int iter = 25000;
      long seed = (long)idx;
      float sum = 0.0f;
      for (int j = 0; j<iter; j++){
         seed = ((seed * 25214903917L) + 11L) & 281474976710655L;
         seed = ((seed * 25214903917L) + 11L) & 281474976710655L;
         float x = (float)(seed & 268435455L) / 2.68435456E8f;
         seed = ((seed * 25214903917L) + 11L) & 281474976710655L;
         seed = ((seed * 25214903917L) + 11L) & 281474976710655L;
         float y = (float)(seed & 268435455L) / 2.68435456E8f;
         float dist = (float)sqrt((double)((x * x) + (y * y)));
         if (dist<=1.0f){
            sum = sum + 1.0f;
         }
      }
      sum = sum * 4.0f;
      this->result[idx]  = sum / (float)iter;
      return;
   }
}

When I use Aparapi on NVIDIA or AMD GPUs (same JVM, JDK 1.8.0_65, but a different driver), I get this kernel:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

typedef struct This_s{
   __global float *result;
   int passid;
}This;
int get_pass_id(This *this){
   return this->passid;
}
__kernel void run(
   __global float *result, 
   int passid
){
   This thisStruct;
   This* this=&thisStruct;
   this->result = result;
   this->passid = passid;
   {
      int i_1 = get_global_id(0);
      int i_2 = 25000;
       l_3 = (long)i_1;
      float f_5 = 0.0f;
      int i_6 = 0;
      for (; i_6<i_2; i_6++){
         l_3 = ((l_3 * 25214903917L) + 11L) & 281474976710655L;
         l_3 = ((l_3 * 25214903917L) + 11L) & 281474976710655L;
         float f_7 = (float)(l_3 & 268435455L) / 2.68435456E8f;
         l_3 = ((l_3 * 25214903917L) + 11L) & 281474976710655L;
         l_3 = ((l_3 * 25214903917L) + 11L) & 281474976710655L;
         float f_8 = (float)(l_3 & 268435455L) / 2.68435456E8f;
         float f_9 = (float)sqrt((double)((f_7 * f_7) + (f_8 * f_8)));
         if (f_9<=1.0f){
            f_5 = f_5 + 1.0f;
         }
      }
      f_5 = f_5 * 4.0f;
      this->result[i_1]  = f_5 / (float)i_2;
      return;
   }
}

There is an error:

clBuildProgram failed
************************************************
:21:8: error: use of undeclared identifier 'l_3'
       l_3 = (long)i_1;
       ^

Note: NVIDIA-SMI 331.79 Driver Version: 331.79

AMD:

Name: Hawaii Vendor: Advanced Micro Devices, Inc. Device OpenCL C version: OpenCL C 1.2 Driver version: 1598.5 (VM)

Copied from original issue: aparapi/aparapi#25

freemo avatar Dec 02 '16 23:12 freemo

From @SubaruWRC on February 17, 2016 12:53

Hmm, so it's not declaring the variable as a long. Does this also occur if you declare the value as a double?



From @ericcaspole on February 17, 2016 13:13

I think you must be building the Aparapi code slightly differently. In the first case, the one that works, you have symbols; in the failing case you have generated symbol names like i_1. Try building the second case with javac -g and see what happens.

Regards, Eric
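
For anyone who wants to check which build they have, here is a rough sketch of one way to test whether a class file carries a `LocalVariableTable`. It compiles a throwaway class through the standard `javax.tools` API, once with javac's defaults and once with `-g`, and searches the class file for the attribute name. `DebugInfoCheck`, `Sample` and `hasLocalVariableTable` are names invented for this example, and the text search is a shortcut, not a real class-file parser.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class DebugInfoCheck {

    // Hypothetical stand-in for a kernel class; only the local variable matters here.
    private static final String SOURCE =
        "public class Sample { long f(int idx) { long seed = idx; return seed * 2; } }";

    // Compiles SOURCE with the given javac debug flag (null means javac's defaults)
    // and reports whether the class file mentions the LocalVariableTable attribute.
    static boolean hasLocalVariableTable(String debugFlag) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("No system compiler found; run this on a JDK");
        }
        try {
            Path dir = Files.createTempDirectory("lvt-check");
            Path src = dir.resolve("Sample.java");
            Files.write(src, SOURCE.getBytes(StandardCharsets.UTF_8));
            String[] args = (debugFlag == null)
                ? new String[] {"-d", dir.toString(), src.toString()}
                : new String[] {debugFlag, "-d", dir.toString(), src.toString()};
            if (javac.run(null, null, null, args) != 0) {
                throw new IllegalStateException("Compilation failed");
            }
            byte[] classFile = Files.readAllBytes(dir.resolve("Sample.class"));
            // Attribute names are stored as UTF-8 constants, so a text search is enough here.
            return new String(classFile, StandardCharsets.ISO_8859_1)
                .contains("LocalVariableTable");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default javac: " + hasLocalVariableTable(null));
        System.out.println("javac -g:      " + hasLocalVariableTable("-g"));
    }
}
```

With a stock javac, the default build only records source file and line numbers, so the table should be absent unless `-g` (or `-g:vars`) is passed. `javap -l` on the kernel class shows the same information.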



From @jjfumero on February 17, 2016 15:19

Hi @ericcaspole, with javac -g it works! The kernel is correct (the same as the first version). I have other kernels, such as NBody, KMeans and Black-Scholes, running on NVIDIA and AMD, and those do not need the debug info. Only this one, Monte Carlo, does.

Thanks


From @grfrost on February 17, 2016 15:31

It looks like a bug trying to determine the scope of the var.

I have a workaround which I think will work.

Replace

long seed = (long)idx;

With

long seed = 0L;

seed = (long)idx;

For some reason I think the cast is confusing the bytecode parser and it is not seeing this as a declaration.

Let me know if this works. It should be a fairly simple fix.

Gary



From @grfrost on February 17, 2016 15:34

Eric is correct.

When we have local var info (-g) we can use this table to confirm the scope of vars.

Without this we try to make a 'fake' one. It looks like the creation of this fake local variable table is broken here. My suggestion above should make things much easier for the code that creates the 'fake' table.

Gary
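
To make the idea concrete, here is a toy sketch of the "declare on first store" rule that a generator has to fall back on when there is no local variable table. This is not Aparapi's code: `FirstStoreEmitter`, `Store` and `emit` are hypothetical names, the synthetic variable names only imitate the ones seen in the generated kernel (`i_1`, `l_3`, `f_5`), and scopes and slot reuse, which are the hard part, are ignored entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FirstStoreEmitter {

    // One store into a local variable slot, e.g. slot 3 holding a long.
    public static final class Store {
        final int slot;
        final String type;
        final String expr;

        public Store(int slot, String type, String expr) {
            this.slot = slot;
            this.type = type;
            this.expr = expr;
        }
    }

    // Synthetic name: first letter of the type plus the slot index, e.g. l_3.
    static String name(Store s) {
        return s.type.charAt(0) + "_" + s.slot;
    }

    // The first store to a slot becomes a declaration; later stores are plain assignments.
    public static List<String> emit(List<Store> stores) {
        Set<Integer> declared = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (Store s : stores) {
            // Set.add returns true only the first time a slot is seen.
            String prefix = declared.add(s.slot) ? s.type + " " : "";
            out.add(prefix + name(s) + " = " + s.expr + ";");
        }
        return out;
    }

    public static void main(String[] args) {
        List<Store> stores = Arrays.asList(
            new Store(1, "int", "get_global_id(0)"),
            new Store(2, "int", "25000"),
            new Store(3, "long", "(long)i_1"),
            new Store(3, "long", "((l_3 * 25214903917L) + 11L) & 281474976710655L"));
        for (String line : emit(stores)) {
            System.out.println(line);
        }
    }
}
```

In the failing kernel, the first store to slot 3 was emitted as a plain assignment, which is the line the OpenCL compiler rejects.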



From @jjfumero on February 17, 2016 15:41

Hi Gary, thanks for the explanation. I tried your solution, but the kernel is still not correct (unless I compile with -g). This is part of the auto-generated kernel:

      l_3 = 0L; 
      l_3 = (long)i_1;
      float f_5 = 0.0f;

Juan


From @grfrost on February 17, 2016 18:7

Thanks for getting back. Hopefully -g will work for you in the short term.

I will try to take a look at the issue here.

Gary
