aparapi icon indicating copy to clipboard operation
aparapi copied to clipboard

[Bounty $50] Inconsistent results between GPU and CPU when integers overflow.

Open freemo opened this issue 8 years ago • 62 comments

The following code produces different results when run on the GPU vs the CPU.

import com.aparapi.*;

public class Main {
    public static void main(String[] args) {
        int num = 1;

        final long[] result = new long[num];
        final int start = Integer.MAX_VALUE;

        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                final int id = getGlobalId();
                result[id] = calculate(start + id);
            }
        };
        kernel.execute(num);

        System.out.println( "expected: " +  calculate(start) + " result: " + result[0]);
    }

    public static long calculate(int tc) {
        return (long) tc * 100;
    }
}

The output from the above code snippet is:

expected: 214748364700 result: 4294967196

I tested this on my Macbook pro but others noticed the problem as well on other unspecified platforms. Also changin the calculate function such that 100 is a long rather than an integer with return (long) tc * 100l; (notice the letter l at the end of the 100) will produce the exact same incorrect results as above.

freemo avatar Dec 30 '16 15:12 freemo

IT should be noted this is not simply an issue with rolling over. if I change the calculate line to return (long) tc + 212600881053l; instead it should produce the same result mathematically using addition rather than multiplication. Despite this the program actually runs successfully with this new edit producing the following result:

expected: 214748364700 result: 214748364700

freemo avatar Dec 30 '16 15:12 freemo

Hi, thanks for opening this issue to here.

I think there is something wrong in OpenCl. I am sending kernel and host code in C language. It STILL gives wrong results on GPU.

kernel.zip

savaskoc avatar Dec 30 '16 22:12 savaskoc

What opencl implementation are you using and what cpu/gpu does the macbook have?

CC007 avatar Jan 02 '17 16:01 CC007

It's default implementation in macOS. I'm using 10.12 (16A320) MacBook Pro (Retina, 13-inch, Late 2013). It has i5 4258U CPU and Iris 5100 GPU

savaskoc avatar Jan 02 '17 16:01 savaskoc

Are you using the amd app sdk?

CC007 avatar Jan 02 '17 17:01 CC007

Also in the calculate can you try it wi brackets around the cast, putting the multiplication outside of the brackets?

CC007 avatar Jan 02 '17 17:01 CC007

@CC007 The issue seems to occur on linux as well and on AMD App SDK. It appears this bug is not platform specific. IF you runt he code in the original post on your local computer will probably see the bug as well. Did you try? the bug behaves very oddly for me. for example even though id is always 0 if you remove the "+ id" part int he calculate call it wont break anymore.

freemo avatar Jan 02 '17 18:01 freemo

Anyone check the c code i posted?

savaskoc avatar Jan 02 '17 18:01 savaskoc

@savaskoc Not yet but i will give it a go this evening. If your saying it produces the same incorrect results however then I expect you to be correct that it is an opencl issue directly.

freemo avatar Jan 02 '17 18:01 freemo

I think this is an OpenCl issue because GPU produces incorrect results independently from platform and/or language (C, Java, Python etc.).

savaskoc avatar Jan 02 '17 18:01 savaskoc

@savaskoc You realize the bug is producing the result but clipping it to 32 bits? To use the test I posted above as the example (the same is true for the numbers you posted) here is the breakdown.

Expected result is 214748364700 which in binary would be:

0011 0001 1111 1111 1111 1111 1111 1111 1001 1100

actual result we get is 4294967196 which in binary is:

0000 0000 1111 1111 1111 1111 1111 1111 1001 1100

Basically just drops all but the last 32 bits.

So while this is a legitimate bug, it is definitely occurring due to mishandling of 64bit variables.

freemo avatar Jan 02 '17 18:01 freemo

I'm aware of that. I tried another types than long but still gives incorrect results. Maybe some GPU's can't process 64 bit data types?

savaskoc avatar Jan 02 '17 18:01 savaskoc

@savaskoc I think there is more to it than that. I think when i was testing i tried some variants that produced some very odd results. I need to test this again to make sure I'm remembering correctly but when I removed the id variable (which in this test is always 0 anyway so shouldnt make a difference) it actually caused the correct results to be produced. Once I saw that behavior it was apparent that we were talking about a legitimate bug, I just cant confirm yet if the bug is isolated to aparapi or an OpenCL issue in general yet (I need to do more testing).

freemo avatar Jan 02 '17 18:01 freemo

@savaskoc also as I stated in the OP if you change the operation to addition but change the value of the operand to make it mathematically equivalent, it magically works. So the problem seems to occur only on multiplication but not addition. this leads me to believe it is a genuine error rather than a hardware compatibility issue or something.

freemo avatar Jan 02 '17 18:01 freemo

Does opencl provide software 64bit support or hardware 64bit support (or both)?

CC007 avatar Jan 02 '17 18:01 CC007

My guess is that savaskoc is correct. I think that the GPU OpenCL runtime does not support long.

Aparapi will detect this for doubles, surprised it does not detect this for long.

grfrost avatar Jan 02 '17 18:01 grfrost

@grfrost If I am correct, why (long) tc + 212600881053l; line works but (long) tc * 100; not?

savaskoc avatar Jan 02 '17 18:01 savaskoc

Actually, I was about to retract ;) from the 1.0 spec https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/scalarDataTypes.html it does not look like cl_long is optional. So now I think I will blame Aparapi. My guess is that as the AST is built we end up with

     (long)
        | 
        *
 tc         100

Instead of

             *
         /        \
    (long)         100
        |
       tc

Can you try

return 100 * (long) tc ;

and/or

return (long)(tc+0L) * 100;

(hope my diagrams make it unscathed)

grfrost avatar Jan 02 '17 18:01 grfrost

I doubt. even original OpenCl does not produce correct results. Check my kernel above.

savaskoc avatar Jan 02 '17 18:01 savaskoc

I ran the code on my computer (the code from freemo) and it runs as it should.

Using: CPU: 4670k GPU: 1070GTX Nvidia and intel openCL implementation

CC007 avatar Jan 02 '17 19:01 CC007

@CC007 You're saying that you get same results both GPU and CPU mode right? Can you try my kernel in C?

savaskoc avatar Jan 02 '17 19:01 savaskoc

I will try that one next. I also tested freemo's code by setting it up to use the cpu opencl device

CC007 avatar Jan 02 '17 19:01 CC007

It generated within build phase. Did you add kernel.cl to compile sources? screen shot 2017-01-02 at 22 57 56

savaskoc avatar Jan 02 '17 19:01 savaskoc

@savaskoc The problem seems to be that I don't have any opencl sdk, only the driver and implementation afaik, as I am not using the AMD app SDK

CC007 avatar Jan 02 '17 20:01 CC007

I have a pc :)

CC007 avatar Jan 02 '17 20:01 CC007

I think you misunderstand, It wasn't the kernel.cl.h that caused a problem. I dont have the OpenCL/opencl.h, because I don't have an opencl SDK. I'm installing one now (Intel openCL sdk)

CC007 avatar Jan 02 '17 20:01 CC007

Ok, now that that is installed, it seems that there are types used that don't come from openCL

CC007 avatar Jan 02 '17 21:01 CC007

On my macbook pro the following yields the same error.

clang++ -framework OpenCL longtst.cpp -o longtst
#include <iostream>
#ifdef __APPLE__
#include <opencl/opencl.h>
#else
#include <CL/opencl.h>
#endif

#define DATA_SIZE 1
#define LONG_DATA_SIZE DATA_SIZE*sizeof(cl_long)
int main(int argc, char **argv){
   long out[DATA_SIZE];
   out[0]=0L;

   // How many platforms are there ?
   cl_uint platformc = 0;
   clGetPlatformIDs(0, NULL, &platformc);

   if (platformc >0){
      // Extract a list of available platforms
      cl_platform_id *platforms = new cl_platform_id[platformc];
      clGetPlatformIDs(platformc, platforms, NULL);
   
      cl_device_id device_id=0;
      // loop through platforms until we have a valid GPU device 
      for (unsigned int i = 0; !device_id && i < platformc; ++i) {
         clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 1, &device_id, nullptr);
      }
      delete[] platforms;
   
      // only device_id context and command queue needed below
   
      if (device_id){
         cl_int err;

         // Create a context
         cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

         // Create command queue for this context
         cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &err);

         // Here is our OpenCL kernel source for
         const char *source = 
            "__constant int start = 2147483647;\n"
            "static long calculate(int tc){\n"
            "    return (long)tc * 100;\n"
            "}\n"
            "__kernel void longtst(__global long *result){\n"
            "   int id = get_global_id(0);\n"
            "   result[id] = calculate(start + id);\n"
            "}\n";

         // Compile source 
         cl_program program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &err);
         err = clBuildProgram(program, 1, &(device_id), NULL, NULL, NULL);

         // Extract and show any compile errors or warnings
         if (err != CL_SUCCESS){
            size_t len;
            err = clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
            if (len >0){
              len++; // for '\0'
              char *compile_log = (char *) malloc(len);
              clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, len, (void *)compile_log, NULL);
              std::cerr <<"log{"<<std::endl<< compile_log << std::endl<<"}"<<std::endl;
              free (compile_log);
            }
         }

         // A program can have more than one kernel, select the kernel we want to call
         cl_kernel kernel = clCreateKernel(program, "longtst", &err);

         // Create buffers which 'wrap' the host data
         cl_mem outBuf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_WRITE_ONLY, LONG_DATA_SIZE, (void*)out, &err);

         // Set any kernel args
         err = clSetKernelArg(kernel, 0 , sizeof(cl_mem), &outBuf);

         // An event list helps us dispatch efficiently
#define EVENTS 2
         cl_event *events = new cl_event[EVENTS];

         // Decide how to partition the execution (we choose 1 group 1 threads)
         size_t globalRange = DATA_SIZE;
         size_t localRange = DATA_SIZE;

         // Enqueue the execution
         err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalRange, &localRange, 0, NULL, &events[1]);
   
         // Enqueue a read of 'out' data to the command queue
         err = clEnqueueReadBuffer(command_queue, outBuf, CL_TRUE, 0, LONG_DATA_SIZE, out, 1, &events[1], &events[0]);  

         // Wait for all the dispatches to complete.
         err = clWaitForEvents(EVENTS, events);
   
         // Release and delete the events 
         for (int i=0; i<EVENTS; i++){
            err = clReleaseEvent(events[i]);
         }
         delete[] events;

         // Release mem objects
         clReleaseMemObject(outBuf);

         // Release Kernel
         clReleaseKernel(kernel);

         // Release Program
         clReleaseProgram(program);

         // Release Context
         clReleaseContext(context);

         // Release Command Queue
         clReleaseCommandQueue(command_queue);

         // Note that we don't releaese any type ending in _id
   
         std::cout << out[0] << std::endl;
      }
   }
}

grfrost avatar Jan 02 '17 21:01 grfrost

BTW If I switch to use CPU device the above code yields the correct result. So yes I think there is an OpenCL runtime GPU issue here

Retina 15 inch 2.8 Ghz Intel I7, Intel Iris Pro Graphics.

grfrost avatar Jan 02 '17 21:01 grfrost

I tried to compile your example using: g++ -lOpenCL -o longtst main.cpp -I"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include" -std=c++11 -L"P:\Program Files (x86)\Intel\OpenCL SDK\6.3\lib\x64" and got the following errors:

||=== Build: Debug in OpenCLDemo (compiler: GNU GCC Compiler) ===|
main.cpp||In function 'int main(int, char**)':|
main.cpp|40|warning: '_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)' is deprecated (declared at P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include/CL/cl.h:1428) [-Wdeprecated-declarations]|
main.cpp|40|warning: '_cl_command_queue* clCreateCommandQueue(cl_context, cl_device_id, cl_command_queue_properties, cl_int*)' is deprecated (declared at P:\Program Files (x86)\Intel\OpenCL SDK\6.3\include/CL/cl.h:1428) [-Wdeprecated-declarations]|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetPlatformIDs@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetPlatformIDs@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetDeviceIDs@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateContext@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateCommandQueue@20'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateProgramWithSource@20'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clBuildProgram@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetProgramBuildInfo@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clGetProgramBuildInfo@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateKernel@12'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clCreateBuffer@24'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clSetKernelArg@16'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clEnqueueNDRangeKernel@36'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clEnqueueReadBuffer@36'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clWaitForEvents@8'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseEvent@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseMemObject@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseKernel@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseProgram@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseContext@4'|
C:\Users\Rik\AppData\Local\Temp\ccMhHL4D.o:main.cpp|| undefined reference to `clReleaseCommandQueue@4'|
||=== Build failed: 21 error(s), 2 warning(s) (0 minute(s), 0 second(s)) ===|

CC007 avatar Jan 02 '17 22:01 CC007