potential bug in pycuda.gpuarray.vec
Thank you very much for this library. I have been playing around with it for a while and just found the following oddity. The following example produces the expected outcome.
import numpy
import pycuda.autoinit
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray
multiply_them = pycuda.compiler.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b, const float3 test) {
dest[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x] + test.x;
}
""").get_function("multiply_them")
a = numpy.random.randn(10).astype(numpy.float32)
b = numpy.random.randn(10).astype(numpy.float32)
dest = numpy.zeros(10).astype(numpy.float32)
multiply_them(
pycuda.driver.Out(dest),
pycuda.driver.In(a),
pycuda.driver.In(b),
pycuda.gpuarray.vec.make_float3(1.0, 2.0, 3.0),
block=(10, 1, 1), grid=(1, 1, 1)
)
print(dest - a*b)  # [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
However, if a float4 is used instead of the float3, the output is different.
import numpy
import pycuda.autoinit
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray
multiply_them = pycuda.compiler.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b, const float4 test) {
dest[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x] + test.x;
}
""").get_function("multiply_them")
a = numpy.random.randn(10).astype(numpy.float32)
b = numpy.random.randn(10).astype(numpy.float32)
dest = numpy.zeros(10).astype(numpy.float32)
multiply_them(
pycuda.driver.Out(dest),
pycuda.driver.In(a),
pycuda.driver.In(b),
pycuda.gpuarray.vec.make_float4(1.0, 2.0, 3.0, 4.0),
block=(10, 1, 1), grid=(1, 1, 1)
)
print(dest - a*b)  # [ 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
The only difference between those examples is in the following lines.
- __global__ void multiply_them(float *dest, float *a, float *b, const float3 test) {
+ __global__ void multiply_them(float *dest, float *a, float *b, const float4 test) {
- pycuda.gpuarray.vec.make_float3(1.0, 2.0, 3.0),
+ pycuda.gpuarray.vec.make_float4(1.0, 2.0, 3.0, 4.0),
Is my understanding wrong? Shouldn't the outcome be the same regardless of whether a float3 or a float4 is used?
Once again, thank you!
Seems like an alignment problem. If you move test to the first position in the list (or add a dummy float* argument between b and test, or remove b), it works as expected:
import numpy
import pycuda.autoinit
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray
multiply_them = pycuda.compiler.SourceModule("""
__global__ void multiply_them(const float4 test, float *dest, float *a, float *b) {
dest[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x] + test.x;
}
""").get_function("multiply_them")
a = numpy.random.randn(10).astype(numpy.float32)
b = numpy.random.randn(10).astype(numpy.float32)
dest = numpy.zeros(10).astype(numpy.float32)
multiply_them(
pycuda.gpuarray.vec.make_float4(1.0, 2.0, 3.0, 4.0),
pycuda.driver.Out(dest),
pycuda.driver.In(a),
pycuda.driver.In(b),
block=(10, 1, 1), grid=(1, 1, 1)
)
print(dest - a*b)
Yes, I assume that as well. My experience with CUDA is a little rusty, but shouldn't the programmer be spared from worrying about alignment in such a case? I am aware that alignment can become an issue when passing custom structs, but this example only uses built-in vector types.
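For context, the host-side float4 can be modelled as a plain packed numpy structured dtype (a sketch only, not necessarily how pycuda.gpuarray.vec defines it), which shows that nothing on the host side encodes CUDA's alignment requirement:

```python
import numpy as np

# Hypothetical host-side analogue of pycuda.gpuarray.vec.float4:
# four packed 32-bit floats, 16 bytes in total.
float4 = np.dtype([("x", np.float32), ("y", np.float32),
                   ("z", np.float32), ("w", np.float32)])

v = np.zeros((), dtype=float4)
v["x"], v["y"], v["z"], v["w"] = 1.0, 2.0, 3.0, 4.0

assert float4.itemsize == 16
# Nothing in this dtype records CUDA's requirement that a float4
# kernel argument must start on a 16-byte boundary.
```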
The following plain CUDA C example produces the expected outcome, supporting my assumption that there might be a bug in the implementation or usage of pycuda.gpuarray.vec.make_*4.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

__global__ void multiply_them(float *dest, float *a, float *b, const float4 test) {
    dest[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x] + test.x;
}

int main(int argc, char* argv[]) {
    float* h_dest = (float*) malloc(sizeof(float) * 10);
    float* h_a = (float*) malloc(sizeof(float) * 10);
    float* h_b = (float*) malloc(sizeof(float) * 10);
    srand(time(NULL));
    for (int i = 0; i < 10; i += 1) {
        h_a[i] = (rand() % 1000) / 1000.0;
        h_b[i] = (rand() % 1000) / 1000.0;
    }
    float* d_dest = NULL;
    float* d_a = NULL;
    float* d_b = NULL;
    cudaMalloc(&d_dest, sizeof(float) * 10);
    cudaMalloc(&d_a, sizeof(float) * 10);
    cudaMalloc(&d_b, sizeof(float) * 10);
    cudaMemcpy(d_a, h_a, sizeof(float) * 10, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, sizeof(float) * 10, cudaMemcpyHostToDevice);
    multiply_them<<<1, 10>>>(d_dest, d_a, d_b, make_float4(1.0, 2.0, 3.0, 4.0));
    cudaMemcpy(h_dest, d_dest, sizeof(float) * 10, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i += 1) {
        printf("%f\n", h_dest[i] - h_a[i]*h_b[i]); // 1.000000
    }
    cudaFree(d_dest);
    cudaFree(d_a);
    cudaFree(d_b);
    free(h_dest);
    free(h_a);
    free(h_b);
    return 0;
}
It may also be related to how PyCUDA passes parameters to the kernel. The documentation for cuLaunchKernel says:
Kernel parameters to f can be specified in one of two ways:
Kernel parameters can be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.
Kernel parameters can also be packaged by the application into a single buffer that is passed in via the extra parameter. This places the burden on the application of knowing each kernel parameter's size and alignment/padding within the buffer. Here is an example of using the extra parameter in this manner:
As you can see in the sources, PyCUDA takes the second route, so it is responsible for each parameter's alignment and padding itself. Perhaps something goes wrong there.
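That suspicion is easy to check on the host. The following sketch (a hypothetical reconstruction, not PyCUDA's actual code) packs the four arguments without padding, assuming 64-bit device pointers; the pointer value is a placeholder. The kernel, however, expects the float4 at the next 16-byte boundary, offset 32, so test.x lands on the third packed component:

```python
import struct

# Placeholder device pointer value (8 bytes each on a 64-bit platform).
ptr = 0

# Naive packing places the float4 immediately after the three pointers,
# at byte offset 24:
buf = struct.pack("=QQQffff", ptr, ptr, ptr, 1.0, 2.0, 3.0, 4.0)
assert len(buf) == 40  # 3 * 8 + 4 * 4

# CUDA requires a float4 argument to sit on a 16-byte boundary, so the
# kernel reads test starting at offset 32 instead of 24. test.x then
# comes from byte 32, which holds the *third* packed component:
(x_seen_by_kernel,) = struct.unpack_from("=f", buf, 32)
print(x_seen_by_kernel)  # 3.0, matching the observed dest - a*b == 3
```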
Excellent finding, thank you for pointing this out. It also explains why adding two dummy floats works: the three 8-byte pointers plus two 4-byte floats occupy exactly 32 bytes, so the float4 implicitly ends up on the 16-byte boundary the kernel expects.
import numpy
import pycuda.autoinit
import pycuda.driver
import pycuda.compiler
import pycuda.gpuarray
multiply_them = pycuda.compiler.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b, float dummy1, float dummy2, const float4 test) {
dest[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x] + test.x;
}
""").get_function("multiply_them")
a = numpy.random.randn(10).astype(numpy.float32)
b = numpy.random.randn(10).astype(numpy.float32)
dest = numpy.zeros(10).astype(numpy.float32)
multiply_them(
pycuda.driver.Out(dest),
pycuda.driver.In(a),
pycuda.driver.In(b),
numpy.float32(-1.0), # dummy1
numpy.float32(-1.0), # dummy2
pycuda.gpuarray.vec.make_float4(1.0, 2.0, 3.0, 4.0),
block=(10, 1, 1), grid=(1, 1, 1)
)
print(dest - a*b)  # [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]