CUDA-CNN

Possible unnecessary __syncthreads() use in kernel function g_getCost_3 in common/cuBase.cu

Open · Lebronmydx opened this issue 6 years ago · 0 comments

I am currently studying the CUDA-CNN source code and may have found an unnecessary use of __syncthreads() in the kernel function g_getCost_3 in common/cuBase.cu. The kernel looks like this:

__global__ void g_getCost_3(float* cost,
	float** weight,
	float lambda, int wlen)
{
	extern __shared__ float _sum[];
	_sum[threadIdx.x] = 0;          // initialize each thread's partial sum
	__syncthreads();                // line 148: the barrier in question
	float* w = weight[blockIdx.x];  // line 149: each block handles one weight array

	for(int i = 0; i < wlen; i += blockDim.x)
	{
		int id = i + threadIdx.x;
		if(id < wlen)
		{
			_sum[threadIdx.x] += w[id] * w[id];  // accumulate squared weights
		}
	}
......
}
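For context (this is a hedged sketch of the usual pattern, not the actual elided lines of g_getCost_3): the rest of such a kernel is typically a shared-memory tree reduction, whose later __syncthreads() calls are genuinely needed, followed by one thread writing the block's contribution. The atomicAdd target and the lambda * 0.5f scaling below are assumptions.

	// Hedged sketch of a typical shared-memory reduction tail; NOT the
	// actual elided source of g_getCost_3.
	__syncthreads();                        // make all partial sums visible
	for(int skip = blockDim.x >> 1; skip > 0; skip >>= 1)
	{
		if(threadIdx.x < skip)
		{
			_sum[threadIdx.x] += _sum[threadIdx.x + skip];
		}
		__syncthreads();                // barrier after each halving step
	}
	if(threadIdx.x == 0)
	{
		// Assumed scaling for an L2 weight-decay term: lambda / 2 * sum(w^2);
		// atomicAdd because every block adds into the same scalar cost.
		atomicAdd(cost, _sum[0] * lambda * 0.5f);
	}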

Meanwhile, g_getCost_3 is called in the project in the following form:

g_getCost_3<<<dim3(w.size()), dim3(32), sizeof(float) * 32>>>(cost->getDev(),

So we can be sure there are 32 threads in each block, which means each block consists of exactly one warp.
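As a quick illustration (a hypothetical check, not part of the project), a device-side assert can confirm that a dim3(32) launch yields single-warp blocks:

#include <assert.h>

// Hypothetical sanity-check kernel: with <<<grid, dim3(32)>>> every block
// spans exactly one warp, since warpSize is 32 on current NVIDIA GPUs.
__global__ void g_checkSingleWarp()
{
	assert(blockDim.x == warpSize);
}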

Before the kernel reaches line 149, there is no branch divergence in the program, so up to that point all threads in the same warp are synchronized (ref: https://devtalk.nvidia.com/default/topic/632471/is-syncthreads-required-within-a-warp- ): if there is no divergence in the warp, all threads of the warp execute the same instruction at the same time, so there is no need to synchronize at the warp level.

As a result, we can safely conclude that by the time the threads in a block execute "float* w = weight[blockIdx.x];" at line 149, every thread has already finished "_sum[threadIdx.x] = 0;", so the __syncthreads() at line 148 should be unnecessary, right?
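To make the proposal concrete, here is a sketch of the kernel opening with that barrier removed (g_getCost_3_noSync is a hypothetical name, and this is untested; it relies on the warp-lockstep behavior argued above):

// Sketch of the proposed change: the initialization barrier removed.
// This relies on the whole block being a single warp executing in
// lockstep, as discussed above.
__global__ void g_getCost_3_noSync(float* cost,  // hypothetical name
	float** weight,
	float lambda, int wlen)
{
	extern __shared__ float _sum[];
	_sum[threadIdx.x] = 0;
	// __syncthreads() removed here: with blockDim.x == 32 there is only
	// one warp per block and no divergence before this point.
	float* w = weight[blockIdx.x];

	for(int i = 0; i < wlen; i += blockDim.x)
	{
		int id = i + threadIdx.x;
		if(id < wlen)
		{
			_sum[threadIdx.x] += w[id] * w[id];
		}
	}
	// ... remainder unchanged ...
}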

Thanks very much! :)

Lebronmydx · Aug 22 '18 13:08