ROCm-OpenCL-Runtime

Dead Code Elimination incorrectly optimises away dependent code

Open Dantali0n opened this issue 4 years ago • 11 comments

Hello, I am writing an FFT algorithm in OpenCL and have found a pretty nasty bug in the ROCm OpenCL implementation. The problem revolves around the l2 variable in the following kernel:

void kernel fft(global double *real, global double *imag, ulong size, ulong power) {
	double c1 = -1.0;
	double c2 = 0.0;
	long l2 = 1;

	for (uint l = 0; l < power; l++) {
		uint l1 = l2;
		l2 <<= 1;
		double u1 = 1.0;
		double u2 = 0.0;

		for (uint j = 0; j < l1; j++) {
			for (uint i = j; i < size; i += l2) {
				uint i1 = i + l1;
				double t1 = u1 * real[i1] - u2 * imag[i1];
				double t2 = u1 * imag[i1] + u2 * real[i1];

				real[i1] = real[i] - t1;
				imag[i1] = imag[i] - t2;
				real[i] += t1;
				imag[i] += t2;
			}
			double z = ((u1 * c1) - (u2 * c2));
			u2 = ((u1 * c2) + (u2 * c1));
			u1 = z;
		}

		double onecm = 1.0 - c1;
		double onecp = 1.0 + c1;
		c2 = sqrt(onecm / 2.0);
		c1 = sqrt(onecp / 2.0);

		c2 = -c2;	
	}
}

This kernel is launched with a global range of 1, so there is no parallelism at all: a single CU, a single SE, a single wavefront. Nevertheless, the kernel produces incorrect results.

I know for sure this is an optimization bug, as forcing l2 to be printed during execution makes the kernel produce correct results. Furthermore, adding -cl-opt-disable to the program build options also resolves the issue!

...
for (uint l = 0; l < power; l++) {
	uint l1 = l2;
	l2 <<= 1;
	printf("l2: %ld\n", l2);
	double u1 = 1.0;
	double u2 = 0.0;
...

Once again, this cannot be due to concurrency issues, as the kernel is launched with:

this->cl_queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, cl::NDRange(1), cl::NullRange);

Setting -WB, -simplifycfg-sink-common=0, as mentioned in the DarkTable issue, does not resolve the problem. Any optimization level above -O0 produces incorrect results.
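The on/off comparison described above (building once with default options and once with -cl-opt-disable) could be wrapped in a small helper. The following is only a sketch: build_program is a hypothetical name, and it assumes a program type exposing build(const char*), which cl::Program from CL/cl2.hpp provides.

```cpp
#include <string>

// Hypothetical helper (not part of the project): builds any program object
// that exposes build(const char*), e.g. cl::Program from CL/cl2.hpp.
// When optimize is false, the standard OpenCL option -cl-opt-disable is passed.
template <typename Program>
int build_program(Program& program, bool optimize) {
    const std::string options = optimize ? "" : "-cl-opt-disable";
    return program.build(options.c_str());
}
```

Running the same input through both builds and comparing the output buffers shows whether the optimizer is responsible for the difference.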

Dantali0n, May 17 '20 07:05

Does it work if you replace double with float?

qishilu, May 18 '20 10:05

It has been over three months. Any update on this?

Dantali0n, Aug 25 '20 12:08

Sorry for the delay. I wasn't expecting compiler concerns to be reported here. Can you provide the sources for the kernel and a standalone app which drives the kernel and checks that the result is as expected?

b-sumner, Aug 25 '20 15:08

The attached project provides the ard-ocl target, whose source can be found in the oclfft folder. Several test cases for ard-ocl are included in the tests folder, which uses boost as the unit test framework. The FFT function shown in a previous comment on this issue is used, but it produces incorrect results when compared against FFTW. The kernel is launched sequentially, i.e. with a dimension of 1. When the kernel code is run on the CPU instead of through ROCm and OpenCL, the results are correct.
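For illustration, a CPU check of this kind could be a direct C++ port of the kernel's loops, roughly as sketched below. The name fft_reference and the use of std::vector are assumptions made for this sketch and are not taken from the attached project. It performs only the butterfly passes shown in the kernel, so any reordering of the input is assumed to happen elsewhere, and the buffers are assumed to hold 2^power elements.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side port of the kernel's butterfly loops (illustration only).
// Works in place on real/imag, which are assumed to hold 2^power elements each.
void fft_reference(std::vector<double>& real, std::vector<double>& imag, unsigned power) {
    assert(real.size() == imag.size());
    const std::size_t size = real.size();
    double c1 = -1.0;
    double c2 = 0.0;
    std::size_t l2 = 1;

    for (unsigned l = 0; l < power; ++l) {
        const std::size_t l1 = l2;  // half the butterfly span for this pass
        l2 <<= 1;                   // full span; this doubling is the value in question
        double u1 = 1.0;
        double u2 = 0.0;

        for (std::size_t j = 0; j < l1; ++j) {
            for (std::size_t i = j; i < size; i += l2) {
                const std::size_t i1 = i + l1;
                const double t1 = u1 * real[i1] - u2 * imag[i1];
                const double t2 = u1 * imag[i1] + u2 * real[i1];

                real[i1] = real[i] - t1;
                imag[i1] = imag[i] - t2;
                real[i] += t1;
                imag[i] += t2;
            }
            // Rotate the twiddle factor (u1 + i*u2) by (c1 + i*c2).
            const double z = u1 * c1 - u2 * c2;
            u2 = u1 * c2 + u2 * c1;
            u1 = z;
        }

        // Half-angle update for the next pass; both lines use the old c1.
        c2 = -std::sqrt((1.0 - c1) / 2.0);
        c1 = std::sqrt((1.0 + c1) / 2.0);
    }
}
```

The GPU output buffers can then be compared element by element against this result, using a small tolerance for floating-point differences.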

FFTW, boost and cmake are required to run the standalone app.

perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b.tar.gz

Dantali0n, Sep 01 '20 07:09

@b-sumner Hello, it has been another three months. On the 1st of September 2020 I provided the isolated app with test cases that compare FFTW against the aforementioned kernel. Could you please try it and confirm the optimization bug? Please note that the kernel works with -O0 and fails with -O1 and above, which indicates an optimization bug.

Dantali0n, Dec 01 '20 12:12

Can someone else please look at this @vsytch @JasonTTang ???

Dantali0n, Dec 18 '20 11:12

This is a compiler issue, not runtime. Could you report your problem here https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues ? It might help to get more attention.

gandryey, Dec 18 '20 15:12

I will try. Honestly, I have given up all hope of ever getting this fixed.

Dantali0n, Dec 18 '20 18:12

I downloaded the archive from the link and installed everything needed to build. However, the build fails because kernel.sh is not found, although cmake expects it. From CMakeLists.txt: COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/kernel.sh

b-sumner, Jan 06 '21 19:01

Ah, I see. Yes, it has been quite a while since I made this example for the issue. I have fixed the compilation issues now.

perf-engineering-project-ard-seq.zip

Dantali0n, Jan 07 '21 08:01

Well, the build went further, but...

[ 87%] Building CXX object tests/CMakeFiles/testaocl.dir//oclfft/ard-ocl/src/ard-ocl.cxx.o
...
In file included from /perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b/oclfft/ard-ocl/src/ard-ocl.cxx:1:0:
/perf-engineering-project-3d31331f3aa00dc5d800af6e2b2210fcf104234b/oclfft/ard-ocl/include/ard-ocl.hpp:12:10: fatal error: CL/cl2.hpp: No such file or directory
 #include <CL/cl2.hpp>
          ^~~~~~~~~~~~
compilation terminated.
tests/CMakeFiles/testaocl.dir/build.make:85: recipe for target 'tests/CMakeFiles/testaocl.dir//oclfft/ard-ocl/src/ard-ocl.cxx.o' failed

Unlike several other compile commands, the one for this file did not include the "-isystem /path/to/opencl/headers" option.

Is this a cmake issue? I have cmake version 3.18.1

b-sumner, Jan 08 '21 20:01