TornadoVM
Different results from the same function using the kernel interface, depending on the LocalGroup size setting and the input size.
Describe the bug
We are not sure whether this is a bug or a property of the generated OpenCL kernel that we do not fully understand (e.g. whether the input array size must be even or a power of 2, or what the required relation is between the GlobalGroup/LocalGroup sizes and the input size).
When the input size or the local size changes, our algorithm does not execute on all input data and yields wrong results (which is the case most of the time).
Using the same code as described in https://github.com/beehive-lab/TornadoVM/issues/157, we see that the algorithm is not stable, i.e. it does not produce correct results over the whole input. Depending on the combination of the input array sizes, the GlobalWork size and the LocalWork size, it seems to skip or overlap certain slots of the amplitudes array.
workerGrid.setGlobalWork(4096, 1, 1); // single dimension
int tileSize = 128;
tileSize = amplitudes.length < tileSize ? amplitudes.length : tileSize;
workerGrid.setLocalWork(tileSize, 1, 1);
This works for some values of amplitudes.length but not for others. From the results we see that the kernel sometimes does not run over all values of the amplitudes array, or that the tileSize loop does not match either the array size or the LocalWork size.
Setting workerGrid.setLocalWorkToNull() does not help.
Could you clarify what the relation should be between the LocalWork size and the parallelized loop indexing as written in our code?
Why are not all values of the amplitudes array filled?
We assume TornadoVM should chunk the work based on tileSize (the LocalGroup size we use in the second outer loop); is that a false assumption?
public static void computeKernel(KernelContext context, int tileSize,
        final double infirstScanFrequency, final double frequencyStep,
        final double deltaEpsilon, final double[] normObsTimes,
        final double[] normObsValues, final double[] amplitudes, int[] localGroups) {
    int tileIdx = context.globalIdx;
    int startFreqIdx = tileIdx * tileSize;
    int nObservations = normObsTimes.length;
    final double omega = (infirstScanFrequency + startFreqIdx * frequencyStep) * 2 * Math.PI; // starting frequency
    final double deltaOmega = frequencyStep * 2 * Math.PI;                                    // frequency step
    final double[] sumSx = new double[constTileSize];    // sum of sin(obsTimes[i]*2pi*f)
    final double[] sumCx = new double[constTileSize];    // sum of cos(obsTimes[i]*2pi*f)
    final double[] sumSx2 = new double[constTileSize];   // sum of sin(obsTimes[i]*2pi*f)*sin(obsTimes[i]*2pi*f)
    final double[] sumCx2 = new double[constTileSize];   // sum of cos(obsTimes[i]*2pi*f)*cos(obsTimes[i]*2pi*f)
    final double[] sumSxCx = new double[constTileSize];  // sum of sin(obsTimes[i]*2pi*f)*cos(obsTimes[i]*2pi*f)
    final double[] sumSxVal = new double[constTileSize]; // sum of sin(obsTimes[i]*2pi*f)*obsValue[i]
    final double[] sumCxVal = new double[constTileSize]; // sum of cos(obsTimes[i]*2pi*f)*obsValue[i]
    // for each set of observation data
    for (int i = 0; i < nObservations; i++) {
        final double obsTime = normObsTimes[i];
        final double obsValue = normObsValues[i];
        // calculate the starting phase and its sine and cosine
        final double phase = obsTime * omega;
        double sPh = Math.sin(phase);
        double cPh = Math.cos(phase);
        // calculate the phase step and its sine and cosine
        final double dPhase = obsTime * deltaOmega;
        final double sDPh = Math.sin(dPhase);
        final double cDPh = Math.cos(dPhase);
        // for each frequency to test, increment the phase with the phase step
        for (int j = 0; j < tileSize; j++) {
            sumSx[j] += sPh;
            sumCx[j] += cPh;
            sumSx2[j] += sPh * sPh;
            sumCx2[j] += cPh * cPh;
            sumSxCx[j] += cPh * sPh;
            sumSxVal[j] += sPh * obsValue;
            sumCxVal[j] += cPh * obsValue;
            final double cT = cPh;
            cPh = cT * cDPh - sPh * sDPh;
            sPh = sPh * cDPh + cT * sDPh;
        }
    }
    // We see that amplitudes[startFreqIdx + i] is not fully filled, depending on the LocalGroup/GlobalGroup/amplitudes.length sizes. Why?
    // calculate intermediate variables and results
    for (int i = 0; i < tileSize; i++) {
        final double d = nObservations * (sumCx2[i] * sumSx2[i] - sumSxCx[i] * sumSxCx[i])
                - sumCx[i] * sumCx[i] * sumSx2[i] - sumSx[i] * sumSx[i] * sumCx2[i]
                + 2 * sumSx[i] * sumCx[i] * sumSxCx[i];
        if (d < deltaEpsilon) {
            amplitudes[startFreqIdx + i] = MAXRANGE;
        } else {
            final double b = sumCx[i] * sumSxVal[i] - sumSx[i] * sumCxVal[i];
            final double c1 = nObservations * (sumCxVal[i] * sumSx2[i] - sumSxVal[i] * sumSxCx[i]) + sumSx[i] * b;
            final double c2 = nObservations * (sumSxVal[i] * sumCx2[i] - sumCxVal[i] * sumSxCx[i]) - sumCx[i] * b;
            amplitudes[startFreqIdx + i] = (c1 * sumCxVal[i] + c2 * sumSxVal[i]) / d;
        }
    }
}
Executed as:
WorkerGrid workerGrid = new WorkerGrid1D(nScanFrequencies);
GridScheduler gridScheduler = new GridScheduler("PeriodSearch.t0", workerGrid);
KernelContext context = new KernelContext();
// Set the global work size, as we slice by frequencies
workerGrid.setGlobalWork(4096, 1, 1); // single dimension; is it hardware dependent?
// Set the local work group based on the max thread count available.
// We do not check local memory here, but we should, so that the max global size
// stays below manually computed thresholds (per-tile allocation sizes in the kernel x nProcessors).
long nProcessors = TornadoRuntime.getTornadoRuntime().getDriver(0).getDevice(deviceNum).getDeviceMaxWorkgroupDimensions()[0];
// int tileSize = (int) (nScanFrequencies / nProcessors); // could we have a chunk size depending on the number of compute cores?
int tileSize = 128;
tileSize = amplitudes.length < tileSize ? amplitudes.length : tileSize;
workerGrid.setLocalWork(tileSize, 1, 1);
TaskSchedule task1 = new TaskSchedule("PeriodSearch") //
        .streamIn(amplitudes)
        .task("t0", MethodLeastSquareGPU::computeKernel, context, tileSize, firstScanFrequency, frequencyStep, deltaEpsilon, normObsTimes, normObsValues, amplitudes, localGroup)
        .streamOut(amplitudes, localGroup);
task1.execute(gridScheduler);
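To make the sizing question concrete, this is the relation we assume between the grid and the data (a sketch with our variable names; the helper name configureGrid is made up, and the assumption itself may be what is wrong):
    static void configureGrid(WorkerGrid workerGrid, int nScanFrequencies) {
        int globalWork = 4096;                          // number of kernel executions (threads)
        int tileSize = Math.min(128, nScanFrequencies); // frequencies handled per thread
        // Our expectation: globalWork * tileSize covers nScanFrequencies,
        // with thread g filling amplitudes[g * tileSize .. (g + 1) * tileSize - 1].
        workerGrid.setGlobalWork(globalWork, 1, 1);
        workerGrid.setLocalWork(tileSize, 1, 1);
    }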
How To Reproduce
Run the code above.
Expected behavior
Repeatable control over the tileSize (LocalGroup size?) chunking value inside the inner and outer loops of the function, as this affects the results obtained.
Computing system setup (please complete the following information):
- OS: [e.g. Ubuntu 20]
- OpenCL Version: 1.2, 2, 3
- TornadoVM 0.13-dev
Additional context
We would appreciate any hints to pinpoint the algorithm's misbehaviour; hopefully it is something trivial on our side.
When using the KernelContext and GridScheduler APIs, the developer is in full control of how to execute and deploy the application on the parallel hardware, similarly to OpenCL, CUDA or SYCL.
When setting the localWorkGroup to null, the corresponding driver will set up a block of threads for the user. However, if you are sharing memory (local memory in OpenCL, shared memory in CUDA/PTX), I guess it is better to control the block size yourself.
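For illustration, a minimal sketch of both options using the same WorkerGrid/GridScheduler calls you already use (the sizes here are placeholders):
    WorkerGrid workerGrid = new WorkerGrid1D(4096);                             // global number of threads
    GridScheduler gridScheduler = new GridScheduler("PeriodSearch.t0", workerGrid);
    // Option A: explicit local work-group (block) size, useful when you rely on local/shared memory
    workerGrid.setLocalWork(128, 1, 1);
    // Option B: let the driver choose the block size
    // workerGrid.setLocalWorkToNull();
    // taskSchedule.execute(gridScheduler);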
A few things:
- From your code, you are not sharing local memory; you are only using global memory and private memory, unless you mean something different with the last parameter array (localGroups).
- Having a quick look at your code, it looks to me like all threads (GPU threads) are running the same code over the same data sets. I guess this is not what you want (for example in the first loop); see the sketch after this list.
- TornadoVM has an aggressive JIT compiler and it specializes the input program using a sort of partial evaluation. If the data sizes change, you will need to recompile the program before re-launching.
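To illustrate the second point, here is a minimal sketch (a made-up kernel, not your algorithm) of the usual pattern where each GPU thread works on a distinct element via context.globalIdx, with a bounds guard for when the global size exceeds the data size:
    public static void scaleKernel(KernelContext context, double[] input, double[] output) {
        int i = context.globalIdx;      // unique index per GPU thread
        if (i < input.length) {         // guard: global work size may exceed the data size
            output[i] = 2.0 * input[i]; // each thread reads and writes only its own slot
        }
    }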
Are you porting this code from existing OpenCL/CUDA code? Note that with the KernelContext API, the programmer must provide a parallel solution, so you should think in terms of a parallel implementation from the beginning. I wonder whether the code you show works as expected in plain OpenCL.
If you prepare a test case we can pull and run, I am happy to take a look.
Big thanks for checking and for the support on this; hopefully my understanding of GPU/OpenCL basics will improve fast. For now some of our assumptions might be plain wrong:
- Ad 1) I assumed that we would have GlobalWork executions, each of localGroup size, so in our case the kernel work per execution would be amplitudes.length / GlobalWork, which should equal the localSize (tileSize).
- Ad 2) The first loop should always be the same (the length of the time series, which only changes between Java calls), but the assumption was that the inner loop should start and end at a different frequency window/slice/tile, defined by [globalId * localGroupSize, (globalId + 1) * localGroupSize) for each execution of the kernel (see the sketch after this list). This is the spectral window for which we want to run the least-squares frequencygram construction and, due to the nature of the algorithm, the only natural slicing/tiling without mathematically rewriting it (if that is even possible; I cannot see how at the moment). We then want to fill in the results for this specific spectral window in amplitudes. Each index of the amplitudes vector and of the private vectors is associated with a different frequency.
- Ad 3) So is there any possibility to write such functions so that this recompilation is not needed, as we would have billions of calls to them?
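To make Ad 2) concrete, this is the window each kernel execution is meant to fill (a sketch fragment with our variable names, plus an explicit bounds guard we do not currently have; whether this matches what TornadoVM actually launches is exactly what we are unsure about):
    int startFreqIdx = context.globalIdx * tileSize;  // first frequency of this thread's window
    for (int i = 0; i < tileSize; i++) {
        int freqIdx = startFreqIdx + i;
        if (freqIdx >= amplitudes.length) {
            break;                                    // partial last tile
        }
        // ... accumulate the least-squares sums for frequency freqIdx ...
        // amplitudes[freqIdx] = <amplitude for this frequency>;
    }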
This is our first attempt to port this particular algorithm to any GPU code. We had a Deeming period-search proof-of-concept implementation (similar to a DFT but optimized for unevenly sampled data) written directly in CUDA by a student nearly ten years ago, but since the whole project is Java-based we would be eager to try with the one above (LSQ), especially with our private cluster having hundreds of Intel GPUs and the possibility of cloud-bursting to CUDA as well.
I will try to extract the code and a unit test; it would be much appreciated if you could take a look. I will also provide a visual explanation of the results we expect.
I am closing this issue as I think it has been solved. Feel free to reopen it if needed.