aparapi icon indicating copy to clipboard operation
aparapi copied to clipboard

Java Alternative Algorithm does not work for arbitrary NDRanges

Open Helios85 opened this issue 6 years ago • 13 comments

Helios85 avatar Nov 27 '18 20:11 Helios85

Aparapi "falls back" to a "Java Alternative Algorithm" (JAA) in cases where the Java byte code cannot be translated to OpenCL code. The JAA, however, only works if the NDRange is an exact power of 2. The following example illustrates the problem:


public class JAATest {
	public static void main(String[] args) {
		int[] a = {1, 2, 3, 4, 5, 6, 7};
		
		new Kernel() {
			@Override public void run() {
				System.out.println(); // induces fallback
				a[getGlobalId()] = 0;
			}
		}.execute(a.length);
		
		System.out.println(java.util.Arrays.toString(a));
	}
}

Execution yields the following incorrect output:

[0, 0, 0, 0, 5, 6, 7]

Helios85 avatar Nov 27 '18 20:11 Helios85

It seems to be multiples of 4, not powers of 2. Anything below 4 is 0, below 16 is 12.

NorbiPeti avatar Mar 06 '19 12:03 NorbiPeti

Hmmm that is interesting. I will have to investigate this issue again and see if that provides any new clues.

On Wed, Mar 6, 2019 at 1:04 PM NorbiPeti [email protected] wrote:

It seems to be multiples of 4, not powers of 2. Anything below 4 is 0, below 16 is 12.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Syncleus/aparapi/issues/142#issuecomment-470082868, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5JAscx41TgYJOkMnzY71hvzbJYO-Bqks5vT67VgaJpZM4Y2SiB .

freemo avatar Mar 06 '19 13:03 freemo

Same problem here. If NDRange is a multiple of available threads (12 in my case), all is fine. If not, then the results are (partially) incorrect. Tested by fiddling with ArrayTest (changing SIZE constant and forcing Java fall-back by adding a println of current id). Edit: validated a bit less hacky by explicitly setting THREAD_POOL as the preferred device using KernelManager.

rkraneis avatar Jul 04 '20 09:07 rkraneis

SIZE = 6 (less than number of threads)

Running com.aparapi.runtime.ArrayTest
expected[0]: [0, 1, 2, 3, 4, 5]
target[0]: [99, 99, 99, 99, 99, 99]
expected[1]: [1, 2, 3, 4, 5, 6]
target[1]: [99, 99, 99, 99, 99, 99]
expected[2]: [2, 3, 4, 5, 6, 7]
target[2]: [99, 99, 99, 99, 99, 99]
expected[3]: [3, 4, 5, 6, 7, 8]
target[3]: [99, 99, 99, 99, 99, 99]
expected[4]: [4, 5, 6, 7, 8, 9]
target[4]: [99, 99, 99, 99, 99, 99]
expected[5]: [5, 6, 7, 8, 9, 10]
target[5]: [99, 99, 99, 99, 99, 99]

SIZE = 12 (number of threads)

Running com.aparapi.runtime.ArrayTest
expected[0]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
target[0]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
expected[1]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
target[1]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
expected[2]: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
target[2]: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
expected[3]: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
target[3]: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
expected[4]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
target[4]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
expected[5]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
target[5]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
expected[6]: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
target[6]: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
expected[7]: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
target[7]: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
expected[8]: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
target[8]: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
expected[9]: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
target[9]: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
expected[10]: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
target[10]: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
expected[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

SIZE = 13 (number of threads + 1)

...
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
expected[12]: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
target[12]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]

SIZE = 23 ( 2 * number of threads - 1)

...
expected[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
expected[12]: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]
target[12]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[13]: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]
target[13]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[14]: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]
target[14]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[15]: [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]
target[15]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[16]: [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]
target[16]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[17]: [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
target[17]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[18]: [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
target[18]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[19]: [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]
target[19]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[20]: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
target[20]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[21]: [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
target[21]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[22]: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]
target[22]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]

From this it seems as if the final non-full batch is not calculated at all (arrays are initialized with 99 in this test).

rkraneis avatar Jul 04 '20 09:07 rkraneis

Setting SEQUENTIAL as the preferred device would work as a fallback ...

rkraneis avatar Jul 04 '20 09:07 rkraneis

This seems to be boiling down to execute(int) using createRange which uses device information to then create a range global:23 local:(derived)12 ... When manually creating the range with a null device and yielding global:23 local:(derived)23 also the thread pool execution leads to correct results. So localSize0 will then be only 12 during thread pool creation in KernelRunner.

rkraneis avatar Jul 04 '20 10:07 rkraneis

This optimization(?)

if (_device == JavaDevice.THREAD_POOL) {
  withoutLocal.setLocalSize_0(Runtime.getRuntime().availableProcessors());
  withoutLocal.setLocalIsDerived(true);
  return withoutLocal;
} else 

is only happening for create, but not for create2D nor for create3D. Removing it makes everything work. But there probably was originally a reason for adding it ...

Edit: it also contradicts the javadoc, which reads

The groupsize will be chosen such that _localWidth > 0 && _localWidth < MAX_GROUP_SIZE && _globalWidth % _localWidth==0 is true

which is what the thread scheduling relies upon.

rkraneis avatar Jul 04 '20 11:07 rkraneis

@rkraneis Thanks for your work on finding the root cause. The reason for that code being there is that Java performance is likely to be best if the number of Java Thread Pool threads doesn't exceed the CPU available cores (eventually Hyper-Threading siblings are also included in the count which is undesirable for High Performance Computing workloads, anyway). So either this is removed, since it can violate the rule _globalWith % _localWidth == 0, or the logic must be modified so that it respects such logic. @freemo What is your opinion here?

CoreRasurae avatar Jul 09 '20 18:07 CoreRasurae

@Helios85 Can you confirm @rkraneis theory for the root cause of this bug?

CoreRasurae avatar Jul 09 '20 18:07 CoreRasurae

@CoreRasurae I can indeed confirm that the Java Alternative Algorithm only works if NDRange is a multiple of the available logical processors. I tested this with two machines, one with 4, the other with 8 logical processors. In the first case, NDRange had to be a multiple of 4 for the alternative algorithm to work correctly, in the second case NDRange had to be a multiple of 8. @rkraneis Congratulations for finding the cause of this problem!

Helios85 avatar Jul 09 '20 19:07 Helios85

@Helios85 Thanks for confirming this.

CoreRasurae avatar Jul 09 '20 21:07 CoreRasurae

Tested whether the bug is still there on my device using the provided class JAATest. It printed [1, 2, 3, 4, 5, 6, 7], so works just fine. Tested on Ryzen with surely not 7 kernels. Seems like the bug may be gone, verification would be good tough.

KaiGerdMueller avatar Aug 29 '23 11:08 KaiGerdMueller