clspv generates wrong output result for outpursortedeven algorithm
Dear clspv team,
We ran a outputSortedEven algorithm using clspv and it seems like there are issues with clspv .
Algorithm explanation: input: array of integers output: sorted array (contains only even numbers) Explanation: We're trying to sort the array in ascending order and all the elements in output array must be even numbers, so for example A=[8, 2, 7, 1, 4] is input array then output array will be Output=[0, 2, 4, 6, 8], after sorting the array we're converting all the odd elements of the array into even by subtracting 1 from them. Also we are trying to store the indexes after sorting the array.
original outputSortedEven : https://godbolt.org/z/xs1q3se5v --> Wrong output modified outputSortedEven kernel with volatile added for is_repeating boolean variable : https://godbolt.org/z/a96M3beYs --> Correct output
Input: 8, 1, 3, 7, 11, 13 Correct Output: Output value array: 0, 2, 6, 8, 10, 12 Output index array: 1, 2, 3, 0, 4, 5
Output we are getting with clspv(1475) + llvm(134094): Output value array: 0, 2, 2147483647, 2147483647, 2147483647, 2147483647 Output index array: 1, 2, 2, 2, 2, 2
clspv top commit used by us: https://github.com/google/clspv/commit/0e20b2895e4b58e62311f5c311bfec4599f517e0 0e20b289 (HEAD) Fix SimplifyPointerBitcastPass hang on const GEP GVs (#1475) 0f171d76 Support multiple spirvop in the same kernel (#1477)
llvm src top commit used by: 7baa7edc00c5 (HEAD) [libclc]: clspv: add a dummy implementation for mul_hi (#134094) edc22c64e527 [X86] getFauxShuffleMask - only handle VTRUNC nodes with matching src/dst sizes (#134161)
The problem is llvm InstructionCombine and JumpThreading Pass are aggressively optimizing the is_repeating variable used in the algorithm.
Current observation: clspv O2 optimization level -> Issue clspv O1 optimization level -> issue clspv O0 optimization level -> No Issue clspv O2 optimization level + llvm InstructionCombine Pass Disable -> Issue clspv O2 optimization level + llvm InstructionCombine Pass + JumpThreading Pass Disabled -> No Issue
so our point is from clspv side do we need to call the llvm passes in a different order or we need to create a custom pass in clspv which will determine that this cl kernel doesn't need to go through InstructionCombine and JumpThreading passes.
If you want llvm dumps or anything more data please let me know.
The problem is llvm InstructionCombine and JumpThreading Pass are aggressively optimizing the is_repeating variable used in the algorithm.
Aggressive optimization does not imply potential breakage of the program behavior.
If a LLVM pass has a bug, we would need to make a unit test showing the issue and make a fix, or find help in the LLVM community to make a fix. But that would be quite surprising (while not impossible).
I think the most probable thing here would be an issue in one of clspv passes, leading LLVM to remove things that should not have been removed.
I have not found the issue yet. But while trying, I have modified the kernel to understand it better, and I ended up on a version not triggering the issue. Could you try it to see if it works on your side, and if it is doing the same algorithm?
static bool idx_already_sorted(uint in_i, uint out_i, global uint *out_idx) {
for (uint i = 0; i < out_i; ++i) {
if (in_i == out_idx[i])
return true;
}
return false;
}
__kernel void outputSortedEven(const uint vec_size, __global const uint *in_vec, __global uint *out_vec, __global uint *out_idx)
{
for (uint out_i = 0; out_i < vec_size; ++out_i)
{
uint min_idx = INT_MAX;
uint min_val = INT_MAX;
for (uint in_i = 0; in_i < vec_size; ++in_i)
{
uint val = in_vec[in_i];
val = val % 2 ? val - 1 : val;
if (val < min_val)
{
if (idx_already_sorted(in_i, out_i, out_idx))
continue;
min_val = val;
min_idx = in_i;
}
}
out_vec[out_i] = min_val;
out_idx[out_i] = min_idx;
}
}
@rjodinchr the code mentioned in https://github.com/google/clspv/issues/1486#issuecomment-2893322196 seems working fine
Input: 8, 1, 3, 7, 11, 13 Output Value: 0, Output Value Index: 1 Output Value: 2, Output Value Index: 2 Output Value: 6, Output Value Index: 3 Output Value: 8, Output Value Index: 0 Output Value: 10, Output Value Index: 4 Output Value: 12, Output Value Index: 5
But question is for the original cl kernel I shared why is_repeating getting optimized out.
But question is for the original cl kernel I shared why is_repeating getting optimized out.
I don't think is_repeating is getting optimized out. Yes when it is changed to volatile, it works. But the reason for that change to fix the code is much more complex. I'm still trying to figure it out, but it is not easy considering the complexity of the kernel (especially regarding the number of branches that are to be dealt with in a particular manner for GPU programming).
I'm keeping the bug open as long as we do not have found/fixed the issue, but I'm hoping to find a smaller reproducer because at the moment it is complicated to find the real bug with that one.
@rjodinchr I tried many small reproducer for the same algorithm but not able to reproduce the exact issue. the algorithm is almost similar like sorting an array. also algorithm like finding top or least k elements in an array but the lines of code and complexity will be similar. If you have some suggestion I can try please let me know.