llvm [SYCL] Fix post-commit failure

May 06 '24 08:05 keyradical

@konradkusiak97 , why can't this be tested locally? Our CI has limited resources, we wouldn't want the entire development be done in CI. Logs for post-commit CI tasks contain options passed to buildbot/configure.py and that should provide you enough information to reproduce this locally.

May 08 '24 21:05 aelovikov-intel

@konradkusiak97 , why can't this be tested locally? Our CI has limited resources, we wouldn't want the entire development be done in CI. Logs for post-commit CI tasks contain options passed to buildbot/configure.py and that should provide you enough information to reproduce this locally.

@aelovikov-intel, one of the previous failures was only appearing in the post-commit CI and wasn't reproducible locally which is why I was trying to get some insight into the problem here. But I understand your concern about the limited resources of the CI. I'll close this PR and try to work on the fix locally.

May 08 '24 21:05 keyradical

@konradkusiak97 , why can't this be tested locally? Our CI has limited resources, we wouldn't want the entire development be done in CI. Logs for post-commit CI tasks contain options passed to buildbot/configure.py and that should provide you enough information to reproduce this locally.

@aelovikov-intel, one of the previous failures was only appearing in the post-commit CI and wasn't reproducible locally which is why I was trying to get some insight into the problem here. But I understand your concern about the limited resources of the CI. I'll close this PR and try to work on the fix locally.

I don't mind using CI this way if the failure is specific to the HW we use in CI. Otherwise, I'd be happy to help you reproduce locally or share a way to trigger post-commit only (without running pre-commit in addition to post).

May 08 '24 21:05 aelovikov-intel

@konradkusiak97 , why can't this be tested locally? Our CI has limited resources, we wouldn't want the entire development be done in CI. Logs for post-commit CI tasks contain options passed to buildbot/configure.py and that should provide you enough information to reproduce this locally.

@aelovikov-intel, one of the previous failures was only appearing in the post-commit CI and wasn't reproducible locally which is why I was trying to get some insight into the problem here. But I understand your concern about the limited resources of the CI. I'll close this PR and try to work on the fix locally.

I don't mind using CI this way if the failure is specific to the HW we use in CI. Otherwise, I'd be happy to help you reproduce locally or share a way to trigger post-commit only (without running pre-commit in addition to post).

That sounds great, thanks. I'll get back to you if my efforts to reproduce this locally will turn out to be fully unsuccessful and I'll need further help.

May 08 '24 21:05 keyradical

I didn't manage to reproduce this failure on a different Intel gpus that I have access to. It seems it is specific to Iris(R) Xe Graphics unless something different is going on.

For that latest post-commit run, I modified test-e2e/out_of_order_queue_status.cpp to use Q.memset() instead of Q.fill() and the test fails again, with the same error message as after merging https://github.com/intel/llvm/pull/12702 which means there might be a bug in the current memset implementation.

I've noticed that compared to the same test passing on the pre-commit run, the DPC++ is built with clang/clang++ on the post-commit which might be the reason for the difference. I tried reproducing it running on:

[level_zero:gpu] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.29138]

with this configuration:

export CXX=clang++
export CC=clang

python ../llvm/buildbot/configure.py -t Release --ci-defaults --shared-libs --no-assertions \
 --cmake-opt="-DSYCL_ENABLE_STACK_PRINTING=ON" \
 --cmake-opt="-DSYCL_LIB_WITH_DEBUG_SYMBOL=ON" \
 --cmake-opt="-DLLVM_INSTALL_UTILS=ON" \
 --cmake-opt="-DNATIVECPU_USE_OCK=Off" \
 --cmake-opt="-DSYCL_PI_TESTS=OFF" \
-o ./

and running the test with:

clang++ -fsycl -fsycl-targets=spir64 out_of_order_queue_status_memset.cpp -o res
env ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./res

@aelovikov-intel do you think it might be a bug specific to Iris(R) Xe Graphics? If so, I'll file a bug report.

May 10 '24 11:05 keyradical

@aelovikov-intel do you think it might be a bug specific to Iris(R) Xe Graphics? If so, I'll file a bug report.

Not sure I understand this. Who/what component do you want to file that bug against?

May 13 '24 16:05 aelovikov-intel

@aelovikov-intel do you think it might be a bug specific to Iris(R) Xe Graphics? If so, I'll file a bug report.

Not sure I understand this. Who/what component do you want to file that bug against?

The test which I added in this PR: out_of_order_queue_status_memset.cpp uses Q.memset instead of Q.fill and produces the same failure as after merging https://github.com/intel/llvm/pull/12702.

So the faulty component would be the current implementation of memset. The failure is specific to the Level-zero backend on Iris(R) Xe Graphics and DPC++ built with clang.

May 13 '24 17:05 keyradical