myocyte benchmark from HeCBench significantly slower with chipStar than with SYCL [LevelZero backend]

Open franz opened this issue 2 years ago • 10 comments

Without immediate queues, chipStar is ~100x slower; with immediate queues it is ~10x slower. My initial examination points to many (possibly unnecessary) barrier commands, but in any case this needs to be investigated.

franz avatar Aug 25 '23 14:08 franz

You should run it through iprof to see if the kernels themselves aren't extra slow. Are atomics used?

pvelesko avatar Aug 25 '23 14:08 pvelesko

There's something fishy with the immediate command lists (ICL) and the barriers. When I enable ICL, GAMESS fails and floods the log with errors like these:

CHIP error [TID 32150] [1692974685.965693710] : hipErrorTbd (ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY ) in /home/pjaaskel/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:1351:enqueueBarrierImpl

CHIP error [TID 32150] [1692974685.975379722] : Caught Error: hipErrorTbd
CHIP error [TID 32150] [1692974685.977357058] : hipErrorTbd (ZE_RESULT_ERROR_OUT_OF_DEVICE_MEMORY ) in /home/pjaaskel/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:1351:enqueueBarrierImpl

This could be related to the reported issue here (suspected excessive barrier usage).

pjaaskel avatar Aug 25 '23 14:08 pjaaskel

What GPU is being used?

pvelesko avatar Aug 25 '23 14:08 pvelesko

In my case, an iGPU; in Michal's (franz's) case, a PVC.

pjaaskel avatar Aug 25 '23 14:08 pjaaskel

I opened a separate issue (#612) for my problem above, which is still occurring.

pjaaskel avatar Aug 31 '23 12:08 pjaaskel

Removing the barriers (and using event dependencies instead) significantly reduced the difference, to ~4x slower. There was also a kernel-side difference: SYCL was using fast-math by default, and the kernels call pow/exp a lot, so SYCL was using native_pow / native_exp. Recompiling the SYCL version without fast-math brought the difference down to 1.3x-1.4x.

franz avatar Sep 19 '23 12:09 franz

30-40% is still significant. Any clue what drags chipStar down still?

pjaaskel avatar Sep 19 '23 12:09 pjaaskel

@pjaaskel no, not yet.

franz avatar Sep 19 '23 13:09 franz

@franz

Do you mean the SYCL compiler enables fast math by default? I checked the Makefile and it does not have the fast math flag.

zjin-lcf avatar Sep 21 '23 22:09 zjin-lcf

Do you mean the SYCL compiler enables fast math by default? I checked the Makefile and it does not have the fast math flag.

This depends on the compiler. The Intel compiler icpx enables fast math (and sets the optimization level to -O2) by default, while GCC and Clang do not.

linehill avatar Sep 22 '23 05:09 linehill