[E2E][CUDA] Add barrier before all_of_group in ballot_group_algorithms test.
Fixes https://github.com/intel/llvm/issues/12995, a failure seen with CUDA 12.4.
all_of_group lowers to the vote.sync.all PTX instruction in the CUDA backend. It seems that with CUDA 12.4 all members of the non-uniform ballot group must be in converged control flow when that instruction executes, otherwise the test fails.
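For readers unfamiliar with the vote semantics, here is a minimal host-side sketch of the result an "all" vote produces for a given member mask. It is plain C++, not device code. The names vote_all_model and ballot_group_mask are hypothetical and do not come from the test or the CUDA backend. The model deliberately ignores exited threads and convergence, which is exactly the part that misbehaves here.

```cpp
#include <cstdint>

// Host-side model of a 32-lane warp vote (illustrative only, not device code).
// member_mask: bit i set => lane i is a member of the group taking the vote.
// pred_bits:   bit i set => lane i's predicate is true.
constexpr bool vote_all_model(std::uint32_t member_mask,
                              std::uint32_t pred_bits) {
  // True iff every member lane has its predicate set.
  return (pred_bits & member_mask) == member_mask;
}

// Hypothetical helper: the member mask of the ballot group made up of the
// lanes whose predicate equals `value`.
constexpr std::uint32_t ballot_group_mask(std::uint32_t pred_bits, bool value) {
  return value ? pred_bits : ~pred_bits;
}

// Lanes 0..15 have a true predicate, so within the "true" ballot group
// every member votes true.
static_assert(vote_all_model(ballot_group_mask(0x0000FFFFu, true), 0x0000FFFFu),
              "all members of the 'true' group have a true predicate");

// Across the whole warp the same predicate is not true for all lanes.
static_assert(!vote_all_model(0xFFFFFFFFu, 0x0000FFFFu),
              "the full warp does not vote all-true");
```

The model only describes the expected result. Whether the hardware produces it when the member lanes are not converged is the question this PR works around with the barrier.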
From my understanding, this change shouldn't be necessary for sm_70 and above, since the documented convergence requirement quoted below applies only to sm_6x and below: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions
"For .target sm_6x or below, all threads in membermask must execute the same vote.sync instruction in convergence, and only threads belonging to some membermask can be active when the vote.sync instruction is executed. Otherwise, the behavior is undefined."
I think this is a ptxas bug, but I'm adding a barrier here so that the test passes once we switch to CUDA 12.4. The test already passes with CUDA 12.3 and below, and the generated PTX is no different with CUDA 12.4, so I think this must be a ptxas/SASS issue.
Note that, strictly speaking, we support sm_5x, which would require the barrier here anyway. In practice these "Maxwell" cards are very rarely used because that generation has no data centre cards. We are sometimes asked about "Kepler" (sm_3x) support, which we don't officially provide because it is below sm_50, but I don't remember ever seeing an sm_5x request or issue.