YACCLAB icon indicating copy to clipboard operation
YACCLAB copied to clipboard

Data-race in CUDA HA4 Implementation

Open N7Alpha opened this issue 1 year ago • 3 comments

Background

When processing large batches tall images (10000-100000 rows, 64 columns) it exposed a data-race which causes out-of-bounds reads in the labeling_hennequin_2018_HA4.cu algorithm implementation. This only occurs in every ~10-100 runs of the kernel. This implicit sync optimization is the cause. To fix it all you have to do is remove the #ifdef check, that is unconditionally call __syncthreads(). I used compute-sanitizer --racecheck to confirm the existence of this bug. In the data I was working with adding the synchronization lead to a 5% runtime performance regression to the StripLabeling() kernel. My machine was RHEL with an Ampere GPU running with CUDA 11.4. I can't give you example code to reproduce the issue since I'm not allowed to share the code.

Solution

I think the check is supposed to be __CUDA_ARCH__ >= 700 instead of __CUDA_ARCH__ < 700? From what I remember reading Nvidia has been removing implicit synchronization over time and not adding it. I could not name the exact feature the #ifdef block is trying to get at. Personally I would just comment out the #ifdef entirely and leave a comment explaining why. That's what I did in my own code. The specific optimization is mentioned in the original HA4 paper (I think) so adding an explainer at that point should dissuade the user from making the same mistake.

N7Alpha avatar Feb 01 '25 20:02 N7Alpha

@N7Alpha thank you for spotting this issue! Would you like to create a pull request to get appropriate credits for the fix?

prittt avatar Feb 13 '25 16:02 prittt

Sure I can do that. Do you happen to know if it’s just inverting the comparison operator? If not I’ll check go through the original paper and figure out what it should be 👍

N7Alpha avatar Feb 13 '25 16:02 N7Alpha

Hey @N7Alpha,

Do you happen to know if it’s just inverting the comparison operator?

Totally not sure. I thought I would have the time to check it out soon, but I was totally wrong 😞. Sorry for that.

prittt avatar Mar 19 '25 10:03 prittt