gem5-resources icon indicating copy to clipboard operation
gem5-resources copied to clipboard

resources: Fix hipDeviceSynchronize calls

Open jebraun3 opened this issue 1 year ago • 26 comments

Corrected hipDeviceSynchronize calls to prevent kernels from exiting prematurely

Change-Id: I068b6324f33d0de905477d03efa195e3068dc7c5

jebraun3 avatar Dec 18 '23 22:12 jebraun3

@powerjg this would require creating a new binary. Are you ok with merging in anyways?

mattsinc avatar Feb 26 '24 22:02 mattsinc

We should probably make new binaries before it's merged. Otherwise, we could easily forget.

@Harshil2107 this is going to be on you :)

powerjg avatar Feb 27 '24 01:02 powerjg

Should these binaries be built on top of the changes in #22?

Harshil2107 avatar Feb 29 '24 23:02 Harshil2107

Should these binaries be built on top of the changes in #22?

Should these binaries be built on top of the changes in #22?

This is separate from #22 but if it's easier for you to update binaries once for both, that's fine with me

mattsinc avatar Mar 01 '24 05:03 mattsinc

Should these binaries be built on top of the changes in #22?

Just to clarify, #22 is only adding links to the prebuilts so as long as they are placed in the same link locations then no changes are required in #22.

abmerop avatar Mar 01 '24 14:03 abmerop

I guess I haven't reviewed yet...

What does "exiting prematurely" mean? Is gem5 exiting before the kernels finish? Is the goal here to intentionally serialize the kernels?

@jebraun3 can chime in with more details, but my recollection is that we were seeing gem5 essentially treat the kernel launches as blocking, causing multiple kernels from the same stream to overlap and thus get incorrect data. So we added the hipDeviceSynch's to prevent this.

mattsinc avatar Mar 01 '24 17:03 mattsinc

Is this PR ready to be merged?

Harshil2107 avatar Mar 08 '24 00:03 Harshil2107

I guess I haven't reviewed yet... What does "exiting prematurely" mean? Is gem5 exiting before the kernels finish? Is the goal here to intentionally serialize the kernels?

@jebraun3 can chime in with more details, but my recollection is that we were seeing gem5 essentially treat the kernel launches as blocking, causing multiple kernels from the same stream to overlap and thus get incorrect data. So we added the hipDeviceSynch's to prevent this.

@jebraun3 reminder to reply here to @abmerop , before this can/should be merged in

mattsinc avatar Mar 08 '24 06:03 mattsinc

@jebraun3, could you please let us know what is the status of this PR? Thank you.

ivanaamit avatar Mar 22 '24 15:03 ivanaamit

I guess I haven't reviewed yet... What does "exiting prematurely" mean? Is gem5 exiting before the kernels finish? Is the goal here to intentionally serialize the kernels?

@jebraun3 can chime in with more details, but my recollection is that we were seeing gem5 essentially treat the kernel launches as blocking, causing multiple kernels from the same stream to overlap and thus get incorrect data. So we added the hipDeviceSynch's to prevent this.

@jebraun3 reminder to reply here to @abmerop , before this can/should be merged in

@abmerop do you want more details from @jebraun3 or is my explanation above sufficient?

mattsinc avatar Mar 22 '24 16:03 mattsinc

My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github)

abmerop avatar Mar 22 '24 19:03 abmerop

My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github)

Why would it affect performance? I guess for the CPU part?

mattsinc avatar Mar 22 '24 19:03 mattsinc

My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github)

Why would it affect performance? I guess for the CPU part?

The sync is essentially a barrier which will prevent the new kernel dispatch packet from being placed in the AQL queue and therefore you would be paying an extra latency for sending the packet, ringing the doorbell, etc. which would otherwise be done while the previous kernel was running

abmerop avatar Mar 22 '24 19:03 abmerop

My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github)

Why would it affect performance? I guess for the CPU part?

The sync is essentially a barrier which will prevent the new kernel dispatch packet from being placed in the AQL queue and therefore you would be paying an extra latency for sending the packet, ringing the doorbell, etc. which would otherwise be done while the previous kernel was running

Right, so shaderActiveTicks wouldn't be affected (I think) but the sim_ticks would?

Regardless, @jebraun3 can you please add the details about what overlap you were seeing? Ultimately it likely indicates something is wrong with the barrier bit, which in theory makes the device synch's not needed ...

mattsinc avatar Mar 22 '24 19:03 mattsinc

I ran into this first with the Floyd-Warshall benchmark where running FW in gem5 would produce different results from running it on the actual hardware. From what I remember the result array that was produced left the vast majority of the array untouched, so most of its values were still set to their initialized values. Adding the synchs to the benchmark got gem5 to produce the same result array as the actual hardware.

jebraun3 avatar Mar 24 '24 01:03 jebraun3

Should this be merged or not? If I don't hear back from you until the end of the day today, I will merge it.

ivanaamit avatar Mar 28 '24 17:03 ivanaamit

Should this be merged or not? If I don't hear back from you until the end of the day today, I will merge it.

No, please do not merge until @jebraun3 , @abmerop , and myself come to a conclusion

mattsinc avatar Mar 28 '24 21:03 mattsinc

I now think it's OK to merge this to get the correct result.

If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result?

If that is the case we should circle back and figure out why the original code is different.

abmerop avatar Mar 29 '24 19:03 abmerop

I now think it's OK to merge this to get the correct result.

If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result?

If that is the case we should circle back and figure out why the original code is different.

I believe this is correct -- on the real GPU the code works and gets the correct answer, in gem5 it gets the incorrect answer without this fix. @jebraun3 please confirm.

I guess the question is: should we take this as a sign we need to investigate the barrier bit code again and try to fix it? Or should we patch this even with the shortcomings you highlighted and investigate later?

mattsinc avatar Mar 30 '24 04:03 mattsinc

I now think it's OK to merge this to get the correct result. If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result? If that is the case we should circle back and figure out why the original code is different.

I believe this is correct -- on the real GPU the code works and gets the correct answer, in gem5 it gets the incorrect answer without this fix. @jebraun3 please confirm.

I guess the question is: should we take this as a sign we need to investigate the barrier bit code again and try to fix it? Or should we patch this even with the shortcomings you highlighted and investigate later?

A barrier problem was my first thought. Specifically of the AND variety. And yes I think we should look into that.

abmerop avatar Mar 30 '24 15:03 abmerop

I now think it's OK to merge this to get the correct result. If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result? If that is the case we should circle back and figure out why the original code is different.

I believe this is correct -- on the real GPU the code works and gets the correct answer, in gem5 it gets the incorrect answer without this fix. @jebraun3 please confirm. I guess the question is: should we take this as a sign we need to investigate the barrier bit code again and try to fix it? Or should we patch this even with the shortcomings you highlighted and investigate later?

A barrier problem was my first thought. Specifically of the AND variety. And yes I think we should look into that.

Right, the question is: should we merge this and then fix (and revert later), or just not merge this and try to fix? My concern without merging is if people are using it now, they are not getting the correct answer. And it's unknown how long a fix would take (maybe long, maybe short).

mattsinc avatar Mar 30 '24 17:03 mattsinc

I did not see any problem when running in FS mode, so maybe this is only happening in SE mode? @jebraun3 what does the error look like? The application doesn't seem to print anything

abmerop avatar Apr 01 '24 14:04 abmerop

What @mattsinc said is correct, on the real GPU the code works and gets the correct answer but in gem5 it gets the incorrect answer. An error is only printed when FW is run with error checking, otherwise there is no indication of the error. When printed, the error should be a list of the locations of the mismatches followed by "WARNING: Produced incorrect results!".

jebraun3 avatar Apr 10 '24 22:04 jebraun3

What @mattsinc said is correct, on the real GPU the code works and gets the correct answer but in gem5 it gets the incorrect answer. An error is only printed when FW is run with error checking, otherwise there is no indication of the error. When printed, the error should be a list of the locations of the mismatches followed by "WARNING: Produced incorrect results!".

@jebraun3 It looks like FW's change is not in this commit, is that intentional? I was looking because I believe FW requires the user to manually change the output to check it, right?

mattsinc avatar Apr 11 '24 01:04 mattsinc

Hi, what is the status of this PR?

ivanaamit avatar May 16 '24 15:05 ivanaamit

Hi, what is the status of this PR?

Please just leave it open, I need to talk with @jebraun3 and @abmerop about it

mattsinc avatar May 17 '24 03:05 mattsinc

This is fairly ancient at this point, but looking into it, it seems the HIP runtime (well rocCLR) is smart enough to know if hipDeviceSynchronize is necessary or not for a given device.

Therefore my previous comment no longer applies and I am fine with this PR.

abmerop avatar Oct 20 '24 16:10 abmerop

When this is confirmed "good to go" can you please ping @Harshil2107 to rebuild these binaries and upload them? I want to make sure the source and the binaries are in-sync here.

BobbyRBruce avatar Oct 21 '24 02:10 BobbyRBruce

When this is confirmed "good to go" can you please ping @Harshil2107 to rebuild these binaries and upload them? I want to make sure the source and the binaries are in-sync here.

@jebraun3 is away on internship this semester and I'm guessing will not be able to reply too much. With the context @abmerop provided (thanks!) I think this patch is ok as is to go now, @BobbyRBruce , unless we want to switch it to be on stable instead of develop?

@jebraun3 it looks like we'll need a similar patch with the FW updates, but since you are away I'd rather we integrate this then wait more time for that one benchmark.

mattsinc avatar Oct 23 '24 03:10 mattsinc

Anything on the stable branch needs to be compatible with the current version of gem5 (that is, gem5 v24.0). develop is used to submit changes that should only be merged into stable upon the next release of gem5 (i.e., it requires some feature in the upcoming release and would fail on 24.0).

In short, go for stable if the resource works with 24.0.

BobbyRBruce avatar Oct 24 '24 04:10 BobbyRBruce