Slice has duration of "Did not end."

Open dwchang79 opened this issue 2 years ago • 15 comments

Some of my omnitrace proto files "did not end" according to Perfetto. For my work, I am trying to find the end time for certain kernels and when they do not end, it leads to a -1 run time. I heard this is a known issue, but I was told to formally submit this bug (and my other 2) so that they can be properly tracked.
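A minimal sketch of how the -1 run time could be guarded against when post-processing slices. It assumes the slices have already been exported as rows with `ts` and `dur` fields, and that an unfinished slice carries a `dur` of -1, consistent with the -1 run time described above. The row layout, the kernel names, and the helper names are hypothetical and are not part of omnitrace or Perfetto.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption: a slice that never received an end event has dur == -1.
UNFINISHED = -1


def end_time(ts: int, dur: int) -> Optional[int]:
    """Return the slice end timestamp, or None if the slice did not end."""
    if dur == UNFINISHED:
        return None
    return ts + dur


def split_slices(rows: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Separate finished slices from slices that did not end."""
    finished: List[Dict] = []
    unfinished: List[Dict] = []
    for row in rows:
        if row["dur"] == UNFINISHED:
            unfinished.append(row)
        else:
            finished.append(row)
    return finished, unfinished


# Hypothetical rows for illustration only.
rows = [
    {"name": "kernel_a", "ts": 1000, "dur": 250},
    {"name": "kernel_b", "ts": 1300, "dur": -1},
]
finished, unfinished = split_slices(rows)
for row in unfinished:
    print(f"{row['name']} started at {row['ts']} but did not end")
```

Returning `None` instead of computing `ts + dur` keeps an unfinished slice from silently producing a bogus end time in later analysis.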

I have attached a screenshot of the behavior. In it you should see the "samples [omnitrace]" slice continue and become white at the end/right and also near the bottom left you can see that the "Duration" says "Did not end."

Thank you.

[Screenshot: DidNotEnd]

dwchang79 avatar Oct 02 '23 15:10 dwchang79

Likely fixed by buffer flushing fix in #317

jrmadsen avatar Jan 10 '24 13:01 jrmadsen

Please re-open if https://github.com/AMDResearch/omnitrace/pull/317 does not fix this issue in the upcoming release

jrmadsen avatar Jan 11 '24 01:01 jrmadsen

Unfortunately, it seems this bug is still in Omnitrace. When I try to load LLaMa2's perfetto file, all of the kernels do not end.

dwchang79 avatar Jan 23 '24 23:01 dwchang79

Also I can't seem to reopen this issue. Usually the re-open button is on the bottom and I do not see it.

dwchang79 avatar Jan 23 '24 23:01 dwchang79

How big is the perfetto file? Could you be hitting the data limit? I ask because it’s strange that the samples stop showing up. Samples are not inserted into perfetto until finalization, but GPU kernels are, so it would be strange (but maybe not impossible) for samples to cause the data limit to be hit and cause perfetto to drop the rest of the records.

jrmadsen avatar Jan 23 '24 23:01 jrmadsen

Wait, do the GPU kernels not end or do the samples not end? Because if it’s just the samples, then that really seems like a data limit issue.

jrmadsen avatar Jan 23 '24 23:01 jrmadsen

If the size of the perfetto buffer is the issue, you can either increase it (I think the default is maybe 2 GB) or you can disable Perfetto annotations (which will reduce the amount of data sent to Perfetto, sometimes very significantly)

jrmadsen avatar Jan 23 '24 23:01 jrmadsen

To answer the questions:

  1. The Perfetto file is ~900 MB so it does not open in the UI and I have to open it in the Desktop version.

  2. The GPU kernels also do not end. The top samples (or sometimes the main function) have that white at the end, but when I go down into the actual kernels being launched, the reason for that shade of white is that the last kernel(s) do not end. Each has a start time but no end time.

  3. Is there a way to increase the size (as you suggested) using the Perfetto web UI?

Thank you.

dwchang79 avatar Jan 24 '24 00:01 dwchang79

I just checked and it looks like the default buffer limit is ~1 GB, so it sounds like you may be hitting it.

No, the buffer size has nothing to do with the web UI. There is nothing you can do about any existing perfetto files. You need to recollect data with OMNITRACE_PERFETTO_BUFFER_SIZE_KB set to a larger value and/or set OMNITRACE_PERFETTO_ANNOTATIONS to OFF
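As a rough sketch, the two settings named above could be prepared for a recollection run like this. Only the variable names `OMNITRACE_PERFETTO_BUFFER_SIZE_KB` and `OMNITRACE_PERFETTO_ANNOTATIONS` and the value `OFF` come from this thread; the helper `perfetto_env`, its parameters, and the GiB-to-KB conversion are assumptions for illustration.

```python
import os
from typing import Dict, Mapping, Optional


def perfetto_env(
    buffer_gib: float,
    disable_annotations: bool = False,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build an environment for a recollection run (hypothetical helper)."""
    # Start from the given base, or from the current process environment.
    env = dict(os.environ if base is None else base)
    # The variable is expressed in KB, so convert GiB -> KB (assumes 1 KB = 1024 bytes).
    env["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"] = str(int(buffer_gib * 1024 * 1024))
    if disable_annotations:
        # "OFF" is the value suggested in the comment above.
        env["OMNITRACE_PERFETTO_ANNOTATIONS"] = "OFF"
    return env


# Example: request a ~4 GB buffer and turn annotations off.
env = perfetto_env(buffer_gib=4, disable_annotations=True, base={})
print(env["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` together with whatever command you normally use to launch the profiled application.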

jrmadsen avatar Jan 24 '24 09:01 jrmadsen

We increased the buffer size to ~4 GB and unfortunately, the problem still persists. I've attached a screenshot showing the problem.

[Screenshot: perfetto_fp16_4GBbuff]

dwchang79 avatar Jan 24 '24 22:01 dwchang79

@dwchang79 Internal ticket has been created to further investigate your issue. Thanks!

ppanchad-amd avatar Oct 07 '24 17:10 ppanchad-amd

Hi @dwchang79, are you still experiencing this issue? If so, do you have a simple way to reproduce it?

schung-amd avatar Oct 08 '24 13:10 schung-amd

I am no longer at AMD (was on Sabbatical there as a Visiting Scholar), but I believe it is still an issue.

dwchang79 avatar Oct 08 '24 16:10 dwchang79

Thanks for the reply! Do you recall any details about when these issues occurred? Did they only occur for a specific workload? Did you see this consistently?

schung-amd avatar Oct 10 '24 14:10 schung-amd

When I first reported it, I was running CoralGEMM (don't remember if DGEMM or SGEMM), but later on it was LLaMa-2. And yes, I would see it consistently every run.

Thank you.

dwchang79 avatar Oct 10 '24 18:10 dwchang79

Closing for now as I can't seem to reproduce this. If we see future occurrences of this issue I'll circle back here to update.

schung-amd avatar Nov 14 '24 21:11 schung-amd