Slice has duration of "Did not end."
Some of my omnitrace proto files "did not end" according to Perfetto. For my work, I am trying to find the end time for certain kernels and when they do not end, it leads to a -1 run time. I heard this is a known issue, but I was told to formally submit this bug (and my other 2) so that they can be properly tracked.
I have attached a screenshot of the behavior. In it you should see the "samples [omnitrace]" slice continue and become white at the end/right and also near the bottom left you can see that the "Duration" says "Did not end."
Thank you.
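For anyone post-processing traces affected by this, the -1 run time can at least be caught before it skews kernel timings. The following is only a sketch: it assumes the slices have already been extracted as (name, start, duration) rows, for example from the Perfetto trace processor's slice table, where an unfinished slice is reported with a duration of -1. The function name and the sample kernel names are hypothetical.

```python
def split_slices(rows):
    """Separate finished slices from slices that never ended.

    rows: iterable of (name, ts_ns, dur_ns) tuples (assumed layout).
    A negative duration is treated as "Did not end".
    """
    finished, unfinished = [], []
    for name, ts, dur in rows:
        if dur < 0:
            # No end time is known; keep only the name and start time.
            unfinished.append((name, ts))
        else:
            # End time is start + duration.
            finished.append((name, ts, ts + dur))
    return finished, unfinished


# Hypothetical sample rows: one complete kernel, one that did not end.
rows = [("gemm_kernel", 1000, 250), ("softmax_kernel", 1300, -1)]
finished, unfinished = split_slices(rows)
print(finished)    # slices with a usable end time
print(unfinished)  # slices to exclude from run-time statistics
```

This does not recover the missing end times; it only keeps unfinished slices out of the run-time calculations.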
Likely fixed by the buffer-flushing fix in #317.
Please re-open if https://github.com/AMDResearch/omnitrace/pull/317 does not fix this issue in the upcoming release.
Unfortunately, it seems this bug is still in Omnitrace. When I load LLaMa2's Perfetto file, none of the kernels end.
Also I can't seem to reopen this issue. Usually the re-open button is on the bottom and I do not see it.
How big is the Perfetto file? Could you be hitting the data limit? I ask because it's strange that the samples stop showing up. Samples are not inserted into Perfetto until finalization, but GPU kernels are inserted before then, so it would be strange (but maybe not impossible) for samples to cause the data limit to be hit and cause Perfetto to drop the rest of the records.
Wait, do the GPU kernels not end or do the samples not end? If it's just the samples, then that really seems like a data limit issue.
If the size of the perfetto buffer is the issue, you can either increase it (I think the default is maybe 2 GB) or you can disable Perfetto annotations (which will reduce the amount of data sent to Perfetto, sometimes very significantly)
To answer the questions:
- The Perfetto file is ~900 MB, so it does not open in the UI and I have to open it in the Desktop version.
- The GPU kernels also do not end. The top samples (or sometimes the main function) have that white at the end, but when I go down into the actual kernels being launched, the reason for that shade of white is that the last kernel(s) do not end. They have a start time but no end time.
- Is there a way to increase the size (as you suggested) using the Perfetto web UI?
Thank you.
I just checked and it looks like the default buffer limit is ~1 GB, so it sounds like you may be hitting it.
No, the buffer size has nothing to do with the web UI. There is nothing you can do about any existing perfetto files. You need to recollect data with OMNITRACE_PERFETTO_BUFFER_SIZE_KB set to a larger value and/or set OMNITRACE_PERFETTO_ANNOTATIONS to OFF
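As a rough illustration of the re-collection settings above, the sketch below builds an environment with a larger buffer and, optionally, annotations turned off. The two variable names come from this thread; the helper name omnitrace_env and the GB-to-KB conversion using 1024 * 1024 are assumptions made for the example, and the launch command is deliberately left out because it depends on how you run Omnitrace.

```python
import os


def omnitrace_env(buffer_size_gb, annotations=True, base_env=None):
    """Return a copy of the environment with the Perfetto settings applied.

    Hypothetical helper: only the two variable names are from the thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The variable is expressed in KB; assume 1 GB = 1024 * 1024 KB.
    env["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"] = str(int(buffer_size_gb * 1024 * 1024))
    if not annotations:
        # Disabling annotations reduces the data sent to Perfetto.
        env["OMNITRACE_PERFETTO_ANNOTATIONS"] = "OFF"
    return env


# Example: a ~4 GB buffer with annotations disabled.
env = omnitrace_env(4, annotations=False, base_env={})
print(env["OMNITRACE_PERFETTO_BUFFER_SIZE_KB"])
print(env["OMNITRACE_PERFETTO_ANNOTATIONS"])
```

The resulting dictionary can be passed as the env argument of subprocess.run for whatever command you normally use to collect the trace.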
We increased the buffer size to ~4 GB and unfortunately, the problem still persists. I've attached a screenshot showing the problem.
@dwchang79 Internal ticket has been created to further investigate your issue. Thanks!
Hi @dwchang79, are you still experiencing this issue? If so, do you have a simple way to reproduce it?
I am no longer at AMD (was on Sabbatical there as a Visiting Scholar), but I believe it is still an issue.
Thanks for the reply! Do you recall any details about when these issues occurred? Did they only occur for a specific workload? Did you see this consistently?
When I first reported it, I was running CoralGEMM (don't remember if DGEMM or SGEMM), but later on it was LLaMa-2. And yes, I would see it consistently every run.
Thank you.
Closing for now as I can't seem to reproduce this. If we see future occurrences of this issue, I'll circle back here with an update.