tracy icon indicating copy to clipboard operation
tracy copied to clipboard

Group GPU child zones by name.

Open benvanik opened this issue 2 years ago • 9 comments

This gives a more intuitive grouping in this UI pane than the uniqued source locations which aren't fully displayed - people were thinking the UI pane was broken as identically named zones weren't grouping when the checkbox was ticked.

Previous: image

Now: image

benvanik avatar Sep 15 '21 23:09 benvanik

What is an unique source location in your use case? Is this duplication some artifact of optimization/inlining?

wolfpld avatar Sep 15 '21 23:09 wolfpld

These are GPU locations that use this approach: https://github.com/wolfpld/tracy/blob/master/TracyVulkan.hpp#L401 Because that allocs a new source location per use each srcloc ends up as a unique ID in the UI - so every single zone is uniqued. All of them have the same exact inputs (line/source/function/name) - I think a better fix would be if the tracy profiler deduplicated them instead (as ​then all parts of the UI would work correctly), but this is broken today and this felt like a good babystep to unblock being able to use tracy for GPU rollups :)

benvanik avatar Sep 15 '21 23:09 benvanik

(more to your question: we are running dynamic content that can't be compiled into the application and use the static source locations that are uniqued in the application using the static SourceLocationData)

benvanik avatar Sep 15 '21 23:09 benvanik

Because that allocs a new source location per use each srcloc ends up as a unique ID in the UI - so every single zone is uniqued. All of them have the same exact inputs (line/source/function/name) - I think a better fix would be if the tracy profiler deduplicated them instead (as ​then all parts of the UI would work correctly)

This should already be deduplicated. The GpuZoneBeginAllocSrcLocSerial message is being processed at https://github.com/wolfpld/tracy/blob/058e8901/client/TracyProfiler.cpp#L2364-L2371, where the source location payload is transferred within the SendSourceLocationPayload function (https://github.com/wolfpld/tracy/blob/058e8901/client/TracyProfiler.cpp#L2522-L2541). This data is eventually processed at https://github.com/wolfpld/tracy/blob/058e8901/server/TracyWorker.cpp#L3731-L3784, where a temporary source location structure is created, which is then used to search if an entry in the sourceLocationPayloadMap map already exists. If it does, the already existing source location is used. (The StoreString functions also deduplicate.)

It would be interesting to find out why and where this breaks.

wolfpld avatar Sep 16 '21 09:09 wolfpld

Awesome! I'll debug through that!

benvanik avatar Sep 16 '21 14:09 benvanik

Debugging through loading the file, I see TracyWorker.cpp 744 loading 1303 entries into sourceLocationPayload with 81 deduplicated entries in sourceLocationPayloadMap. So it looks like it is deduplicating there correctly on load, but the fact that there are 1303 payloads makes me think it's not deduplicating on capture correctly.

I believe the issue causing the behavior I'm seeing above in the zone view is that though that deduplication happens on load the GpuEvents cpuStart_srcloc is still pointing at the raw sourceLocationPayload index from the file - this means that the view code that's strictly using that index for its cmap will still be acting in the unduplicated domain.

The source locations I'm seeing are negative, probably coming from m_pendingSourceLocationPayload on the capture side. Debugging into that will be trickier (ugh, Android), but maybe the above triggers something in your mind? In particular, I'm wondering if namehash may be causing issues with the AddSourceLocationPayload deduplicating.

One particularly interesting thing I've observed is that this trace I'm debugging through was captured using the capture tool against a process running on an Android device, but if I capture a trace by directly attaching to a process on my Windows machine things work fine! Debugging through the same sourceLocationPayload loading code in a saved trace I see that both sourceLocationPayload and sourceLocationPayloadMap have the same number of entries (31 in my particular run) - indicating that the worker did deduplicate the payloads correctly.

Any thoughts as to why Android + capture tool would not deduplicate while Windows + profiler UI would? (I'm not sure where best to look at next)

Attached is the trace showing the unduplicated sourceLocationPayloads in case it's helpful: MobileNetV3Small-[fp32,imagenet]-(TensorFlow)-full-inference-with-IREE-Vulkan-@-SM-G980F-(GPU-Mali-G77).zip

(and thanks for the help!)

benvanik avatar Sep 16 '21 15:09 benvanik

So it looks like it is deduplicating there correctly on load

There is no deduplication at load time. Everything is done at capture time.

the GpuEvents cpuStart_srcloc is still pointing at the raw sourceLocationPayload index from the file - this means that the view code that's strictly using that index for its cmap will still be acting in the unduplicated domain.

This is the intended behavior.

The source locations I'm seeing are negative, probably coming from m_pendingSourceLocationPayload on the capture side.

This is because there are two ways of providing source location data, as you can see in the source location retrieval code:

const SourceLocation& Worker::GetSourceLocation( int16_t srcloc ) const
{
    if( srcloc < 0 )
    {
        return *m_data.sourceLocationPayload[-srcloc-1];
    }
    else
    {
        return m_data.sourceLocation.find( m_data.sourceLocationExpand[srcloc] )->second;
    }
}

Positive srcloc entries are decompressed (int16 -> uint64 LUT) to a pointer to the source location data structure, as was present in the profiled program. This pointer is then used to lookup the retrieved data stored in sourceLocation map.

Negative srcloc entries do not have a corresponding pointer in the profiled program memory, as they are allocated dynamically (and deduplicated by the methods referenced earlier), and have to be handled in a different manner.

You can see that the loading code just populates the sourceLocation, sourceLocationExpand and sourceLocationPayload lookup tables, without any further processing, as it's not needed.

In particular, I'm wondering if namehash may be causing issues with the AddSourceLocationPayload deduplicating.

namehash is used only as a hacky cache (mutable in a const struct...) for name string hash, used to provide zone color in certain configurations (View::GetRawSrcLocColor). The SourceLocationHasher and SourceLocationComparator functors, used for deduplication, operate on SourceLocationBase, which does not include the namehash field.

Any thoughts as to why Android + capture tool would not deduplicate while Windows + profiler UI would? (I'm not sure where best to look at next)

I don't think Android is a factor here, the allocated source location transfer format is rather simple, and not platform specific:

    // Allocated source location data layout:
    //  2b  payload size
    //  4b  color
    //  4b  source line
    //  fsz function name
    //  1b  null terminator
    //  ssz source file name
    //  1b  null terminator
    //  nsz zone name (optional)

Furthermore, both _36 source locations do have identical data inside:

obraz obraz

Profiler UI and the capture utility differ mainly by using or not the TRACY_NO_STATISTICS macro, which disables calculation of additional derivative data which may be useful in the UI, but which would just waste CPU cycles otherwise.

wolfpld avatar Sep 16 '21 19:09 wolfpld

Ok cool - that all makes sense. I apologize for being a bit dense, but why would if they have duplicate data and everything is working as intended are they not deduplicated and shown as unique entries in the GPU zone listing? (I don't know where to look next :)

benvanik avatar Sep 16 '21 20:09 benvanik

I have no idea. The best course of action would be to try to do a repro case and then debug the server part to find out what exactly happens in the deduplication code and why it is failing.

wolfpld avatar Sep 16 '21 20:09 wolfpld