Tracy does not stop captures right away

Open Lectem opened this issue 2 years ago • 4 comments

When capturing with TRACY_ON_DEMAND, I would expect Tracy to stop recording data when the server sends a "Stop" command (that is, when I click the "Stop" button).

However, in my case, if I start my application with the server connected at boot and press the "Stop" button after 12 s, Tracy still captures zone information for a while (more than 2 minutes; sometimes it doesn't stop at all). I think this is because it is trying to catch up with the query backlog, but new information (zones) keeps being added, as can be seen in the following screenshot:

One can see that stack sampling did stop at ~12 s, but zones are still being recorded. If the server is too slow (note that right now both the client and the server are running on my local machine), then Tracy might keep recording new data and quickly eat a lot of memory.

At first I thought this was because the client still considers the server to be connected, as it performs the following check:

#ifdef TRACY_ON_DEMAND
        if( !GetProfiler().IsConnected() ) return;
#endif

However, I later realized that even when the server has finally stabilized and really stopped collecting data, it is still considered connected, so something else seems to be the issue here.

Is IsConnected() the right thing to check (since the server is still connected while retrieving data after clicking stop)?

From what I understand, clicking stop sends a ServerQueryDisconnect. It seems that the issue is that (in my case) this query is itself queued, while Profiler::HandleServerQuery seems to receive only ServerQueryCallstackFrame queries (which are slow to answer, and either keep getting sent by the server or there are just way too many of them). In any case, as long as ServerQueryDisconnect has not been received, the profiler keeps generating more data, which the server keeps pulling.

Do we need some kind of special query that disables zones and bypasses the queue? Is that even possible, or a good idea? Or do you have a better idea?
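To make the question concrete, here is a minimal sketch of what such a gate could look like on the client side. It assumes a hypothetical out-of-band "stop capturing" request that flips a flag separate from the connection state; ClientGate, OnStopCapture and TryEmitZone are made-up names for illustration and are not part of Tracy.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not Tracy's client code: the "capturing" flag is kept
// separate from the "connected" flag, so zones can be turned off while the
// connection stays open to answer the queries that are still queued.
class ClientGate
{
public:
    void OnConnect()
    {
        m_connected.store( true, std::memory_order_release );
        m_capturing.store( true, std::memory_order_release );
    }

    // Assumed priority request that bypasses the regular query queue.
    void OnStopCapture() { m_capturing.store( false, std::memory_order_release ); }

    void OnDisconnect()
    {
        m_capturing.store( false, std::memory_order_release );
        m_connected.store( false, std::memory_order_release );
    }

    bool IsConnected() const { return m_connected.load( std::memory_order_acquire ); }
    bool IsCapturing() const { return m_capturing.load( std::memory_order_acquire ); }

    // Stand-in for a zone macro: records a zone only while capturing.
    bool TryEmitZone()
    {
        if( !IsCapturing() ) return false;
        m_zones.fetch_add( 1, std::memory_order_relaxed );
        return true;
    }

    uint64_t ZoneCount() const { return m_zones.load( std::memory_order_relaxed ); }

private:
    std::atomic<bool> m_connected { false };
    std::atomic<bool> m_capturing { false };
    std::atomic<uint64_t> m_zones { 0 };
};
```

With this split, a zone check would test IsCapturing() rather than IsConnected(), so new zones stop as soon as the stop request arrives, even if the disconnect itself is still waiting behind slow queries.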

Dec 08 '21 22:12 Lectem

Workaround available in 9b624049.

> It seems that the issue is that (in my case) this query itself is queued, but Profiler::HandleServerQuery seems to only receive ServerQueryCallstackFrame queries. (Which are slow to answer, and either keep getting sent by the server or there are just way too many.)

Recently call stack resolution was moved to a separate thread and this is not handled properly in the disconnect data handling logic.

> Do we need some kind of special query that disables zones and bypasses the queue? Is that even possible or a good idea?

This is probably also a factor here. The solution is rather simple, as the send buffer size calculation logic already takes priority messages into account:

serverQuerySpaceLeft = std::min( ( m_sock.GetSendBufSize() / ServerQueryPacketSize ), 8*1024 ) - 4;   // leave space for terminate request

So, for example, the terminate request can skip the queue and be sent directly.

void Worker::QueryTerminate()
{
    ServerQueryPacket query { ServerQueryTerminate, 0, 0 };
    m_sock.Send( &query, ServerQueryPacketSize );
}
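As a rough illustration of the idea above, the following sketch works through the budget arithmetic and models a sender whose priority packets skip the pending queue. It is a simplified model under stated assumptions, not the actual Worker code: QuerySlots, QuerySender and the Query enum are hypothetical names, and the packet size is passed in as a parameter rather than taken from Tracy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

constexpr int ReservedSlots = 4;      // headroom kept free for priority requests
constexpr int MaxSlots = 8 * 1024;    // upper bound on in-flight queries

// Same arithmetic as the serverQuerySpaceLeft line, with the sizes as inputs.
int QuerySlots( int sendBufSize, int packetSize )
{
    return std::min( sendBufSize / packetSize, MaxSlots ) - ReservedSlots;
}

enum class Query { CallstackFrame, Terminate };

// Hypothetical model of the server-side sender.
class QuerySender
{
public:
    explicit QuerySender( int slots ) : m_spaceLeft( slots ) {}

    // Ordinary queries use the budget; once it is spent they wait in a queue.
    void Send( Query q )
    {
        if( m_spaceLeft > 0 )
        {
            --m_spaceLeft;
            m_wire.push_back( q );
        }
        else
        {
            m_pending.push_back( q );
        }
    }

    // Priority queries go straight to the wire, using the reserved headroom.
    void SendPriority( Query q ) { m_wire.push_back( q ); }

    const std::vector<Query>& Wire() const { return m_wire; }
    std::size_t Pending() const { return m_pending.size(); }

private:
    int m_spaceLeft;
    std::deque<Query> m_pending;
    std::vector<Query> m_wire;
};
```

For instance, with an assumed 64 KiB send buffer and an assumed 13-byte packet, QuerySlots( 65536, 13 ) gives min( 5041, 8192 ) - 4 = 5037 ordinary slots, and the 4 slots left over are what lets a terminate request be written without waiting for the queue to drain.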

Dec 11 '21 12:12 wolfpld

The workaround does help stop the capture right away, thanks! However (as expected), a lot of string names will be missing, and the server will display ??? in their place.

This was discussed a bit on Discord, not all strings will get resolved under load (which somehow happens really easily at application startup due to stack sampling and resolution being slower than the amount of data generated by the application).

Here is the scenario that causes the issue, and what I could debug:

1. The server queries information about callstacks and zones:
   - The server queries zone X information (pointers to strings).
   - The server queues a lot more queries, including slow ones.
2. The server queues ServerQueryDisconnect as the user clicks "Stop".
3. The server receives the zone X information late, because of the huge number of queries and the slow stack resolution queries. Since ServerQueryDisconnect was already sent, the server does not ask for the zone X strings (name, location, ...), even though the zone query may have been issued well before the user clicked "Stop".

So there are two issues at hand here:

1. We should tell the client to stop sending data about new zones as soon as possible, to avoid overloading the queue.
2. The server may need to have 3 states (connected / closing connection / stopped) instead of simply connected / stopped. I think this is more or less linked to the TRACY_ON_DEMAND mode only. While in the closing connection state, the server would still continue to query any remaining string / stack resolution for zones/callstacks that arrived before the user clicks stop.
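A minimal sketch of the 3-state idea, under simplifying assumptions: CaptureSession, QueryZone, RequestStop and OnZoneInfo are hypothetical names (nothing here is Tracy's server code), and a follow-up string query is only counted rather than actually sent.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

enum class CaptureState { Connected, Closing, Stopped };

// Hypothetical server-side session tracking zone queries that are in flight.
class CaptureSession
{
public:
    // New zone queries are only issued while fully connected.
    bool QueryZone( uint64_t zoneId )
    {
        if( m_state != CaptureState::Connected ) return false;
        m_outstanding.insert( zoneId );
        return true;
    }

    // User clicked "Stop": keep draining if replies are still expected.
    void RequestStop()
    {
        if( m_state != CaptureState::Connected ) return;
        m_state = m_outstanding.empty() ? CaptureState::Stopped : CaptureState::Closing;
    }

    // A late reply for a zone queried before "Stop" still gets its strings
    // resolved; anything else is ignored.
    bool OnZoneInfo( uint64_t zoneId )
    {
        if( m_state == CaptureState::Stopped ) return false;
        if( m_outstanding.erase( zoneId ) == 0 ) return false;
        ++m_stringQueries;   // stands in for "ask for name, location, ..."
        if( m_state == CaptureState::Closing && m_outstanding.empty() )
        {
            m_state = CaptureState::Stopped;
        }
        return true;
    }

    CaptureState State() const { return m_state; }
    int StringQueries() const { return m_stringQueries; }

private:
    CaptureState m_state = CaptureState::Connected;
    std::unordered_set<uint64_t> m_outstanding;
    int m_stringQueries = 0;
};
```

In this model the session only reaches Stopped once every zone queried before the stop request has been answered, which is what would avoid the ??? strings in the scenario above.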

Dec 13 '21 08:12 Lectem

> not all strings will get resolved under load (which somehow happens really easily at application startup due to stack sampling and resolution being slower than the amount of data generated by the application).

This should be fixed on master.

Apr 06 '22 10:04 wolfpld

> > not all strings will get resolved under load (which somehow happens really easily at application startup due to stack sampling and resolution being slower than the amount of data generated by the application).
>
> This should be fixed on master.

It seems that it is now indeed resolving zones before callstacks!

Apr 11 '22 11:04 Lectem
