
Should I Ignore k_ESteamNetworkingSocketsDebugOutputType_Bug?


I ran the example chat program with many client connections at the same time, and when I terminated all the clients at once the server was failing asserts, with debug output of type k_ESteamNetworkingSocketsDebugOutputType_Bug.

AssertMsg1( false, "Processed thinkers %d times -- probably one thinker keeps requesting an immediate wakeup call.", nIterations );

AssertMsg( len( m_vecPendingCallbacks ) < 100, "Callbacks backing up and not being checked. Need to check them more frequently!" );

These two asserts were failing for me. Do I need to ignore them, or should I terminate the server? Can I fix these issues?
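For context on where these messages come from: they are delivered through whatever callback is registered with SteamNetworkingUtils()->SetDebugOutputFunction(). Below is a minimal sketch of a handler that singles out the _Bug type; the handler name and the abort-on-bug policy are illustrative choices, not anything the library requires.

```cpp
#include <cstdio>
#include <cstdlib>

#include <steam/isteamnetworkingutils.h>

// Hypothetical handler: log every message and treat _Bug output as fatal
// instead of ignoring it.  The abort() policy is illustrative only.
static void MyDebugOutput( ESteamNetworkingSocketsDebugOutputType eType, const char *pszMsg )
{
    fprintf( stderr, "[GNS %d] %s\n", (int)eType, pszMsg );
    if ( eType == k_ESteamNetworkingSocketsDebugOutputType_Bug )
    {
        // A _Bug message means the library detected an internal invariant
        // violation; it usually deserves a report upstream, not silence.
        fflush( stderr );
        abort();
    }
}

// During startup, before creating listen sockets or connections:
//   SteamNetworkingUtils()->SetDebugOutputFunction(
//       k_ESteamNetworkingSocketsDebugOutputType_Msg, MyDebugOutput );
```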

Kodpact avatar Sep 05 '21 11:09 Kodpact

If you can reproduce it, can you debug it? Just put a breakpoint on that assert. Then, instead of breaking the thinker loop, set next statement and let the straggler try to think one more time, and see who it is and what they are doing.

zpostfacto avatar Sep 05 '21 23:09 zpostfacto

(And BTW, that assert is probably an indication of a bug that is worth tracking down and fixing. If you are able to reproduce it and can give me some info, I'd appreciate it.)

zpostfacto avatar Sep 06 '21 00:09 zpostfacto

I am using the latest release version (1.30); later updates may have fixed this. Note that I am connecting about 400 clients to the server (in ~10 seconds) and then killing all of them, which causes those assert failures. (Quick off-topic question: I was able to connect only 63 clients per process. I'm on Windows; is that because of WaitForMultipleObjects, which has a limit of MAXIMUM_WAIT_OBJECTS = 64, or am I doing something wrong?)

The second bug message seems to be about callbacks, and it's nothing dangerous: the queue just grows and the server cannot keep up, but nothing gets corrupted. Once the backlog exceeds 100 the assertion fails. Is 100 really that big? Maybe there should be an option to set that value.

The first one seems different. It looks like the thinker enters a deadlock (this condition fails); further debugging shows that the main thread is the owner of the lock and the SteamNetworking thread cannot obtain it. [screenshot]

I don't have the full picture of what's happening, but I think the code here has a bug. Every time I paused on the breakpoint while nIterations was a big number like 6K+, the size of the queue s_queueThinkers was 1 (which looks obvious: I was connecting no more than 400 clients, so how did it reach 6K or 10K?), and it was entering the same path over and over, because the call to pNextThinker->InternalSetNextThinkTime( usecNow + 1000 ); was not removing the item from s_queueThinkers, so the size was always 1 and the loop was infinite. So this line of code has no effect, even though its purpose was apparently to stave off exactly this kind of spin.
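To make that failure mode concrete, here is a much-simplified sketch of a thinker queue and its service loop. This is not the library's actual code; the container, names, and defender logic are modeled on the asserts and identifiers quoted in this thread. The point is that the "reschedule 1ms in the future" escape hatch only breaks the spin if the rescheduled entry really ends up later than the usecNow the loop is testing against.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

typedef int64_t SteamNetworkingMicroseconds;
const SteamNetworkingMicroseconds k_nThinkTime_Never = INT64_MAX;

// Hypothetical thinker: owns a wake time and does its work in Think().
struct Thinker
{
    SteamNetworkingMicroseconds m_usecNextThink = k_nThinkTime_Never;
    virtual void Think( SteamNetworkingMicroseconds usecNow ) = 0;
    virtual ~Thinker() {}
};

// Queue ordered by next think time, earliest first.
struct CompareThinkTime
{
    bool operator()( const Thinker *a, const Thinker *b ) const
    { return a->m_usecNextThink < b->m_usecNextThink; }
};
static std::multiset<Thinker*, CompareThinkTime> s_queueThinkers;

// Service-thread step: run every thinker whose time has come.
void ProcessThinkers( SteamNetworkingMicroseconds usecNow )
{
    int nIterations = 0;
    while ( !s_queueThinkers.empty() )
    {
        Thinker *pNextThinker = *s_queueThinkers.begin();
        if ( pNextThinker->m_usecNextThink > usecNow )
            break; // nothing due yet -- the thread can go back to sleep

        if ( ++nIterations > 10000 )
        {
            // Infinite-loop defender, analogous to the AssertMsg1 quoted above
            // (a log line here so the sketch keeps running).
            fprintf( stderr, "Processed thinkers %d times -- runaway thinker?\n", nIterations );

            // "Reschedule him for 1ms in the future."  For this to break the
            // spin, the entry must be re-sorted in the queue AND its new time
            // must land beyond the usecNow tested at the top of the loop.
            s_queueThinkers.erase( s_queueThinkers.begin() );
            pNextThinker->m_usecNextThink = usecNow + 1000;
            s_queueThinkers.insert( pNextThinker );
            continue;
        }

        // Pop, mark as unscheduled, think, and requeue only if Think()
        // asked for another wakeup.
        s_queueThinkers.erase( s_queueThinkers.begin() );
        pNextThinker->m_usecNextThink = k_nThinkTime_Never;
        pNextThinker->Think( usecNow );
        if ( pNextThinker->m_usecNextThink != k_nThinkTime_Never )
            s_queueThinkers.insert( pNextThinker );
    }
}
```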

Here is the code of the chat client/server, modified to create spammer clients :/ I'm spawning 63 clients per process and connecting them to the server...

Kodpact avatar Sep 06 '21 12:09 Kodpact

The newest code only uses one wait object, regardless of the number of sockets, so the 63 socket limit should be fixed. (5b70535b71a8d33a1e883dea3f20269aebf66e5f)
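For anyone else hitting the 63-client ceiling on older builds: WaitForMultipleObjects (and WSAWaitForMultipleEvents) accept at most MAXIMUM_WAIT_OBJECTS = 64 handles per call. One way to put any number of sockets behind a single wait object, sketched below with plain Winsock and made-up names (not necessarily how the linked commit does it), is to associate every socket with one event via WSAEventSelect and then wait only on that event.

```cpp
#include <winsock2.h>
#include <vector>
#pragma comment( lib, "ws2_32.lib" )

// All sockets share ONE manual-reset event.  WSAEventSelect also switches each
// socket to non-blocking mode, so draining them after a wakeup is cheap.
// (Assumes WSAStartup() has already been called.)
static WSAEVENT s_hWakeEvent = WSA_INVALID_EVENT;
static std::vector<SOCKET> s_vecSockets;

bool AddSocketToWaitSet( SOCKET sock )
{
    if ( s_hWakeEvent == WSA_INVALID_EVENT )
        s_hWakeEvent = WSACreateEvent();
    if ( WSAEventSelect( sock, s_hWakeEvent, FD_READ ) != 0 )
        return false;
    s_vecSockets.push_back( sock );
    return true;
}

// Service loop step: a single wait object regardless of how many sockets exist.
void WaitAndServiceSockets( DWORD dwTimeoutMs )
{
    WSAWaitForMultipleEvents( 1, &s_hWakeEvent, FALSE, dwTimeoutMs, FALSE );
    WSAResetEvent( s_hWakeEvent ); // reset before draining so no wakeup is lost

    char buf[ 2048 ];
    for ( SOCKET sock : s_vecSockets )
    {
        // Sockets are non-blocking; recv fails with WSAEWOULDBLOCK as soon
        // as a socket has nothing pending.
        for (;;)
        {
            int r = recv( sock, buf, (int)sizeof( buf ), 0 );
            if ( r <= 0 )
                break;
            // ... hand the datagram to the application here ...
        }
    }
}
```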

Also, are you perhaps simulating fake lag? There was a bug with the fake lag code, but I think it might have been introduced and then fixed (d515f45d0bd6423ef142c8b973aa166ec7e03359) since 1.30.

If you reproduce a deadlock, can you see what the main thread is doing? Would love to see a call stack from the main branch.

I recently implemented a P2P stress test, because we had a partner making hundreds of P2P connections (which are more expensive than UDP connections, in general). It did something very similar to what you are doing -- made a bunch of connections at a relatively high rate and watched for any assertions or warnings about long lock times, etc. This led to some optimizations. I think most of the problems I fixed were specific to P2P code, but there might have been some general fixes. I would be interested to know if the problems you are seeing are fixed with the latest code in the main branch.

zpostfacto avatar Sep 06 '21 22:09 zpostfacto

> Also, are you perhaps simulating fake lag? There was a bug with the fake lag code, but I think it might have been introduced and then fixed (d515f45) since 1.30.

No, that's not it (I tested the latest version just to make sure...). The item-remove call on s_queueThinkers only happens when usecTargetThinkTime is set to k_nThinkTime_Never in the call to InternalSetNextThinkTime, which never happens once the program enters the deadlock; it just keeps iterating infinitely until the "infinite-loop defender" assertion, nIterations > 10000, fires.
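Restating just the scheduling half of that observation as a hypothetical sketch (same caveat as the earlier one: invented names, not the library's code): removal and rescheduling are two different outcomes of the same call, and only the k_nThinkTime_Never case takes the entry out of the queue.

```cpp
#include <cstdint>
#include <set>

typedef int64_t SteamNetworkingMicroseconds;
const SteamNetworkingMicroseconds k_nThinkTime_Never = INT64_MAX;

struct Thinker { SteamNetworkingMicroseconds m_usecNextThink = k_nThinkTime_Never; };

struct CompareThinkTime
{
    bool operator()( const Thinker *a, const Thinker *b ) const
    { return a->m_usecNextThink < b->m_usecNextThink; }
};
static std::multiset<Thinker*, CompareThinkTime> s_queueThinkers;

// Hypothetical InternalSetNextThinkTime: k_nThinkTime_Never unschedules the
// thinker; any other value keeps it queued, just re-sorted under its new time.
// If that new time is not later than the "now" the service loop is checking,
// the thinker is immediately due again, which is the spin described above.
void InternalSetNextThinkTime( Thinker *pThinker, SteamNetworkingMicroseconds usecTargetThinkTime )
{
    // Remove this exact thinker if queued (linear scan keeps the sketch simple;
    // a real implementation would track its position in the queue).
    for ( auto it = s_queueThinkers.begin(); it != s_queueThinkers.end(); ++it )
    {
        if ( *it == pThinker ) { s_queueThinkers.erase( it ); break; }
    }

    pThinker->m_usecNextThink = usecTargetThinkTime;
    if ( usecTargetThinkTime != k_nThinkTime_Never )
        s_queueThinkers.insert( pThinker ); // rescheduled, not removed
}
```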

Kodpact avatar Sep 07 '21 00:09 Kodpact

OK, thanks. Well, if you are able to tell me what the other threads are doing (call stacks), I'll bet we can get it figured out. Also, I can try to apply your changes and run your test. If you want to turn your change into a more explicit unit test (add some command line parameters, or maybe even just an #ifdef that could be compiled into a separate executable) and submit a PR, then I can reproduce it and add it to the CI configuration. I understand if that's more work than you want to put into this. But if you are interested, I'd definitely appreciate the effort. In any case, I'll try to help get this fixed if you can give me some more info.

zpostfacto avatar Sep 07 '21 17:09 zpostfacto

This is what the main thread is doing: [screenshot of the main thread's call stack]

I think you've got it wrong. The problem is not in locking/unlocking! The lock is released by the main thread after a few ms, so SteamNetworking can continue to work on it. The problem is that the socket thread is waiting like while (true) rather than like sleep(x) or "wait for the next server tick".

The comment clearly says what it should do: "Reschedule him for 1ms in the future". It just doesn't.
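For what it's worth, the "sleep(x) or wait for the next server tick" behavior described here is ordinary wait-with-timeout pacing. A self-contained sketch of that pattern follows (standard C++ only, names invented for illustration; this is not GNS code): the service thread blocks until the next scheduled think time or until another thread wakes it early.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical service-thread pacing: block until the next scheduled think
// time arrives, or until another thread wakes us because new work came in.
static std::mutex s_mutexWake;
static std::condition_variable s_condWake;
static bool s_bWakeRequested = false;

// Sleep for at most usecUntilNextThink microseconds; returns early if
// WakeServiceThread() was called in the meantime.
void ServiceThreadSleep( int64_t usecUntilNextThink )
{
    std::unique_lock<std::mutex> lock( s_mutexWake );
    s_condWake.wait_for( lock, std::chrono::microseconds( usecUntilNextThink ),
                         [] { return s_bWakeRequested; } );
    s_bWakeRequested = false;
}

// Called from other threads, e.g. when a connection needs immediate attention.
void WakeServiceThread()
{
    {
        std::lock_guard<std::mutex> lock( s_mutexWake );
        s_bWakeRequested = true;
    }
    s_condWake.notify_one();
}
```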

Kodpact avatar Sep 07 '21 19:09 Kodpact

Just now getting back to this bug. I don't suppose you have a test harness or other script I can run to reproduce the problem?

zpostfacto avatar Oct 04 '21 18:10 zpostfacto

It's not easy to reproduce. The best I could come up with is this MSVC project. Run the server from VS, then run the clients from the batch script about 10 times; after ~600 successful connections, close all the clients. Wait a few seconds for the timeout and then it should happen. (GNS is installed from vcpkg.)

project.zip

Kodpact avatar Oct 04 '21 20:10 Kodpact