at_server icon indicating copy to clipboard operation
at_server copied to clipboard

Server can lock up during a large initial monitor response

Open gkc opened this issue 3 years ago • 2 comments

Describe the bug

  • Code in MonitorVerbHandler.processVerb will hold onto the event loop until it completes. In turn this means that, if the monitor: request results in a large number of notifications being sent to the client in the initial monitor response, the server becomes completely unresponsive to other requests.
  • Code snippet from MonitorVerbHandler.processVerb:
            var receivedNotifications =
                await _getReceivedNotificationsAfterEpoch(fromEpochMillis);
            for (var notification in receivedNotifications) {
              processReceivedNotification(notification);
            }
    
  • Primary problem is that processReceivedNotification is not async, it calls no awaits, so if the receivedNotifications list is large, no other event will be handled until the for loop completes.

There are additional issues that should be cleaned up while fixing this bug

  • If the receivedNotifications list is very large, then it could cause the server process to use a lot of memory
  • This code currently first of all gets all matching notifications, then sends them. For a large number of notifications, this will introduces noticeable latency before notifications start to be sent to the client
  • Lastly - currently we first of all register the notification callback, and then start sending the matched notifications.
    • Let's consider the following scenario:
      1. Client sends monitor request
      2. Server registers notification callback
      3. Server sends matched notifications to the initial response
    • Now imagine that step 3 takes a long time, and while it is being executed, a notification arrives that should be sent to the client
    • But if we allow other events to get a look-in, then we could have a notification arrive and get sent to the client, and it will be out-of-order because the server is still sending older matching notifications
    • With the current code, this is not actually possible, since the event loop is being hogged by the initial monitor response

To Reproduce Steps to reproduce the behavior:

  1. Have @bob send 10,000 notifications to @alice
  2. Have alice client send monitor:0 to @alice secondary
  3. While the 10,000 notifications are being returned, the server will not respond to any other commands.

Expected behavior

  • MonitorVerbHandler.processVerb needs to allow another event to get a look-in. (Easy)
  • We implement the fix in such a way as to ensure that out-of-order notifications can not happen (Quite difficult)

gkc avatar Oct 02 '22 18:10 gkc

Adding @purnimavenkatasubbu to help develop a test suite for this scenario detailed in the ticket

VJag avatar Oct 03 '22 12:10 VJag

No impl progress during PR47 due to lots of other unplanned work. Moving to PR48. Reducing estimate to 5SP as can reuse several aspects of the fsync implementation.

gkc avatar Oct 17 '22 12:10 gkc