dtn7-gold icon indicating copy to clipboard operation
dtn7-gold copied to clipboard

Fix deadlock in CLA manager

Open CryptoCopter opened this issue 3 years ago • 3 comments

While using dtn7-go in a large-ish setup, we found that the daemon would regularly lock up completely. We eventually traced the problem to the interaction between the cla-manager, manager-elem and cla itself (in our case mtcp, but the issue would potentially exist for any cla with synchronisation problems).

In the end it comes down to the fact that both unbuffered channel writes and the respective Close()-methods of the manager-elem and cla are blocking. The created a situation where the manager would call the manager-elem's Close()-method and block, while the manager-elem could not react to it, since it was blocked by a channel send to the manager which would also never resolve - thus a global deadlock.

Out first-order solution to this issue was to make the channel-writes for the reporting channels non-blocking, using a select-guard. In order to prevent message loss if there is not currently someone listening on the channel, we turned the channels from unbuffered to buffered channels (sizer of the buffer configurable using the ReportChannelBuffer const in manager.go.

C&C welcome.

CryptoCopter avatar May 22 '21 18:05 CryptoCopter

First, thanks a lot for all the work you have put into this.

However, the tests are failing with your changes. I even re-run them, but for both Go 1.13 as well as Go 1.16 the following happens:

--- FAIL: TestImplNetwork (289.52s)
    --- FAIL: TestImplNetwork/WebSocket-64-1-1024 (28.06s)
        impl_test.go:190: listener received 44 messages instead of 64

oxzi avatar May 24 '21 18:05 oxzi

The error occurs only for the last testcase {"WebSocket", 64, 1, 1024, mkWsListener, mkWsClient}, when you comment that one out, the test passes (at least it did so on my machine...).

As to why this might be happening: in the test logs you can find many instances of this error message. outChnl full or no consumer. So, the bundle is received and passed to the Manager, but when it tries to write the event to its outChnl, the channel is full. This is probably due to the goroutine which is spawned as part of handleListener being unable to keep up with processing the events coming in.

If you increase the buffer size from 100 to 200, the test now passes, which also points to it being an issue of excessive load.

As to why this is happening for the test case with 64 WebSocket CLAs but not for 64 TCP CLAs, I honestly have no idea...

CryptoCopter avatar May 27 '21 16:05 CryptoCopter

Please fix/merge this ... this great library is basically useless without this patch

pulsejet avatar Aug 29 '21 12:08 pulsejet