
WireConverter fails if frame length > 1400 bytes

Open allengeorge opened this issue 11 years ago • 2 comments

Apparently the default value used for the WireConverter (1400-byte max frame size) is too low, and causes the RaftAgents to fail as follows:

WARN  [2013-11-16 17:43:23,703] io.libraft.agent.rpc.FinalUpstreamHandler: SERVER_02: caught exception - closing channel to null
! org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 1400: 1428 - discarded
! at org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:417) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:405) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:370) ~[netty-3.6.6.Final.jar:na]
! at io.libraft.agent.rpc.WireConverter$Decoder.decode(WireConverter.java:65) ~[libraft-agent/:na]
! at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90) ~[netty-3.6.6.Final.jar:na]
! at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[netty-3.6.6.Final.jar:na]
! at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [na:1.6.0_65]
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [na:1.6.0_65]
! at java.lang.Thread.run(Thread.java:695) [na:1.6.0_65]

This stack trace comes from a follower that is unable to parse a message from the leader. It's unclear to me why only one follower is affected.
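
To make the failure mode concrete, here is a minimal sketch of a length-prefixed frame check. It is not libraft's WireConverter or Netty's LengthFieldBasedFrameDecoder: the class name `FrameLimitSketch`, the 4-byte length prefix, and the exception type are all assumptions chosen for illustration. It also checks the payload length only, whereas the "adjusted" length in the trace above may count header bytes as well.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a length-prefixed frame check; not libraft or Netty code. */
final class FrameLimitSketch {
    // Assumed header layout: a 4-byte big-endian length followed by the payload.
    static final int LENGTH_FIELD_BYTES = 4;

    /** Prefixes the payload with its length. */
    static ByteBuffer encode(byte[] payload) {
        ByteBuffer frame = ByteBuffer.allocate(LENGTH_FIELD_BYTES + payload.length);
        frame.putInt(payload.length);
        frame.put(payload);
        frame.flip(); // switch the buffer from writing to reading
        return frame;
    }

    /** Reads one frame, rejecting any whose declared length exceeds maxFrameLength. */
    static byte[] decode(ByteBuffer frame, int maxFrameLength) {
        int length = frame.getInt();
        if (length > maxFrameLength) {
            // Mirrors the spirit of TooLongFrameException: the frame is refused, not truncated.
            throw new IllegalStateException(
                    "frame length exceeds " + maxFrameLength + ": " + length);
        }
        byte[] payload = new byte[length];
        frame.get(payload);
        return payload;
    }
}
```

With a limit of 1400, `decode(encode(new byte[1428]), 1400)` throws, while a 1400-byte payload decodes normally. The sender has no way to learn about the limit before the receiver rejects the frame.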

allengeorge, Nov 16 '13 17:11

This happened with only one server because I was doing a lot of testing with a cluster with 'f' failures. When SERVER_02 rejoined the cluster, the leader attempted to catch it up. Since many, many entries had to be placed into a single packet, this caused the packet size to expand well past the 1400-byte limit.

This points to a bigger (known) issue with RaftAlgorithm: it does not chunk AppendEntries into 'packet-size' chunks. This is partly because it has no idea what the serialized size of the packet is going to be. I don't think it's a problem to be solved at its level: I think it's up to the network layer to chunk it and send it out.
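
As a rough illustration of chunking at the network layer, the sketch below groups already-serialized entries into batches that each fit under a byte budget. Everything here is hypothetical: `EntryChunker`, `chunk`, and `maxBatchBytes` are illustrative names, not part of libraft, and a real implementation would also have to account for the per-message header and keep the AppendEntries prevLogIndex/prevLogTerm consistent across batches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; groups serialized log entries into size-bounded batches. */
final class EntryChunker {

    /**
     * Splits entries, in order, into batches whose combined size is at most maxBatchBytes.
     * A single entry larger than the budget cannot be split here, so it is rejected.
     */
    static List<List<byte[]>> chunk(List<byte[]> entries, int maxBatchBytes) {
        List<List<byte[]>> batches = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        int currentBytes = 0;

        for (byte[] entry : entries) {
            if (entry.length > maxBatchBytes) {
                throw new IllegalArgumentException(
                        "entry of " + entry.length + " bytes exceeds budget " + maxBatchBytes);
            }
            // Start a new batch when adding this entry would overflow the current one.
            if (!current.isEmpty() && currentBytes + entry.length > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(entry);
            currentBytes += entry.length;
        }
        if (!current.isEmpty()) {
            batches.add(current); // flush the final partial batch
        }
        return batches;
    }
}
```

Because the helper works on serialized bytes, it only makes sense below the algorithm, where the serialized size is known. That matches the point above that RaftAlgorithm itself cannot make this decision.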

allengeorge, Nov 16 '13 19:11

Currently I've mitigated this by changing the frame length to 10MB. This is a poor solution, and may point to failures in the interface design of RPCSender and RPCReceiver. Moreover, this requires a large number of copies to transfer data from one component into another, and out to the wire.

allengeorge, Nov 25 '13 18:11