seastar
allow setting buffer sizes on server_socket
We add two options to set the recv and send (SO_RCVBUF, ...) buffer sizes on a listening socket (server_socket). This is mostly useful to propagate said sizes to all sockets returned by accept().
It is already possible to set the socket option directly on the connected socket after it is returned by accept(), but experimentally this yields a socket that has the specified buffer size yet never advertises a receive window to the client beyond the default (64K for current typical kernel defaults). So you get only some of the benefit of the larger buffer.
Setting the buffer size on the listening socket, however, is mentioned as the correct approach in tcp(7) and does not suffer from the same limitation.
A test is included which checks that the mechanism, including the inheritance, works.
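The inheritance can also be checked outside Seastar with plain POSIX sockets. The following is a minimal sketch, not the test from this PR: the names inherit_result and rcvbuf_inherited_by_accept are hypothetical, and it assumes a Linux host with a usable loopback interface. It sets SO_RCVBUF on a listening socket, connects to it, and reports the value that getsockopt() returns for both the listener and the accepted socket.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>

// Hypothetical helper type: SO_RCVBUF as reported by getsockopt() for each socket.
struct inherit_result {
    int listener = -1;
    int accepted = -1;
};

// Returns the SO_RCVBUF value reported for fd, or -1 on error.
static int get_rcvbuf(int fd) {
    int val = 0;
    socklen_t len = sizeof(val);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len) < 0) {
        return -1;
    }
    return val;
}

// Sets SO_RCVBUF on a loopback listening socket *before* any connection
// arrives, then accepts one connection and reports both buffer sizes.
inherit_result rcvbuf_inherited_by_accept(int requested) {
    assert(requested > 0);
    inherit_result r;

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) {
        return r;
    }
    // Note: Linux doubles the requested value and clamps it to
    // net.core.rmem_max, so the reported size may differ from `requested`.
    setsockopt(lfd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port
    socklen_t alen = sizeof(addr);
    if (bind(lfd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, reinterpret_cast<sockaddr*>(&addr), &alen) < 0) {
        close(lfd);
        return r;
    }

    int cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 ||
        connect(cfd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        if (cfd >= 0) {
            close(cfd);
        }
        close(lfd);
        return r;
    }

    int afd = accept(lfd, nullptr, nullptr);
    r.listener = get_rcvbuf(lfd);
    if (afd >= 0) {
        r.accepted = get_rcvbuf(afd);
        close(afd);
    }
    close(cfd);
    close(lfd);
    return r;
}
```

Calling rcvbuf_inherited_by_accept(32768) and comparing the two fields shows whether the accepted socket picked up the listener's setting. This only checks the buffer size itself, not the receive window that is advertised on the wire.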
This was discovered while investigating very poor throughput between Redpanda and a remote client with an RTT of ~250 ms: the transfer is receive-window limited and benefits from buffers > 1 MB, but using such buffers (we set the configured buffer size on the connected_socket immediately after connection) had no effect, even though ss showed that the change had taken effect. The problem was that setting the size "too late" prevents the receiving side from advertising the larger size in its receive window. Arguably a kernel flaw? The window scale was high enough (128x) even when the size was set this way, so it was not what kept the window from growing, although that is often given as the reason why "late setting" the recv buffer size does not work.
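To put rough numbers on "receive window limited": a sender can have at most one advertised window in flight per round trip, so throughput is bounded by window / RTT. The sketch below works through that arithmetic. It assumes the ~250 RTT figure is in milliseconds and reads "128x" as a window-scale shift of 7; the function names are hypothetical.

```cpp
#include <cstdint>

// Largest window that can be advertised with a given window-scale shift:
// the 16-bit TCP window field shifted left by `shift` bits.
constexpr std::uint64_t max_advertised_window(unsigned shift) {
    return std::uint64_t{65535} << shift;
}

// Upper bound on throughput in bytes per second when at most
// `window_bytes` can be in flight during each round trip.
constexpr double window_limited_throughput(std::uint64_t window_bytes,
                                           double rtt_seconds) {
    return static_cast<double>(window_bytes) / rtt_seconds;
}

// A 128x scale (shift 7) allows windows of roughly 8 MiB, so the scale
// factor itself leaves room for a window well above 1 MB.
static_assert(max_advertised_window(7) == 8388480, "65535 * 128");

// Assumed 250 ms RTT: a 64 KiB window caps throughput at 256 KiB/s,
// while a 1 MiB window raises the cap to 4 MiB/s.
static_assert(window_limited_throughput(64 * 1024, 0.25) == 262144.0, "");
static_assert(window_limited_throughput(1024 * 1024, 0.25) == 4194304.0, "");
```

Under these assumptions, a window stuck at the 64K default limits the transfer to about 2 Mbit/s regardless of link capacity, which is consistent with the poor throughput described above.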
@nyh wrote:
Looks good, I just have a comment about a typo, and one nitpick about the test: I think it's better to test with different send and receive buffer sizes, to make sure you don't have copy-pasto bugs in the implementation (e.g., setting send buffer size from the receive buffer parameter).
Thanks for the feedback: I believe I've addressed all of it in a force-push:
https://github.com/scylladb/seastar/commit/5b33a115045eeb037ff41db9c22c213e45808508
I also added a final test case which tests the "nullopt" sizes for both buffers: it is hard to validate that it does the right thing, but I at least check that the discovered size falls in the expected range.
@nyh - all feedback addressed, please have a look when you have a chance, thanks!