Socket NIF recv buffer allocation issue

Open zzydxm opened this issue 2 years ago • 17 comments

Describe the bug socket:recv(Socket, 0) returns a sub-binary of the recv buffer. This means that even if only a small amount of data (>64 bytes) is received, the recv buffer is not freed until the sub-binary is freed. A reasonable number of such sub-binaries can easily OOM the node.

Another related issue is that prim_socket_nif.c:8727, ESOCK_ASSERT( ALLOC_BIN(bufSz, &buf) );, does an esock_abort core dump when it fails to allocate the binary (e.g. when the VSZ limit is reached), instead of the crashdump that BEAM produces when it cannot allocate a binary. It would be very helpful if the socket NIF could trigger a crashdump instead of just core dumping.

To Reproduce Note that this test uses 50 GB of VSZ.

terminal 1:

{ok, ListenSocket} = socket:open(inet6, stream, tcp).
ok = socket:bind(ListenSocket, #{
    family => inet6,
    port => 33334,
    addr => {0,0,0,0,0,0,0,1}
}).
ok = socket:listen(ListenSocket).
{ok, Socket} = socket:accept(ListenSocket).
socket:setopt(Socket, socket, rcvbuf, 10485760).
socket:setopt(Socket, otp, rcvbuf, 10485760).
A = [begin timer:sleep(100), socket:recv(Socket, 0, nowait) end || _ <- lists:seq(1, 5000)].
erlang:memory().

terminal 2:

{ok, Socket} = socket:open(inet6, stream, tcp).
ok = socket:connect(Socket, #{
    family => inet6,
    port => 33334,
    addr => {0,0,0,0,0,0,0,1}
}).
[begin timer:sleep(100), socket:send(Socket, crypto:strong_rand_bytes(65)) end || _ <- lists:seq(1, 5000)].

Expected behavior

  1. Memory used by the binary returned by socket:recv(Socket, 0) should be on the order of the binary size, not the {otp,rcvbuf} size.
  2. It would be better for the socket NIF to produce a crashdump instead of a core dump when OOM-ing.

Affected versions OTP 25

Additional context To solve the binary size issue, we could return a sub-binary if bytes received > (rcvbuf size) / 2, but copy into a new binary if bytes received < (rcvbuf size) / 2.
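
A purely illustrative user-level sketch of the same idea (the recv_copy/2 wrapper and the RcvBufSize argument are made up, not part of the socket API): copy the result into a fresh binary when it is much smaller than the configured receive buffer, so the rcvbuf-sized backing binary can be garbage collected:

%% Hypothetical workaround wrapper, not the proposed NIF change itself.
recv_copy(Socket, RcvBufSize) ->
    case socket:recv(Socket, 0) of
        {ok, Data} when byte_size(Data) < RcvBufSize div 2 ->
            {ok, binary:copy(Data)};   %% drops the reference to the large buffer
        Other ->
            Other
    end.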

zzydxm avatar Jul 13 '22 16:07 zzydxm

The alternative to making a sub-binary is to reallocate the binary. I suspect creating a sub-binary was chosen for speed reasons. Your suggestion may be a good compromise. We will have a look at it.

I will ask around if it is feasible / easy to initiate an Erlang crashdump from a NIF. For an allocation error that might be better than aborting, depending on how it should be done.

RaimoNiskanen avatar Jul 14 '22 10:07 RaimoNiskanen

The alternative to making a sub-binary is to reallocate the binary. I suspect creating a sub-binary was chosen for speed reasons. Your suggestion may be a good compromise. We will have a look at it.

I will ask around if it is feasible / easy to initiate an Erlang crashdump from a NIF. For an allocation error that might be better than aborting, depending on how it should be done.

Sounds great, thanks!

zzydxm avatar Jul 14 '22 16:07 zzydxm

+1 on "Memory used by the binary returned by socket:recv(Socket, 0) should be on the order of the binary size, not the {otp,rcvbuf} size."

Also, the socket options are confusingly worded: on gen_tcp the option is buffer, and rcvbuf does something totally different.

vans163 avatar Jul 14 '22 17:07 vans163

@zzydxm: For a static driver it is entirely possible to initiate an erl_crashdump (on all platforms), and it is kind of a policy to do a crashdump for out-of-memory (facilitates debugging, as you pointed out), so that can be done.

@vans163: For socket we could introduce the otp protocol level, and then use the better name rcvbuf, borrowed from the socket level. inet_drv.c had to invent a different atom name for its internal option, and threw in some heuristics while at it (setting buffer when sndbuf or rcvbuf was set).

And I would say that {otp,rcvbuf} is not confusingly worded, but rather almost completely not described. We should improve that... ;-)

RaimoNiskanen avatar Jul 15 '22 09:07 RaimoNiskanen

And I would say that {otp,rcvbuf} is not confusingly worded, but rather almost completely not described. We should improve that... ;-)

Indeed. Also, would you know if it reserves the memory up front, or is that just the maximum it can read? (For example, in high-latency, high-throughput scenarios we need to grow the kernel max receive buffer to 512 MB or more [if the TCP sliding window decides it can use that much].) If we set the otp buffer to, say, 512 MB but there is never more than, say, 8 MB in the kernel receive buffer at any time socket:recv or gen_tcp:recv is called, how much memory is used on the OTP side: 8 MB or 512 MB?

vans163 avatar Jul 15 '22 13:07 vans163

The buffer is allocated just before calling the system call recv() and its size is passed to recv() as the size to receive.

When size 0 is used for the receive, the default buffer size is used, i.e. {otp,rcvbuf}.

When the receive operation has finished, the current code slices a sub-binary from the allocated binary; a speed optimization, but the originally allocated buffer remains allocated. There are alternative behaviours mentioned above.

When the sub-binary gets garbage collected the large buffer binary is freed.
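
A rough way to observe this from the shell, assuming the behaviour described above; the sizes in the comments are what one would expect with {otp,rcvbuf} set to 10 MB as in the reproduction, not measured output:

{ok, Data} = socket:recv(Socket, 0).
byte_size(Data).                     %% size of the data actually received, e.g. 65
binary:referenced_byte_size(Data).   %% size of the underlying buffer, ~10485760
binary:copy(Data).                   %% a copy that references only its own bytes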

RaimoNiskanen avatar Jul 15 '22 14:07 RaimoNiskanen

When size 0 is used for the receive, the default buffer size is used, i.e. {otp,rcvbuf}.

So a new 512 MB buffer would then be allocated on the OTP side each time recv is called (if {otp,rcvbuf} is set to 512 MB)?

vans163 avatar Jul 15 '22 14:07 vans163

For every receive size 0, yes. :-(

RaimoNiskanen avatar Jul 15 '22 14:07 RaimoNiskanen

For every receive size 0, yes. :-(

What about {active, once}?

It seems there are a lot of optimizations to be had then. One example is returning an iolist of smaller reads from the recv call: with a 512 MB buffer, keep allocating and reading, say, 1 MB at a time from recv (in the NIF/BIF/VM) instead of allocating and asking to read up to 512 MB at once. But that is beyond the scope of this issue.

vans163 avatar Jul 15 '22 16:07 vans163

All active modes use receive size 0 so they exhibit the same speed optimization / memory waste.

Whenever the libc call recv(2) returns, an active data message is delivered. For a stream socket, active mode is supposed to deliver data as soon as there is any, so the OS determines how big the delivered chunks are. For a datagram socket, any active mode must keep the message boundaries. So I do not see how we could deliver iolists of smaller reads.

The {otp,rcvbuf} option sets the buffer size to use for receive size 0 (and active mode). We should use the buffer size that the user sets. What we can do is something better than keeping lots of unused buffer space allocated in vain after returning the received data to the user.

RaimoNiskanen avatar Jul 18 '22 09:07 RaimoNiskanen

That makes sense.

So I do not see how we could deliver iolists of smaller reads.

What I meant here is that instead of allocating one large buffer (512 MB), we allocate 16 KB or 160 KB or so, and keep allocating, reading, and appending to the iolist until the socket returns EAGAIN, then return the iolist to managed Erlang code. To put the problem in perspective: with default-tuned Erlang + Linux you cannot receive faster than 4 Mbps off a TCP socket when the latency is 130 ms.

vans163 avatar Jul 18 '22 15:07 vans163

What I meant here is that instead of allocating one large buffer (512 MB), we allocate 16 KB or 160 KB or so, and keep ...

Alright! The {otp,buffer} option allows setting {N,Size}, which limits receiving to N iterations before giving up, but today that implements N attempts to fill the one buffer. This could maybe be changed as you suggest, or be extended, to do N iterations of allocating a new Size buffer and returning an iolist...

RaimoNiskanen avatar Jul 22 '22 09:07 RaimoNiskanen

What I meant here is that instead of allocating one large buffer (512 MB), we allocate 16 KB or 160 KB or so, and keep ...

Alright! The {otp,buffer} option allows setting {N,Size}, which limits receiving to N iterations before giving up, but today that implements N attempts to fill the one buffer. This could maybe be changed as you suggest, or be extended, to do N iterations of allocating a new Size buffer and returning an iolist...

Oh interesting, this is very useful. So it tries up to N times or until failure, whichever comes first? Or N times regardless of failure (EAGAIN)?

From the docs:

N specifies the number of read attempts to do in a tight loop before assuming no more data is pending.

So if read returns EAGAIN we stop?

vans163 avatar Jul 22 '22 13:07 vans163

Sorry. This option does not do precisely what I thought.

For socket:recv/1,2,3,4 with Length = 0 and Timeout = integer() > 0 | infinity, that is, a read-me-everything with a timeout > 0: if {otp,buffer} is set to {N,Size}, then the code in socket:recv/* loops over prim_socket:recv/4 and concatenates the received binaries. The loop ends when prim_socket:recv/4 does not fill a Size buffer, or after N iterations, and returns the concatenated binary.

There is no tight loop in the NIF.

If the recv(2) call returns EAGAIN, the NIF returns with a select handle and the Erlang code has to wait in receive for a select message and then call the NIF again.
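
For reference, a minimal sketch of that flow on the Erlang side, assuming the documented nowait return value and '$socket' select message (the recv_nowait/1 helper is made up, and exact tuple shapes may differ between OTP versions):

%% Hypothetical helper: call recv with nowait and, if no data is
%% available yet, wait for the select message before calling again.
recv_nowait(Socket) ->
    case socket:recv(Socket, 0, nowait) of
        {ok, Data} ->
            {ok, Data};
        {select, _SelectInfo} ->
            receive
                {'$socket', Socket, select, _Handle} ->
                    recv_nowait(Socket)
            end;
        {error, _} = Error ->
            Error
    end.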

RaimoNiskanen avatar Aug 01 '22 14:08 RaimoNiskanen

Thanks. Then it is probably not much of a difference to handle the looping/buffer-catting yourself, or to use the socket module directly to read off a gen_tcp socket with inet_backend set.

vans163 avatar Aug 01 '22 16:08 vans163

With inet_backend = socket, gen_tcp:recv/2,3 translates into socket:recv/2, via a buffering process (that implements packet modes, if needed). If you start nicking data from under the feet of this gen_tcp_socket process it might get confused. It should be safer to use the socket module completely, in that case?

And, yes, it should not make much difference to handle the looping/buffer-catting yourself. Then you can also, instead of concatenating into a result binary, create a list of binaries, if that is OK for the application.
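
A rough sketch of such a user-level loop, assuming the documented socket:recv/3 timeout behaviour ({error, timeout} with no data, {error, {timeout, Data}} with partial data); the recv_iolist functions and the chunk size handling are made up for illustration:

%% Hypothetical: read ChunkSize bytes at a time until nothing more is
%% immediately available, returning the chunks as an iolist (in order).
recv_iolist(Socket, ChunkSize) ->
    recv_iolist(Socket, ChunkSize, []).

recv_iolist(Socket, ChunkSize, Acc) ->
    case socket:recv(Socket, ChunkSize, 0) of
        {ok, Chunk} ->
            %% Full chunk received; there may be more pending, keep reading.
            recv_iolist(Socket, ChunkSize, [Chunk | Acc]);
        {error, timeout} ->
            {ok, lists:reverse(Acc)};
        {error, {timeout, Chunk}} ->
            {ok, lists:reverse([Chunk | Acc])};
        {error, _} = Error ->
            Error
    end.

Since each call passes an explicit Length, the NIF only allocates a ChunkSize buffer per call, avoiding the {otp,rcvbuf}-sized allocation that the size-0 path uses.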

A tight loop in the NIF remains a possible future improvement, and it remains an open question whether it would actually be an improvement: such a tight loop might give too much priority to incoming data and contribute to choking the system.

RaimoNiskanen avatar Aug 02 '22 08:08 RaimoNiskanen

Using the socket module directly makes using SSL complex.

And yeah, on the improvements, I think it is fine as is; the next major improvement would probably be io_uring support, so anything done to the existing implementation will be redundant anyway.

vans163 avatar Aug 02 '22 19:08 vans163