Socket NIF recv buffer allocation issue
Describe the bug socket:recv(Socket, 0) returns a sub-binary of the recv buffer. It can happen that even if only a small amount (>64 bytes) of data is received, the recv buffer is not freed until the sub-binary is freed. A moderate number of such sub-binaries can easily OOM the node.
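A quick way to observe this from the shell (Bin and Socket are placeholders, assuming a connection as in the reproduction steps below): byte_size/1 reports the small payload, while binary:referenced_byte_size/1 reports the whole recv buffer that the sub-binary keeps alive.
{ok, Bin} = socket:recv(Socket, 0).
byte_size(Bin).                    % e.g. 65, the bytes actually received
binary:referenced_byte_size(Bin).  % ~10485760, the whole otp rcvbuf kept alive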
Another related issue is that prim_socket_nif.c:8727 ESOCK_ASSERT( ALLOC_BIN(bufSz, &buf) );
will do an esock_abort coredump when it fails to allocate the binary (when the VSZ limit is reached), instead of the crashdump that BEAM produces when it cannot allocate a binary. It would be very helpful if the socket NIF could signal for a crashdump instead of just coredumping.
To Reproduce Note that this test uses about 50 GB of VSZ
terminal 1:
{ok, ListenSocket} = socket:open(inet6, stream, tcp).
ok = socket:bind(ListenSocket, #{
    family => inet6,
    port => 33334,
    addr => {0,0,0,0,0,0,0,1}
}).
ok = socket:listen(ListenSocket).
{ok, Socket} = socket:accept(ListenSocket).
socket:setopt(Socket, socket, rcvbuf, 10485760).
socket:setopt(Socket, otp, rcvbuf, 10485760).
A = [begin timer:sleep(100), socket:recv(Socket, 0, nowait) end || _ <- lists:seq(1, 5000)].
erlang:memory().
terminal 2:
{ok, Socket} = socket:open(inet6, stream, tcp).
ok = socket:connect(Socket, #{
    family => inet6,
    port => 33334,
    addr => {0,0,0,0,0,0,0,1}
}).
[begin timer:sleep(100), socket:send(Socket, crypto:strong_rand_bytes(65)) end || _ <- lists:seq(1, 5000)].
Expected behavior
- Memory used by the binary returned by socket:recv(Socket, 0) should be on the order of the binary size, not of the otp rcvbuf size.
- It would be better for the socket NIF to produce a crashdump instead of a coredump when OOM-ing.
Affected versions OTP 25
Additional context To solve the binary size issue, we could return a sub-binary if bytes received > (rcvbuf size) / 2, but do a copy into a new binary if bytes received < (rcvbuf size) / 2.
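Until something like that is in the NIF itself, the same heuristic can be approximated on the caller side; a minimal sketch (maybe_copy/2 is a hypothetical helper, RcvBufSz the configured otp rcvbuf size):
%% Copy small results so the big recv buffer can be garbage collected,
%% keep the sub-binary when it already covers most of the buffer.
maybe_copy(Bin, RcvBufSz) when byte_size(Bin) < RcvBufSz div 2 ->
    binary:copy(Bin);
maybe_copy(Bin, _RcvBufSz) ->
    Bin.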
The alternative to making a sub-binary is to reallocate the binary. I suspect creating a sub-binary was chosen for speed reasons. Your suggestion may be a good compromise. We will have a look at it.
I will ask around if it is feasible / easy to initiate an Erlang crashdump from a NIF. For an allocation error that might be better than aborting, depending on how it should be done.
Sounds great, thanks!
+1 on "Memory used by the binary returned by socket:recv(Socket, 0) should be on the order of the binary size, instead of the otp rcvbuf size."
Also, the socket opts are confusingly worded: on gen_tcp the option is buffer, and rcvbuf does something totally different.
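(For comparison, on the inet/gen_tcp side the userland buffer and the kernel SO_RCVBUF are two separate options, buffer and recbuf; a hedged example with TcpSocket as a placeholder:)
ok = inet:setopts(TcpSocket, [{recbuf, 1048576},   % kernel SO_RCVBUF
                              {buffer, 65536}]).   % userland driver buffer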
@zzydxm: For a static driver it is entirely possible to initiate an erl_crashdump (on all platforms), and it is kind of a policy to do a crashdump for out-of-memory (facilitates debugging, as you pointed out), so that can be done.
@vans163: For socket we could introduce the otp protocol level, and then we could use the better name rcvbuf borrowed from the socket level. inet_drv.c had to invent a different atom name for the internal option, and threw in some heuristics while at it (setting buffer when sndbuf or rcvbuf was set).
And I would say that {otp,rcvbuf} is not confusingly worded, but rather almost completely not described. We should improve that... ;-)
Indeed. Also, would you know if it reserves the memory upfront, or is that just the maximum it can read? For example, in high-latency, high-throughput scenarios we need to grow the kernel max_rcvbuf to 512MB or more (if the TCP sliding window decides it can use that much). If we set the otp buffer to, say, 512MB but there is never more than, say, 8MB in the kernel recvbuf at the time socket|gen_tcp:recv is called, how much memory is used on the OTP side? 8MB or 512MB?
The buffer is allocated just before calling the system call recv() and its size is passed to recv() as the size to receive.
When size 0 is used for receive then the default buffer size is used, i.e. {otp,rcvbuf}.
When the receive operation has finished, the current code slices a sub-binary from the allocated binary; a speed optimization, which means the originally allocated buffer remains allocated. There are alternative behaviours mentioned above.
When the sub-binary gets garbage collected the large buffer binary is freed.
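(Continuing the reproduction in terminal 1, this can be seen by dropping the binding and collecting; the shell may need an extra garbage collection before the counter drops. A rough sketch:)
erlang:memory(binary).    % large while A still references the sub-binaries
f(A).                     % forget the shell binding
erlang:garbage_collect().
erlang:memory(binary).    % drops once the buffer binaries have been freed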
So would a new 512MB buffer then be allocated on the OTP side each time recv is called (if {otp,rcvbuf} is set to 512MB)?
For every receive size 0, yes. :-(
What about {active, once}?
It seems there are a lot of optimizations to be had then, for example returning an iolist of smaller reads from the recv call (with a 512MB buffer, keep allocating+reading 1MB chunks from recv in the NIF/BIF/VM instead of allocating and asking to read up to 512MB), but that is beyond the scope of this.
All active modes use receive size 0, so they exhibit the same speed optimization / memory waste.
Whenever the libc call recv(2) returns, an active data message is delivered. For a stream socket the active mode is supposed to return as soon as there is data, so the OS determines how big the delivered chunks are. For a datagram socket any active mode must keep the message boundaries. So I do not see how we could deliver iolists of smaller reads.
The {otp,rcvbuf} option sets the buffer size to use for receive size 0 (and active mode). We should use the buffer size that the user sets. What we can do is something better than keeping lots of unused buffer space allocated in vain after returning the received data to the user.
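(In the meantime, one way to limit the waste is simply not to mirror a huge kernel buffer on the OTP side; a sketch, with Sock as a placeholder:)
socket:setopt(Sock, socket, rcvbuf, 8388608).  % large kernel SO_RCVBUF for throughput
socket:setopt(Sock, otp, rcvbuf, 65536).       % modest per-recv allocation on the OTP side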
That makes sense.
So I do not see how we could deliver iolists of smaller reads.
What I meant here is: instead of allocating 1 large buffer (512MB), we allocate 16kB or 160kB or something, and keep allocating+reading+concatenating the iolist until the socket returns EAGAIN, then return the iolist to managed Erlang code. To put the problem in perspective: with default-tuned Erlang + Linux you cannot recv faster than 4mbps off a TCP socket when there is 130ms latency.
Alright! The {otp,buffer} option allows setting {N,Size}, which limits receiving to N iterations before giving up, but today that implements N attempts to fill the one buffer. This could maybe be changed as you suggest, or be extended, to do N iterations of allocating a new Size buffer and returning an iolist...
Oh interesting, this is very useful. So it tries up to N times OR until failure, whichever comes first? Or does it try N times regardless of failure (EAGAIN)?
From the docs:
N specifies the number of read attempts to do in a tight loop before assuming no more data is pending.
So if read returns EAGAIN we stop?
Sorry. This option does not do precisely what I thought.
For socket:recv/1,2,3,4 with Length = 0 and Timeout = integer()>0 | infinity, that is, a read-me-everything with a timeout > 0: if {otp,buffer} is set to {N,Size}, then the code in socket:recv/* loops over prim_socket:recv/4 and concatenates the received binaries. The loop ends if prim_socket:recv/4 does not fill a Size buffer, or after N loops, and returns the concatenated binary.
There is no tight loop in the NIF.
If the recv(2) call returns EAGAIN, the NIF returns with a select handle and the Erlang code has to wait in receive for a select message and then call the NIF again.
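Roughly, on the Erlang side that dance looks like this (a sketch; recv_nowait/1 is a hypothetical helper, the messages are the documented '$socket' select/abort tuples):
recv_nowait(Socket) ->
    case socket:recv(Socket, 0, nowait) of
        {ok, Data} ->
            {ok, Data};
        {select, _SelectInfo} ->
            %% no data yet: wait for the select message, then call the NIF again
            receive
                {'$socket', Socket, select, _Handle} ->
                    recv_nowait(Socket);
                {'$socket', Socket, abort, {_Handle, Reason}} ->
                    {error, Reason}
            end;
        {error, _} = Error ->
            Error
    end.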
Thanks. Then it probably is not too much of a difference to handle the looping/buffer-catting yourself, or to use the socket module directly to read off a gen_tcp socket with inet_backend set.
With inet_backend = socket, gen_tcp:recv/2,3 translates into socket:recv/2, via a buffering process (that implements packet modes, if needed). If you start nicking data from under the feet of this gen_tcp_socket process it might get confused. It should be safer to use the socket module completely, in that case?
And, yes, it should not be much of a difference to handle the looping/buffer-catting yourself. Then you can also, instead of concatenating into a result binary, create a list of binaries, if that is OK for the application (sketched below).
It remains a future improvement to implement a tight loop in the NIF, and it remains an open question whether it would be an improvement. Such a tight loop might give too much priority to incoming data and contribute to choking the system.
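A minimal sketch of such a user-level loop (hypothetical helper; each {ok, Data} chunk is at most the configured {otp,rcvbuf} size), returning a list of binaries instead of concatenating:
%% Collect whatever is currently readable as a list of binaries.
recv_available(Socket) ->
    recv_available(Socket, []).

recv_available(Socket, Acc) ->
    case socket:recv(Socket, 0, nowait) of
        {ok, Data} ->
            recv_available(Socket, [Data | Acc]);
        {select, SelectInfo} ->
            %% nothing more right now (EAGAIN in the NIF); drop the pending select
            _ = socket:cancel(Socket, SelectInfo),
            {ok, lists:reverse(Acc)};
        {error, Reason} ->
            {error, Reason, lists:reverse(Acc)}
    end.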
Using the socket module directly makes using SSL complex.
And yeah, regarding the improvements, I think it is fine as is; the next major improvement would probably be io_uring support, so anything done to the existing implementation would be redundant anyway.