redis icon indicating copy to clipboard operation
redis copied to clipboard

[BUG] During Bgsave, failed to disconnect idle client connection properly

Open Neway13 opened this issue 3 years ago • 6 comments

Describe the bug

During Bgsave, the client’s command blocked for 4771ms and threw an IOException. By capturing packets, the server did not properly disconnect idle client connection. Normally, the server would have sent the FIN packet, but it didn't.

To reproduce

Service side:

Deployment: A 6 nodes cluster.

Key configurations:

timeout 10
save '60 1'
loglevel verbose

used_memory_human: 2.06G

Client side

Pool size: The minimum size is 8

Command frequency: One command every 3 seconds.

Expected behavior

During Bgsave, the server can properly disconnect idle client connections.

Additional information

The client sends the set command at 01:17:39.940 and throws an exception at 01:17:44.711. It blocked for 4771ms.

2022-01-08 01:17:44.711  WARN 5468 --- [ntloop-thread-7] c.c.m.c.m.c.RedisTestController : The time spent on the set command:4771ms
2022-01-08 01:17:44.711 ERROR 5468 --- [ntloop-thread-7] c.c.m.c.m.c.RedisTestController : error:

java.io.IOException: An existing connection was forcibly closed by the remote host
	at sun.nio.ch.SocketDispatcher.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:745)

packets of client side:

server: 192.168.100.13:6371

client: 192.168.100.3:53901

Note: The time difference between the client and the server is less than 1s.

image

The log of server:

latest_fork_usec:41966
rdb_last_bgsave_time_sec:16

image

I have captured the packets and analyzed the connection that timed out and threw an IOException.

Client action Time line server action
last packet 01:17:27.931
01:17:28.698 log: Background saving started
Expected but did not receive the fin packet 01:17:38.080 log: Closing idle client
command send 01:17:39.940
01:17:44.100 log:Background saving terminated with success
connection threw an IOException 01:17:44.711

Two phenomena indicate that there is something wrong with the server:

  1. 10 seconds after the last packet, the server had a log that closed the idle connection, but the client does not receive it.
  2. An exception is not thrown until the end of the bgsave.

I will try to analyze the server code and solve it. Thanks.

Neway13 avatar Jan 08 '22 03:01 Neway13

i'm guessing that since the fork() child is holding the file descriptors, the close in unlinkClient -> connClose isn't sufficient, and maybe we better use shutdown() instead of close(). @yossigo please take a look.

oranagra avatar Jan 09 '22 07:01 oranagra

@oranagra yes, this is the root cause and using shutdown() is the solution. But we'll need to consider the impact of getting a EBADF in such cases, and possibly make sure childs never end up performing a shutdown() on their side.

yossigo avatar Jan 09 '22 08:01 yossigo

@yossigo FYI, in the distant past i added a shutdown when disconnecting a replica, in order to cause the diskless bgsave child to fail. at that time, Salvatore didn't want to use it on every replica, and added a c->replstate == SLAVE_STATE_WAIT_BGSAVE_END condition. i don't recall what where the arguments. possibly only paranoia.

oranagra avatar Jan 09 '22 08:01 oranagra

Could be happen also during a generic fork (not only BGSAVE)? I'v the same issue on client that was connected to the redis server without any fork running, when the server start the fork of the process (maybe for load reason), the connected client stop to works until fork stops or will be killed, than have an Exception.

Do you think that is possible to customize a proper module adding a command to start and stop a RedisModuleFork properly?

awesomecit avatar Jun 15 '22 20:06 awesomecit

when the server start the fork of the process (maybe for load reason), the connected client stop to works until fork stops or will be killed

@awesomecit what do you mean by "stop to work"? not being able to issue commands and read responses? that makes no sense to me, and i'm quite sure we would have noticed already if such a problem existed.

the problem that we discussed was about the fact that closing the socket on the parent process, won't terminate the connection as long as the fork still has a hold of the file descriptor.

oranagra avatar Jun 16 '22 05:06 oranagra

@yossigo maybe now is the time to try and change this? (use shutdown), when we're far enough from the next release? maybe we can try being conservative, and only use shutdown if a fork is active? anyone else has any concerns or suggestions?

oranagra avatar Sep 29 '22 09:09 oranagra