[BUG] During Bgsave, failed to disconnect idle client connection properly
Describe the bug
During BGSAVE, the client's command blocked for 4771ms and then threw an IOException. A packet capture shows that the server did not properly disconnect the idle client connection: normally the server would have sent a FIN packet, but it never did.
To reproduce
Server side:
Deployment: a 6-node cluster.
Key configurations:
timeout 10
save '60 1'
loglevel verbose
used_memory_human: 2.06G
Client side
Pool size: minimum of 8 connections
Command frequency: One command every 3 seconds.
Expected behavior
During BGSAVE, the server should properly disconnect idle client connections.
Additional information
The client sent the SET command at 01:17:39.940 and threw an exception at 01:17:44.711, after blocking for 4771ms.
2022-01-08 01:17:44.711 WARN 5468 --- [ntloop-thread-7] c.c.m.c.m.c.RedisTestController : The time spent on the set command:4771ms
2022-01-08 01:17:44.711 ERROR 5468 --- [ntloop-thread-7] c.c.m.c.m.c.RedisTestController : error:
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:745)
Packet capture on the client side:
server: 192.168.100.13:6371
client: 192.168.100.3:53901
Note: The time difference between the client and the server is less than 1s.
Server log:
latest_fork_usec:41966
rdb_last_bgsave_time_sec:16
I have captured the packets and analyzed the connection that timed out and threw an IOException.
| Client action | Timeline | Server action |
|---|---|---|
| last packet sent | 01:17:27.931 | |
| | 01:17:28.698 | log: Background saving started |
| FIN packet expected but not received | 01:17:38.080 | log: Closing idle client |
| command sent | 01:17:39.940 | |
| | 01:17:44.100 | log: Background saving terminated with success |
| connection threw an IOException | 01:17:44.711 | |
Two phenomena indicate that there is something wrong with the server:
- 10 seconds after the last packet, the server logged that it closed the idle connection, but the client never received a FIN.
- The exception is not thrown on the client until the BGSAVE ends.
I will try to analyze the server code and solve it. Thanks.
I'm guessing that since the `fork()` child is holding the file descriptors, the `close` in `unlinkClient` -> `connClose` isn't sufficient, and maybe we'd better use `shutdown()` instead of `close()`.
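For concreteness, here's a minimal standalone C sketch of that mechanism (this is not Redis code; the port number and sleep duration are arbitrary): the forked child keeps a duplicate reference to the client socket, so `close()` in the parent doesn't emit a FIN until the child exits, while `shutdown()` forces the FIN immediately.

```c
/*
 * Sketch only: shows why close() in the parent does not terminate a TCP
 * connection while a forked child still holds the fd, and why shutdown()
 * does. Watch the traffic with e.g. `tcpdump -i lo port 7777`.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(listener, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7777);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 1);

    /* "client" end of the connection */
    int client = socket(AF_INET, SOCK_STREAM, 0);
    connect(client, (struct sockaddr *)&addr, sizeof(addr));
    int conn = accept(listener, NULL, NULL);   /* "server" end */

    pid_t pid = fork();                        /* stands in for the BGSAVE child */
    if (pid == 0) {
        sleep(15);                             /* child keeps fd 'conn' open */
        _exit(0);
    }

    /* Parent decides to drop the idle client. With close() alone, no FIN is
     * sent until the child exits, because the child still references the same
     * socket. Uncommenting the shutdown() forces the FIN right away. */
    /* shutdown(conn, SHUT_RDWR); */
    close(conn);

    printf("parent closed its fd; watch when the FIN actually goes out\n");
    waitpid(pid, NULL, 0);
    close(client);
    close(listener);
    return 0;
}
```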
@yossigo please take a look.
@oranagra yes, this is the root cause and using `shutdown()` is the solution. But we'll need to consider the impact of getting an EBADF in such cases, and possibly make sure children never end up performing a `shutdown()` on their side.
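A hedged sketch of what a shutdown-before-close helper might look like (not the actual patch; the function name is made up), tolerating ENOTCONN/EBADF in case the descriptor was already invalidated elsewhere:

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: force a FIN even if a forked child still holds a
 * reference to the socket, then release our descriptor. */
static void conn_close_with_shutdown(int fd) {
    if (fd < 0) return;
    if (shutdown(fd, SHUT_RDWR) == -1 &&
        errno != ENOTCONN && errno != EBADF) {
        /* unexpected error: worth logging, but still release the fd below */
    }
    close(fd);
}
```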
@yossigo FYI, in the distant past I added a `shutdown` when disconnecting a replica, in order to cause the diskless bgsave child to fail. At that time, Salvatore didn't want to use it on every replica, and added a `c->replstate == SLAVE_STATE_WAIT_BGSAVE_END` condition. I don't recall what the arguments were; possibly only paranoia.
Could this also happen during a generic fork (not only BGSAVE)? I have the same issue with a client that was connected to the Redis server while no fork was running: when the server starts a fork of the process (maybe for load reasons), the connected client stops working until the fork finishes or is killed, and then it gets an exception.
Do you think it would be possible to write a custom module that adds commands to start and stop a RedisModuleFork properly?
> when the server starts a fork of the process (maybe for load reasons), the connected client stops working until the fork finishes or is killed
@awesomecit what do you mean by "stops working"? Not being able to issue commands and read responses? That makes no sense to me, and I'm quite sure we would have noticed already if such a problem existed.
The problem that we discussed was about the fact that closing the socket on the parent process won't terminate the connection as long as the fork still has a hold of the file descriptor.
@yossigo maybe now is the time to try and change this (use `shutdown`), while we're far enough from the next release?
maybe we can try being conservative, and only use shutdown if a fork is active?
Does anyone else have any concerns or suggestions?
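To illustrate the "conservative" variant above, a small hedged sketch (the `fork_in_progress()` helper is hypothetical, standing in for however the server would check for an active BGSAVE/AOF-rewrite child):

```c
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for a check of whether a background child
 * (BGSAVE / AOF rewrite) is currently alive. */
static bool fork_in_progress(void) { return true; }

/* Close a client descriptor; only add the shutdown() when a fork is
 * active, since that is the only case where close() alone fails to send
 * a FIN (the child still holds a duplicate of the fd). */
static void close_client_fd(int fd) {
    if (fd < 0) return;
    if (fork_in_progress())
        shutdown(fd, SHUT_RDWR);
    close(fd);
}
```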