The write system call in the container is stuck, and the copier goroutines in dockerd are waiting
Description
At 2023-01-07 18:39:25,228 we found that a log write issued by the Java application was stuck. The stack of the Java thread was:
"tomcat-exec-1" #142 daemon prio=5 os_prio=0 tid=0x00007efc88300000 nid=0x9b runnable [0x00007efbfe780000]
java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
- locked <0x00000006c0262dc0> (a java.io.BufferedOutputStream)
at java.io.PrintStream.write(PrintStream.java:480)
- locked <0x00000006c0262da0> (a java.io.PrintStream)
at org.apache.tomcat.util.log.SystemLogHandler.write(SystemLogHandler.java:180)
at org.apache.logging.log4j.core.util.CloseShieldOutputStream.write(CloseShieldOutputStream.java:53)
at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:250)
- eliminated <0x00000006c08625f8> (a org.apache.logging.log4j.core.appender.OutputStreamManager)
at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:283)
- locked <0x00000006c08625f8> (a org.apache.logging.log4j.core.appender.OutputStreamManager)
at org.apache.logging.log4j.core.appender.OutputStreamManager.drain(OutputStreamManager.java:343)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.drainIfByteBufferFull(TextEncoderHelper.java:260)
- locked <0x00000006c08625f8> (a org.apache.logging.log4j.core.appender.OutputStreamManager)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.writeAndEncodeAsMuchAsPossible(TextEncoderHelper.java:199)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeChunkedText(TextEncoderHelper.java:159)
- locked <0x00000006c08625f8> (a org.apache.logging.log4j.core.appender.OutputStreamManager)
at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(TextEncoderHelper.java:58)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:68)
at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:227)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:59)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:215)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:208)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:199)
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:161)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:134)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:125)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:89)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:675)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:633)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:616)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:552)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
at org.apache.logging.log4j.core.Logger.log(Logger.java:161)
at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2205)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2159)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2142)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2052)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1923)
at org.apache.logging.log4j.spi.AbstractLogger.error(AbstractLogger.java:881)
at com.cqtouch.web.smallweb.SmallWebLogHandlerInterceptor.logSend(SmallWebLogHandlerInterceptor.java:90)
at com.cqtouch.web.smallweb.SmallWebLogHandlerInterceptor.postHandle(SmallWebLogHandlerInterceptor.java:73)
at org.springframework.web.servlet.HandlerExecutionChain.applyPostHandle(HandlerExecutionChain.java:152)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:940)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:867)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:951)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:853)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:682)
at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:765)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at com.cqtouch.web.smallweb.CrossFilter.doFilter(CrossFilter.java:43)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at com.yt.ytdeep.interceptor.GizFilter.doFilter(GizFilter.java:31)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:106)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:177)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:543)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:367)
at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:639)
at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:885)
at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1688)
at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
- locked <0x00000006ed6d1fd0> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:750)
From nid=0x9b we know that the LWP id of this thread inside the container is 155 (hex 0x9b = decimal 155).
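For reference, a minimal sketch of that conversion, using a hypothetical helper rather than any tool from this report:

// Hypothetical helper: convert the hexadecimal nid printed by jstack into the
// decimal LWP id used by /proc and strace.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func nidToLWP(nid string) (int64, error) {
	return strconv.ParseInt(strings.TrimPrefix(nid, "0x"), 16, 64)
}

func main() {
	lwp, _ := nidToLWP("0x9b")
	fmt.Println(lwp) // prints 155; the later nid=0xb4 maps to 180 the same way
}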
We checked the in-flight system call with strace:
root@c8273fb5cd8c:/opt/tool# strace -T -p 155
strace: Process 155 attached
write(1, "/wf8aAEYAPAD8/3n/N/9i/8j/PwBYAD8"..., 4352
The thread is stuck in the write system call, writing to the standard output stream (fd 1).
We then looked at its kernel stack:
root@c8273fb5cd8c:/opt/tool# cat /proc/155/stack
[<0>] tty_write_lock+0x2a/0x50
[<0>] tty_write+0xbc/0x2a0
[<0>] __vfs_write+0x1b/0x40
[<0>] vfs_write+0xb9/0x1a0
[<0>] ksys_write+0x67/0xe0
[<0>] __x64_sys_write+0x1a/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
It appears to be waiting for the tty write lock in the kernel.
Meanwhile, we found another stuck FileOutputStream.writeBytes call in the Java thread dump:
"http-nio-80-Poller" #167 daemon prio=5 os_prio=0 tid=0x00007efc88871000 nid=0xb4 runnable [0x00007efb55a20000]
java.lang.Thread.State: RUNNABLE
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
- locked <0x00000006c01dbce0> (a java.io.BufferedOutputStream)
at java.io.PrintStream.write(PrintStream.java:480)
- locked <0x00000006c01dbcc0> (a java.io.PrintStream)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
- locked <0x00000006c0e50ff0> (a java.io.OutputStreamWriter)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at java.util.logging.StreamHandler.flush(StreamHandler.java:259)
- locked <0x00000006c0e50f70> (a java.util.logging.ConsoleHandler)
at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:117)
at java.util.logging.Logger.log(Logger.java:738)
at java.util.logging.Logger.doLog(Logger.java:765)
at java.util.logging.Logger.logp(Logger.java:1042)
at org.apache.juli.logging.DirectJDKLog.log(DirectJDKLog.java:175)
at org.apache.juli.logging.DirectJDKLog.warn(DirectJDKLog.java:131)
at org.apache.tomcat.util.net.AbstractEndpoint.processSocket(AbstractEndpoint.java:1187)
at org.apache.tomcat.util.net.NioEndpoint$Poller.processKey(NioEndpoint.java:782)
at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:749)
at java.lang.Thread.run(Thread.java:750)
From nid=0xb4 we know that the LWP id of this thread inside the container is 180 (hex 0xb4 = decimal 180).
We checked its in-flight system call with strace as well:
root@c8273fb5cd8c:/opt/tool# strace -p 180
strace: Process 180 attached
write(2, "07-Jan-2023 18:41:12.100 WARNING"..., 1007
This thread is likewise stuck in the write system call, writing to the standard error stream (fd 2).
Its kernel stack:
root@c8273fb5cd8c:/opt/tool# cat /proc/180/stack
[<0>] wait_woken+0x48/0x80
[<0>] n_tty_write+0x3f0/0x4c0
[<0>] tty_write+0x195/0x2a0
[<0>] __vfs_write+0x1b/0x40
[<0>] vfs_write+0xb9/0x1a0
[<0>] ksys_write+0x67/0xe0
[<0>] __x64_sys_write+0x1a/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
It appears to be waiting in the kernel to be woken up.
We then checked the container's output with docker logs and found that it was no longer scrolling, so we suspected that dockerd's copier had stopped draining the streams. We therefore dumped dockerd's goroutine stacks and found that both Copier goroutines are waiting to be woken up:
goroutine 102986 [sync.Cond.Wait, 12819 minutes]:
sync.runtime_notifyListWait(0xc0010c9c50, 0x0)
/usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x0?)
/usr/local/go/src/sync/cond.go:56 +0x8c
github.com/docker/docker/pkg/ioutils.(*BytesPipe).Read(0xc0010c9c00, {0xc000d9a000, 0x800, 0x4000})
/go/src/github.com/docker/docker/pkg/ioutils/bytespipe.go:135 +0x8e
github.com/docker/docker/daemon/logger.(*Copier).copySrc(0xc0010c9c80, {0x562b522ac1e4, 0x6}, {0x7f67e829bef0, 0xc0010c9c00})
/go/src/github.com/docker/docker/daemon/logger/copier.go:86 +0x205
created by github.com/docker/docker/daemon/logger.(*Copier).Run
/go/src/github.com/docker/docker/daemon/logger/copier.go:48 +0x6c
goroutine 102985 [sync.Cond.Wait, 5657 minutes]:
sync.runtime_notifyListWait(0xc0010c9bd0, 0xc1adb)
/usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0xc0017ee5a0?)
/usr/local/go/src/sync/cond.go:56 +0x8c
github.com/docker/docker/pkg/ioutils.(*BytesPipe).Read(0xc0010c9b80, {0xc000e28ffd, 0x800, 0x1003})
/go/src/github.com/docker/docker/pkg/ioutils/bytespipe.go:135 +0x8e
github.com/docker/docker/daemon/logger.(*Copier).copySrc(0xc0010c9c80, {0x562b522ac1ea, 0x6}, {0x7f67e829bef0, 0xc0010c9b80})
/go/src/github.com/docker/docker/daemon/logger/copier.go:86 +0x205
created by github.com/docker/docker/daemon/logger.(*Copier).Run
/go/src/github.com/docker/docker/daemon/logger/copier.go:48 +0x6c
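(As an aside: since the daemon is running with debug mode enabled, a goroutine dump like the one above can be pulled from the engine API's pprof endpoint; sending SIGUSR1 to dockerd is another way to get one. A hedged sketch, assuming the daemon listens on the default /var/run/docker.sock:)

// Fetch a full goroutine dump (debug=2 prints every goroutine with its stack
// and wait time, the format quoted above) from a debug-enabled dockerd over
// the unix socket. The socket path is an assumption; adjust as needed.
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			// Route all requests through the local Docker API socket.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}
	resp, err := client.Get("http://localhost/debug/pprof/goroutine?debug=2")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // write the raw dump to stdout
}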
The dockerd goroutine dump was captured at 2023-01-11 16:56:05; 5657 minutes earlier is 2023-01-07 18:39:05, which is roughly when the thread with LWP id 155 got stuck.
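The sync.Cond.Wait frames are what BytesPipe.Read shows while its internal buffer is empty and it is waiting for the next Write, fed from the container's stdio, to signal it. A much simplified sketch of that pattern, for illustration only and not the actual github.com/docker/docker/pkg/ioutils.BytesPipe implementation:

// Simplified model of a blocking byte pipe: Read parks on a sync.Cond while
// the buffer is empty and is only woken by Write or Close. This mirrors the
// pattern behind the sync.Cond.Wait frames in the dump; it is not the real
// BytesPipe code.
package bytespipesketch

import (
	"io"
	"sync"
)

type pipe struct {
	mu     sync.Mutex
	wait   *sync.Cond
	buf    []byte
	closed bool
}

func newPipe() *pipe {
	p := &pipe{}
	p.wait = sync.NewCond(&p.mu)
	return p
}

func (p *pipe) Write(b []byte) (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.closed {
		return 0, io.ErrClosedPipe
	}
	p.buf = append(p.buf, b...)
	p.wait.Broadcast() // wake any Read blocked below
	return len(b), nil
}

func (p *pipe) Read(b []byte) (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.buf) == 0 {
		if p.closed {
			return 0, io.EOF
		}
		p.wait.Wait() // the sync.Cond.Wait seen in the goroutine dump
	}
	n := copy(b, p.buf)
	p.buf = p.buf[n:]
	return n, nil
}

func (p *pipe) Close() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.closed = true
	p.wait.Broadcast()
	return nil
}

A Read parked this way simply means no new bytes have reached the pipe since the times shown above.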
We don't know why this happens. What could be the cause of this problem, and how should we investigate further?
Reproduce
Cannot reproduce.
Expected behavior
Write system calls issued by the application do not get stuck.
docker version
Client: Docker Engine - Community
Version: 20.10.18
API version: 1.41
Go version: go1.18.6
Git commit: b40c2f6
Built: Thu Sep 8 23:11:45 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.18
API version: 1.41 (minimum version 1.12)
Go version: go1.18.6
Git commit: e42327a
Built: Thu Sep 8 23:09:37 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.8
GitCommit: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
compose: Docker Compose (Docker Inc., v2.10.2)
scan: Docker Scan (Docker Inc., v0.17.0)
Server:
Containers: 12
Running: 1
Paused: 0
Stopped: 11
Images: 12
Server Version: 20.10.18
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.4.0-125-generic
Operating System: Ubuntu 20.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.36GiB
Name: shared-product-2
ID: BDEL:L6M6:RU2R:FEZQ:DPKW:SCEN:TTZG:J5CC:FWKY:UM5X:ZBFT:4QK5
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 60
Goroutines: 59
System Time: 2023-01-11T16:53:19.272620338+08:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No swap limit support
Additional Info
The write system call in this container is still stuck, so we can keep using this container for troubleshooting. This is the second time this problem has occurred recently.
Could you also check if the host's kernel log contains anything helpful/unusual?
There are no related errors in the host's kernel log around that time. Any suggestions for further troubleshooting?
There are no goroutines blocked waiting on a condition variable inside ioutils.(*BytesPipe).Write, so this isn't a case of that known deadlock bug. There are four goroutines inside fifo.(*fifo).Read, which I assume are the copier goroutines for stdout and stderr for the one running container plus one exec. All four goroutines are blocked in I/O wait, so nothing appears amiss at the application level. My best explanation is that either the Go runtime is broken, or the kernel. And it just might be the kernel. (Related Launchpad issue.)
Thanks, I will upgrade the kernel version and check if the problem persists.