
The write system call in the container is stuck, and the copier goroutine in Dockerd is waiting


Description

At 2023-01-07 18:39:25,228 we found that a log write issued by our Java application was stuck. The Java thread's stack was:

"tomcat-exec-1" #142 daemon prio=5 os_prio=0 tid=0x00007efc88300000 nid=0x9b runnable [0x00007efbfe780000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
	- locked <0x00000006c0262dc0> (a java.io.BufferedOutputStream)
	at java.io.PrintStream.write(PrintStream.java:480)
	- locked <0x00000006c0262da0> (a java.io.PrintStream)
	at org.apache.tomcat.util.log.SystemLogHandler.write(SystemLogHandler.java:180)
	at org.apache.logging.log4j.core.util.CloseShieldOutputStream.write(CloseShieldOutputStream.java:53)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:250)
	- eliminated <0x00000006c08625f8> (a org.apache.logging.log4j.core.appender.OutputStreamManager)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.flushBuffer(OutputStreamManager.java:283)
	- locked <0x00000006c08625f8> (a org.apache.logging.log4j.core.appender.OutputStreamManager)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.drain(OutputStreamManager.java:343)
	at org.apache.logging.log4j.core.layout.TextEncoderHelper.drainIfByteBufferFull(TextEncoderHelper.java:260)
	- locked <0x00000006c08625f8> (a org.apache.logging.log4j.core.appender.OutputStreamManager)
	at org.apache.logging.log4j.core.layout.TextEncoderHelper.writeAndEncodeAsMuchAsPossible(TextEncoderHelper.java:199)
	at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeChunkedText(TextEncoderHelper.java:159)
	- locked <0x00000006c08625f8> (a org.apache.logging.log4j.core.appender.OutputStreamManager)
	at org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(TextEncoderHelper.java:58)
	at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:68)
	at org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
	at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:227)
	at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:59)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:215)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:208)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:199)
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:161)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:134)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:125)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:89)
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:675)
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:633)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:616)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:552)
	at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
	at org.apache.logging.log4j.core.Logger.log(Logger.java:161)
	at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2205)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2159)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2142)
	at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2052)
	at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1923)
	at org.apache.logging.log4j.spi.AbstractLogger.error(AbstractLogger.java:881)
	at com.cqtouch.web.smallweb.SmallWebLogHandlerInterceptor.logSend(SmallWebLogHandlerInterceptor.java:90)
	at com.cqtouch.web.smallweb.SmallWebLogHandlerInterceptor.postHandle(SmallWebLogHandlerInterceptor.java:73)
	at org.springframework.web.servlet.HandlerExecutionChain.applyPostHandle(HandlerExecutionChain.java:152)
	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:940)
	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:867)
	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:951)
	at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:853)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:682)
	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:827)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:765)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at com.cqtouch.web.smallweb.CrossFilter.doFilter(CrossFilter.java:43)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at com.yt.ytdeep.interceptor.GizFilter.doFilter(GizFilter.java:31)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:88)
	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:106)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:177)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:543)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:367)
	at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:639)
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:885)
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1688)
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
	- locked <0x00000006ed6d1fd0> (a org.apache.tomcat.util.net.NioEndpoint$NioSocketWrapper)
	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:750)

From nid=0x9b in the thread dump we know that the LWP ID inside the container is 155.
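For reference, the nid that the JVM prints is just the native thread ID in hex, so getting the decimal LWP ID is a plain base conversion. A throwaway Go snippet (purely illustrative, nothing Docker-specific) covering both threads discussed in this report:

package main

import (
	"fmt"
	"strconv"
)

// The JVM thread dump prints the native thread id (nid) in hex;
// /proc/<pid>/task and strace expect the decimal LWP id.
func main() {
	for _, nid := range []string{"0x9b", "0xb4"} {
		lwp, err := strconv.ParseUint(nid, 0, 64) // base 0 honours the 0x prefix
		if err != nil {
			panic(err)
		}
		fmt.Printf("nid %s -> LWP %d\n", nid, lwp) // 0x9b -> 155, 0xb4 -> 180
	}
}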

We checked the corresponding system call with strace:

root@c8273fb5cd8c:/opt/tool# strace -T -p 155
strace: Process 155 attached
write(1, "/wf8aAEYAPAD8/3n/N/9i/8j/PwBYAD8"..., 4352

The thread is stuck in a write(2) call to file descriptor 1 (standard output); the missing return value in the strace output shows the call has not yet returned.

We then looked at the thread's kernel stack:

root@c8273fb5cd8c:/opt/tool# cat /proc/155/stack
[<0>] tty_write_lock+0x2a/0x50
[<0>] tty_write+0xbc/0x2a0
[<0>] __vfs_write+0x1b/0x40
[<0>] vfs_write+0xb9/0x1a0
[<0>] ksys_write+0x67/0xe0
[<0>] __x64_sys_write+0x1a/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

The thread appears to be blocked waiting for the TTY write lock in the kernel.

Meanwhile, we found another stuck FileOutputStream.writeBytes call in a second Java thread:

"http-nio-80-Poller" #167 daemon prio=5 os_prio=0 tid=0x00007efc88871000 nid=0xb4 runnable [0x00007efb55a20000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:326)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
	- locked <0x00000006c01dbce0> (a java.io.BufferedOutputStream)
	at java.io.PrintStream.write(PrintStream.java:480)
	- locked <0x00000006c01dbcc0> (a java.io.PrintStream)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
	- locked <0x00000006c0e50ff0> (a java.io.OutputStreamWriter)
	at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
	at java.util.logging.StreamHandler.flush(StreamHandler.java:259)
	- locked <0x00000006c0e50f70> (a java.util.logging.ConsoleHandler)
	at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:117)
	at java.util.logging.Logger.log(Logger.java:738)
	at java.util.logging.Logger.doLog(Logger.java:765)
	at java.util.logging.Logger.logp(Logger.java:1042)
	at org.apache.juli.logging.DirectJDKLog.log(DirectJDKLog.java:175)
	at org.apache.juli.logging.DirectJDKLog.warn(DirectJDKLog.java:131)
	at org.apache.tomcat.util.net.AbstractEndpoint.processSocket(AbstractEndpoint.java:1187)
	at org.apache.tomcat.util.net.NioEndpoint$Poller.processKey(NioEndpoint.java:782)
	at org.apache.tomcat.util.net.NioEndpoint$Poller.run(NioEndpoint.java:749)
	at java.lang.Thread.run(Thread.java:750)

From nid=0xb4 we know that the LWP ID inside the container is 180.

Again we checked the corresponding system call with strace:

root@c8273fb5cd8c:/opt/tool# strace -p 180
strace: Process 180 attached
write(2, "07-Jan-2023 18:41:12.100 WARNING"..., 1007

This thread is likewise stuck in a write(2) call, this time to file descriptor 2 (standard error).

We looked at this thread's kernel stack as well:

root@c8273fb5cd8c:/opt/tool# cat /proc/180/stack
[<0>] wait_woken+0x48/0x80
[<0>] n_tty_write+0x3f0/0x4c0
[<0>] tty_write+0x195/0x2a0
[<0>] __vfs_write+0x1b/0x40
[<0>] vfs_write+0xb9/0x1a0
[<0>] ksys_write+0x67/0xe0
[<0>] __x64_sys_write+0x1a/0x20
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This thread appears to be sleeping inside n_tty_write, waiting to be woken up (presumably once there is room in the TTY's output buffer again).
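As an aside, this "writer sleeps until the other end drains the buffer" behaviour is easy to reproduce with an ordinary pipe. A small Go demo (an analogy only; an os.Pipe is not a TTY, but the blocking write(2) looks the same from strace's point of view):

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Create a pipe and deliberately never read from the read end.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()
	defer w.Close()

	go func() {
		chunk := make([]byte, 64*1024)
		total := 0
		for {
			// Blocks for good once the kernel pipe buffer (64 KiB by default
			// on Linux) is full and nobody drains the other end.
			n, err := w.Write(chunk)
			if err != nil {
				return
			}
			total += n
			fmt.Println("written so far:", total)
		}
	}()

	time.Sleep(2 * time.Second)
	// By now the writer goroutine is parked inside write(2); strace on this
	// process would show an unfinished write call, much like LWP 155 and 180.
	fmt.Println("writer is blocked and will not wake up until the read side is drained")
}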

We used docker logs to view the container's logs and found that they were no longer scrolling, so we suspected that dockerd's copier was not doing its work. We dumped the dockerd goroutine stacks and found that both Copier goroutines have been waiting to be woken up for a long time:

goroutine 102986 [sync.Cond.Wait, 12819 minutes]:
sync.runtime_notifyListWait(0xc0010c9c50, 0x0)
	/usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0x0?)
	/usr/local/go/src/sync/cond.go:56 +0x8c
github.com/docker/docker/pkg/ioutils.(*BytesPipe).Read(0xc0010c9c00, {0xc000d9a000, 0x800, 0x4000})
	/go/src/github.com/docker/docker/pkg/ioutils/bytespipe.go:135 +0x8e
github.com/docker/docker/daemon/logger.(*Copier).copySrc(0xc0010c9c80, {0x562b522ac1e4, 0x6}, {0x7f67e829bef0, 0xc0010c9c00})
	/go/src/github.com/docker/docker/daemon/logger/copier.go:86 +0x205
created by github.com/docker/docker/daemon/logger.(*Copier).Run
	/go/src/github.com/docker/docker/daemon/logger/copier.go:48 +0x6c

goroutine 102985 [sync.Cond.Wait, 5657 minutes]:
sync.runtime_notifyListWait(0xc0010c9bd0, 0xc1adb)
	/usr/local/go/src/runtime/sema.go:513 +0x13d
sync.(*Cond).Wait(0xc0017ee5a0?)
	/usr/local/go/src/sync/cond.go:56 +0x8c
github.com/docker/docker/pkg/ioutils.(*BytesPipe).Read(0xc0010c9b80, {0xc000e28ffd, 0x800, 0x1003})
	/go/src/github.com/docker/docker/pkg/ioutils/bytespipe.go:135 +0x8e
github.com/docker/docker/daemon/logger.(*Copier).copySrc(0xc0010c9c80, {0x562b522ac1ea, 0x6}, {0x7f67e829bef0, 0xc0010c9b80})
	/go/src/github.com/docker/docker/daemon/logger/copier.go:86 +0x205
created by github.com/docker/docker/daemon/logger.(*Copier).Run
	/go/src/github.com/docker/docker/daemon/logger/copier.go:48 +0x6c

The dockerd stack dump was captured at 2023-01-11 16:56:05; subtracting 5657 minutes (about 94 hours and 17 minutes) gives 2023-01-07 18:39, which is roughly when the thread with LWP ID 155 got stuck.
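For context, the sync.Cond.Wait in those stacks comes from the copier's source pipe: its Read blocks until someone writes data into the pipe or closes it. Below is a minimal sketch of that pattern (my own illustration, not the actual ioutils.BytesPipe code); a goroutine sitting in a Read like this for thousands of minutes simply means nothing has been written into its pipe for that long:

package main

import (
	"fmt"
	"io"
	"sync"
)

// pipe is a toy stand-in for the source the copier reads from:
// Read parks on a sync.Cond until a writer delivers data or the
// pipe is closed, which is exactly the shape of the goroutines
// above that have been in sync.Cond.Wait for days.
type pipe struct {
	mu     sync.Mutex
	cond   *sync.Cond
	buf    []byte
	closed bool
}

func newPipe() *pipe {
	p := &pipe{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

func (p *pipe) Write(b []byte) (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.buf = append(p.buf, b...)
	p.cond.Broadcast() // wake any blocked reader
	return len(b), nil
}

func (p *pipe) Read(b []byte) (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.buf) == 0 {
		if p.closed {
			return 0, io.EOF
		}
		p.cond.Wait() // the copier goroutines are parked on the equivalent of this line
	}
	n := copy(b, p.buf)
	p.buf = p.buf[n:]
	return n, nil
}

func (p *pipe) Close() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.closed = true
	p.cond.Broadcast()
	return nil
}

func main() {
	p := newPipe()
	go func() { p.Write([]byte("a line from the container\n")) }()
	buf := make([]byte, 64)
	n, _ := p.Read(buf) // would block forever if the writer never ran
	fmt.Print(string(buf[:n]))
}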

I don't know why this happens. What could be the cause of this problem, and how should I investigate further?

Reproduce

Cannot reproduce.

Expected behavior

Write system calls issued by the application should not get stuck.

docker version

Client: Docker Engine - Community
 Version:           20.10.18
 API version:       1.41
 Go version:        go1.18.6
 Git commit:        b40c2f6
 Built:             Thu Sep  8 23:11:45 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.18
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.6
  Git commit:       e42327a
  Built:            Thu Sep  8 23:09:37 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  compose: Docker Compose (Docker Inc., v2.10.2)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 12
  Running: 1
  Paused: 0
  Stopped: 11
 Images: 12
 Server Version: 20.10.18
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-125-generic
 Operating System: Ubuntu 20.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.36GiB
 Name: shared-product-2
 ID: BDEL:L6M6:RU2R:FEZQ:DPKW:SCEN:TTZG:J5CC:FWKY:UM5X:ZBFT:4QK5
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 60
  Goroutines: 59
  System Time: 2023-01-11T16:53:19.272620338+08:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Additional Info

goroutine-stacks-2023-01-11T165605+0800.log

tianshuang · Jan 11 '23 10:01

The write system call in this container is currently still stuck, so we can keep using this container for troubleshooting. This is the second time this problem has occurred recently.

tianshuang · Jan 11 '23 10:01

Could you also check if the host's kernel log contains anything helpful/unusual?

vvoland · Jan 11 '23 11:01

Could you also check if the host's kernel log contains anything helpful/unusual?

There is no related error in the kernel log around that time. Do you have any suggestions for further troubleshooting?

tianshuang · Jan 11 '23 14:01

There are no goroutines blocked waiting on a condition variable inside ioutils.(*BytesPipe).Write, so this isn't a case of that known deadlock bug. There are four goroutines inside fifo.(*fifo).Read, which I assume are the copier goroutines for stdout and stderr for the one running container plus one exec. All four goroutines are blocked in IO wait, so nothing appears amiss at the application level. My best explanation is that either the Go runtime is broken, or the kernel. And it just might be the kernel. (Related Launchpad issue.)

corhere · Jan 17 '23 16:01

Thanks, I will upgrade the kernel version and check if the problem persists.

tianshuang · Jan 18 '23 03:01