maven-mvnd
maven-mvnd copied to clipboard
Build crashes due to timeout or deamon not responding in time
Hi there,
when building a rather large project (395 maven modules), mvnd runs for few minutes and then radomly "crashes" with the following error.
Exception in thread "main" org.jboss.fuse.mvnd.common.DaemonException$StaleAddressException: Could not receive a message from the daemon.
at org.jboss.fuse.mvnd.client.DaemonClientConnection.receive(DaemonClientConnection.java:107)
at org.jboss.fuse.mvnd.client.DefaultClient.execute(DefaultClient.java:198)
at org.jboss.fuse.mvnd.client.DefaultClient.main(DefaultClient.java:72)
Caused by: java.io.IOException: No message received within 3000ms, daemon may have crashed
at org.jboss.fuse.mvnd.client.DaemonClientConnection.receive(DaemonClientConnection.java:100)
... 2 more
However if I check the running processes, I can still see mvnd building the project in the background. Is there a setting for the timeout to the deamon process?
Thanks for the nice tool and your support!
Could you please check the logs in $MVND_HOME/daemon
?
@ppalaga thanks for the quick reply. I check the logs and see the following.
10:24:36.388 E Error dispatching events
org.jboss.fuse.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message BuildEvent{projectId='backend-core', type=MojoStarted, display=':backend-core:org.apache.maven.plugins:maven-resources-plugin:2.6:resources {execution: default-resources}'} to '/127.0.0.1:62278'.
at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
at org.jboss.fuse.mvnd.daemon.Server.lambda$handle$2(Server.java:428)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
... 2 common frames omitted
What's your operating system? Is it always reproducible?
➜ ~ mvnd -version
Maven Daemon 0.0.10-darwin-amd64 (native)
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /usr/local/Cellar/mvnd/0.0.10/libexec/mvn
Java version: 1.8.0_265, vendor: AdoptOpenJDK, runtime: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "mac os x", version: "10.15.7", arch: "x86_64", family: "mac"
Funny enough, 10 minutes later ->
10:34:40.119 I Build finished, finishing message dispatch
10:34:40.120 I Daemon back to idle
10:34:40.120 I Updating state to: Idle
10:34:42.366 I Waiting for request
10:34:42.366 D daemon is running. Sleeping until state changes.
And yes, it is reproducible.
So the daemon finishes building the request, the problem is only that the client disconnects before it is finished?
So the daemon finishes building the request, the problem is only that the client disconnects before it is finished?
Yes, I think that's correct. The client disconnects because it has not received any message (including the keep-alive ones) for 3 seconds. That's a new behaviour for the keep-alive.
Two properties can be used to raise the keep alive timeout (daemon.keepAliveMs
) and the multiplier after which the client disconnects (daemon.maxLostKeepAlive
). You may want to raise them, but I'm not sure how this would change anything.
I suspect something wrong happened in the dispatch loop of the daemon.
Can you check if you have anything in the daemon log after the last "Dispatch message" event.
Okay, indeed it is running if I add the following parameter -Ddaemon.keepAliveMs=10000
. I guess I can add it somewhere "globally" so I do not need to add this with each and every call?
I've been able to reproduce the problem, but in my case, it was a follow-up of #155 : when building the java doc, the daemon try to send a big message, fails with an exception, thus the client fail 3s later.
@deradam the daemon.keepAliveMs
also effects the client refresh rate in case there are no other messages, so you may prefer keeping daemon.keepAliveMs=1000
and raise daemon.maxLostKeepAlive=10
.
The global properties location is ~/.m2/mvnd.properties
.
@deradam If that's easily reproductible, would you mind checking if you have an error in the daemon log prior to the one you pasted above (which is the result of the client dropping the connection) ? In my case, I had the one in #155, so I wonder if that's the same for you.
Yes, I'm on it. However I am fighting with the overall build now :/
Hi @gnodet,
so this is the error that I am always getting when building the project.
10:01:18.565 I Dispatch message: KeepAliveMessage{}
10:01:18.566 I Dispatch message: BuildEvent{projectId='backend-core', type=MojoStopped, display=':backend-core:org.apache.cxf:cxf-codegen-plugin:3.2.8:wsdl2java {execution: generate-sources-france-routing-ws}'}
10:01:18.567 E Error dispatching events
org.jboss.fuse.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message BuildEvent{projectId='backend-core', type=MojoStopped, display=':backend-core:org.apache.cxf:cxf-codegen-plugin:3.2.8:wsdl2java {execution: generate-sources-routing-ws}'} to '/127.0.0.1:59959'.
at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
at org.jboss.fuse.mvnd.daemon.Server.lambda$handle$2(Server.java:428)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
... 2 common frames omitted
As you can see, it "breaks" when maven is calling wsdl2java
which seem to run for quite some time. I don't know if this helps you in any way. At least I can overcome this with the properties proposed above.
Thanks for your support!
hi together,
unfortunately I have the same problem. I have already increased the values for the keepAliveMs and maxLostKeepAlive:
daemon.keepAliveMs=10000
daemon.maxLostKeepAlive=10
Problem occurs in multi-thread as well as with only one thread mode... I have also already tried to increase the heap... without success so far!
Following system infos and the error log:
Terminal: org.jline.terminal.impl.PosixSysTerminal with pty org.jline.terminal.impl.jansi.linux.LinuxNativePty
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /home/6024/.sdkman/candidates/mvnd/0.4.1/mvn
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /home/ubuntu_entwshare/develop/java/jdk1.8.0_191/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-55-generic", arch: "amd64", family: "unix"
11:41:05.076 I Dispatch message: KeepAlive
11:41:05.176 I Dispatch message: KeepAlive
11:41:05.276 I Dispatch message: KeepAlive
11:41:05.376 I Dispatch message: KeepAlive
11:41:05.477 I Dispatch message: KeepAlive
11:41:05.577 I Dispatch message: KeepAlive
11:41:05.677 I Dispatch message: KeepAlive
11:41:05.777 I Dispatch message: KeepAlive
11:41:05.877 I Dispatch message: KeepAlive
11:41:05.978 I Dispatch message: KeepAlive
11:41:06.244 I Dispatch message: KeepAlive
11:41:06.345 I Dispatch message: KeepAlive
11:41:06.445 I Dispatch message: KeepAlive
11:41:06.545 I Dispatch message: KeepAlive
11:41:11.179 I Dispatch message: KeepAlive
11:41:11.280 I Dispatch message: KeepAlive
11:41:11.280 E Error dispatching events
org.mvndaemon.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message KeepAlive to '/127.0.0.1:59906'.
at org.mvndaemon.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
at org.mvndaemon.mvnd.daemon.Server.lambda$handle$3(Server.java:460)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Datenübergabe unterbrochen (broken pipe)
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.mvndaemon.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
... 2 common frames omitted
Do you have any ideas?
Any updates?
@DoPaNik seems to be like https://github.com/mvndaemon/mvnd/issues/291 Tweaking the props should be right strategy, but the ones you mention are either not properly spelled or have weird values. As step one, I'd recommend increasing mvnd.maxLostKeepAlive
from its default 30
to 60
, 120
, etc.
daemon.keepAliveMs
does not exist in mvnd 0.4.1, the correct name is mvnd.keepAlive
.
There is a pause of 3+seconds between the following two messages in your log:
11:41:06.545 I Dispatch message: KeepAlive
11:41:11.179 I Dispatch message: KeepAlive
Given your config, tt should be around 100ms.
It would be interesting to figure what's the cause. You may want to attach jconsole or similar to the daemon process to see whether it is GC or something else.
I've also hit the problem a few times under heavy cpu load. I wonder if we should slightly raise a bit the default values.
I think we should try to figure out first where the bottle neck is. Is that overall CPU contention? Or is that client threads blocking each other? Daemon threads blocking each other? Not all of those potential issues would necessarily be solved by increasing mvnd.maxLostKeepAlive
I think we should try to figure out first where the bottle neck is. Is that overall CPU contention? Or is that client threads blocking each other? Daemon threads blocking each other? Not all of those potential issues would necessarily be solved by increasing mvnd.maxLostKeepAlive
This is most probably something like a big GC pause under heavy CPU usage. I really don't think this comes from blocking threads. The threading model is not that complicated and this would happen more often I think.
In addition, if there's a deadlock, raising the timeout will definitely not help. So the opposite is true: raising the timeout will remove possible problems with GC/high CPU usage why not hiding other problems, so it looks quite safe to do it anyway...
Thank you for the tips. Now I have set the mvnd.properties to the following values:
mvnd.keepAlive=10000
mvnd.maxLostKeepAlive=120
This looks much better at first glance. But the build breaks off after about 4 minutes with the following error:
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:29 min (Wall Clock)
[INFO] Finished at: 2021-02-22T11:13:00+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project base: Compilation failure
[ERROR] An unknown compilation problem occurred
[ERROR]
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project base: Compilation failure
An unknown compilation problem occurred
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:215)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
at org.mvndaemon.mvnd.builder.SmartBuilderImpl.buildProject(SmartBuilderImpl.java:178)
at org.mvndaemon.mvnd.builder.SmartBuilderImpl$ProjectBuildTask.run(SmartBuilderImpl.java:198)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.maven.plugin.compiler.CompilationFailureException: Compilation failure
An unknown compilation problem occurred
at org.apache.maven.plugin.compiler.AbstractCompilerMojo.execute(AbstractCompilerMojo.java:1220)
at org.apache.maven.plugin.compiler.CompilerMojo.execute(CompilerMojo.java:187)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
... 10 common frames omitted
[ERROR]
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <args> -rf :base
Looks like a heap size issue to me. Unfortunately, I can't connect to the jvm via jconsole at the moment. I will simply increase the heap size.
Is there a hack to start with the jconsole? I simply have tried it via jconsole UI to connect via the local connection dialog.
Looks like a heap size issue to me.
Have you tried increasing the heap via mvnd.maxHeapSize ?
I simply have tried it via jconsole UI to connect via the local connection dialog.
That's what I usually do. You should see the daemon process in the list. You may want to start the daemon eagerly via mvnd --version
.
This is most probably something like a big GC pause under heavy CPU usage.
Yes, I also see it as the main suspect :) But it would still be nice to get a witness who sees a GC happening at that time via some monitoring tool.
By increasing the defaults we just move the threshold for the client disconnect, but we do not help much those poor devs whose build is still bigger than that new threshold. Maybe we should try harder on the client side to figure out what is happening with the daemon. If the process is still up, we could perhaps prolong the grace period, but output some sort of a warning that something unusual may be going on. Also we should give a hint which particular config options should be tweaked if the user thinks the situation is normal.
And finally, if somebody really sees GC as the root cause, it would be extremely interesting to see whether using shenandoah (or any other low pause gc) would help.
Have you tried increasing the heap via mvnd.maxHeapSize ?
Yep. I have set the heap to 8G, Which should be enough. Unfortunately still the same error. I just started again with 16 - that should be way too much, but then we can exclude this as problem
Yes, I also see it as the main suspect :) But it would still be nice to get a witness who sees a GC happening at that time via some monitoring tool.
Of course. I try to collect data and then post them.
Fun fact: With 16G started build is still running ...
daemon: 1b1999da threads used/hidden/max: 1/0/11 progress: 6/16 37% time: 82:02
A normal maven build takes about 30 min.
I am trying to provide data via jconsole as soon as possible
I also wonder if the messages ordering on the server side isn't problematic: https://github.com/mvndaemon/mvnd/blob/5fa1f3bd2f4a80245e06562078946d2af8703c66/daemon/src/main/java/org/mvndaemon/mvnd/daemon/Server.java#L596-L597 The keep alive messages are low priority, so it could happen that they are almost never sent to the client. The ordering could also be the problem for https://github.com/mvndaemon/mvnd/issues/362
@ppalaga as promised, albeit a bit late, here is the output from jconsole:
jconsole_cpu.txt
jconsole_heap.txt
jconsole_thread.txt
After I set the heap to 8 gb, the build runs without errors. But apparently mvnd had not done anything for a long time and waited in the middle of the run....
Raise the keep-alive timeout seems to no effects now. https://github.com/lanmaoxinqing/mvnd/runs/2902075523?check_suite_focus=true#step:7:1100
Looks like a heap size issue to me.
Have you tried increasing the heap via mvnd.maxHeapSize ?
@ppalaga i run into a timeout issue upon today too. In my case it was a memory issue. Would it be possible for mvnd to detect memory pressure and report it? It took me some time to connect timeout reported by the CLI side with mvnd memory settings.
FWIW, in Trino, we use https://github.com/airlift/airlift/blob/7e987a2c7f55c1703b9ea383daa32c657d29b7dd/stats/src/main/java/io/airlift/stats/JmxGcMonitor.java#L77 to report GC pauses.
Looks like a heap size issue to me.
Have you tried increasing the heap via mvnd.maxHeapSize ?
@ppalaga i run into a timeout issue upon today too. In my case it was a memory issue. Would it be possible for mvnd to detect memory pressure and report it? It took me some time to connect timeout reported by the CLI side with mvnd memory settings.
It's already supposed to : https://github.com/apache/maven-mvnd/blob/master/daemon/src/main/java/org/mvndaemon/mvnd/daemon/DaemonExpiration.java It may be broken somehow...
FWIW, in Trino, we use https://github.com/airlift/airlift/blob/7e987a2c7f55c1703b9ea383daa32c657d29b7dd/stats/src/main/java/io/airlift/stats/JmxGcMonitor.java#L77 to report GC pauses.
Is this still an issue?