maven-mvnd icon indicating copy to clipboard operation
maven-mvnd copied to clipboard

Build crashes due to timeout or deamon not responding in time

Open deradam opened this issue 4 years ago • 30 comments

Hi there,

when building a rather large project (395 maven modules), mvnd runs for few minutes and then radomly "crashes" with the following error.

Exception in thread "main" org.jboss.fuse.mvnd.common.DaemonException$StaleAddressException: Could not receive a message from the daemon.
	at org.jboss.fuse.mvnd.client.DaemonClientConnection.receive(DaemonClientConnection.java:107)
	at org.jboss.fuse.mvnd.client.DefaultClient.execute(DefaultClient.java:198)
	at org.jboss.fuse.mvnd.client.DefaultClient.main(DefaultClient.java:72)
Caused by: java.io.IOException: No message received within 3000ms, daemon may have crashed
	at org.jboss.fuse.mvnd.client.DaemonClientConnection.receive(DaemonClientConnection.java:100)
	... 2 more

However if I check the running processes, I can still see mvnd building the project in the background. Is there a setting for the timeout to the deamon process?

Thanks for the nice tool and your support!

deradam avatar Oct 27 '20 08:10 deradam

Could you please check the logs in $MVND_HOME/daemon?

ppalaga avatar Oct 27 '20 08:10 ppalaga

@ppalaga thanks for the quick reply. I check the logs and see the following.

10:24:36.388 E Error dispatching events
org.jboss.fuse.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message BuildEvent{projectId='backend-core', type=MojoStarted, display=':backend-core:org.apache.maven.plugins:maven-resources-plugin:2.6:resources {execution: default-resources}'} to '/127.0.0.1:62278'.
	at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
	at org.jboss.fuse.mvnd.daemon.Server.lambda$handle$2(Server.java:428)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
	... 2 common frames omitted

deradam avatar Oct 27 '20 09:10 deradam

What's your operating system? Is it always reproducible?

ppalaga avatar Oct 27 '20 10:10 ppalaga

➜  ~ mvnd -version
Maven Daemon 0.0.10-darwin-amd64 (native)
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /usr/local/Cellar/mvnd/0.0.10/libexec/mvn
Java version: 1.8.0_265, vendor: AdoptOpenJDK, runtime: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "mac os x", version: "10.15.7", arch: "x86_64", family: "mac"

Funny enough, 10 minutes later ->

10:34:40.119 I Build finished, finishing message dispatch
10:34:40.120 I Daemon back to idle
10:34:40.120 I Updating state to: Idle
10:34:42.366 I Waiting for request
10:34:42.366 D daemon is running. Sleeping until state changes.

And yes, it is reproducible.

deradam avatar Oct 27 '20 11:10 deradam

So the daemon finishes building the request, the problem is only that the client disconnects before it is finished?

ppalaga avatar Oct 27 '20 12:10 ppalaga

So the daemon finishes building the request, the problem is only that the client disconnects before it is finished?

Yes, I think that's correct. The client disconnects because it has not received any message (including the keep-alive ones) for 3 seconds. That's a new behaviour for the keep-alive. Two properties can be used to raise the keep alive timeout (daemon.keepAliveMs) and the multiplier after which the client disconnects (daemon.maxLostKeepAlive). You may want to raise them, but I'm not sure how this would change anything. I suspect something wrong happened in the dispatch loop of the daemon. Can you check if you have anything in the daemon log after the last "Dispatch message" event.

gnodet avatar Oct 27 '20 12:10 gnodet

Okay, indeed it is running if I add the following parameter -Ddaemon.keepAliveMs=10000. I guess I can add it somewhere "globally" so I do not need to add this with each and every call?

deradam avatar Oct 27 '20 17:10 deradam

I've been able to reproduce the problem, but in my case, it was a follow-up of #155 : when building the java doc, the daemon try to send a big message, fails with an exception, thus the client fail 3s later.

gnodet avatar Oct 28 '20 12:10 gnodet

@deradam the daemon.keepAliveMs also effects the client refresh rate in case there are no other messages, so you may prefer keeping daemon.keepAliveMs=1000 and raise daemon.maxLostKeepAlive=10.

The global properties location is ~/.m2/mvnd.properties.

gnodet avatar Oct 28 '20 14:10 gnodet

@deradam If that's easily reproductible, would you mind checking if you have an error in the daemon log prior to the one you pasted above (which is the result of the client dropping the connection) ? In my case, I had the one in #155, so I wonder if that's the same for you.

gnodet avatar Oct 28 '20 14:10 gnodet

Yes, I'm on it. However I am fighting with the overall build now :/

deradam avatar Oct 29 '20 08:10 deradam

Hi @gnodet,

so this is the error that I am always getting when building the project.

10:01:18.565 I Dispatch message: KeepAliveMessage{}
10:01:18.566 I Dispatch message: BuildEvent{projectId='backend-core', type=MojoStopped, display=':backend-core:org.apache.cxf:cxf-codegen-plugin:3.2.8:wsdl2java {execution: generate-sources-france-routing-ws}'}
10:01:18.567 E Error dispatching events
org.jboss.fuse.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message BuildEvent{projectId='backend-core', type=MojoStopped, display=':backend-core:org.apache.cxf:cxf-codegen-plugin:3.2.8:wsdl2java {execution: generate-sources-routing-ws}'} to '/127.0.0.1:59959'.
	at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
	at org.jboss.fuse.mvnd.daemon.Server.lambda$handle$2(Server.java:428)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Broken pipe
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
	at org.jboss.fuse.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.jboss.fuse.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
	... 2 common frames omitted

As you can see, it "breaks" when maven is calling wsdl2java which seem to run for quite some time. I don't know if this helps you in any way. At least I can overcome this with the properties proposed above.

Thanks for your support!

deradam avatar Nov 02 '20 09:11 deradam

hi together,

unfortunately I have the same problem. I have already increased the values for the keepAliveMs and maxLostKeepAlive:

daemon.keepAliveMs=10000 daemon.maxLostKeepAlive=10

Problem occurs in multi-thread as well as with only one thread mode... I have also already tried to increase the heap... without success so far!

Following system infos and the error log:

Terminal: org.jline.terminal.impl.PosixSysTerminal with pty org.jline.terminal.impl.jansi.linux.LinuxNativePty
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /home/6024/.sdkman/candidates/mvnd/0.4.1/mvn
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /home/ubuntu_entwshare/develop/java/jdk1.8.0_191/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-55-generic", arch: "amd64", family: "unix"
11:41:05.076 I Dispatch message: KeepAlive
11:41:05.176 I Dispatch message: KeepAlive
11:41:05.276 I Dispatch message: KeepAlive
11:41:05.376 I Dispatch message: KeepAlive
11:41:05.477 I Dispatch message: KeepAlive
11:41:05.577 I Dispatch message: KeepAlive
11:41:05.677 I Dispatch message: KeepAlive
11:41:05.777 I Dispatch message: KeepAlive
11:41:05.877 I Dispatch message: KeepAlive
11:41:05.978 I Dispatch message: KeepAlive
11:41:06.244 I Dispatch message: KeepAlive
11:41:06.345 I Dispatch message: KeepAlive
11:41:06.445 I Dispatch message: KeepAlive
11:41:06.545 I Dispatch message: KeepAlive
11:41:11.179 I Dispatch message: KeepAlive
11:41:11.280 I Dispatch message: KeepAlive
11:41:11.280 E Error dispatching events
org.mvndaemon.mvnd.common.DaemonException$RecoverableMessageIOException: Could not write message KeepAlive to '/127.0.0.1:59906'.
        at org.mvndaemon.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:116)
        at org.mvndaemon.mvnd.daemon.Server.lambda$handle$3(Server.java:460)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Datenübergabe unterbrochen (broken pipe)
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.writeWithNonBlockingRetry(DaemonConnection.java:275)
        at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.writeBufferToChannel(DaemonConnection.java:263)
        at org.mvndaemon.mvnd.common.DaemonConnection$SocketOutputStream.flush(DaemonConnection.java:257)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.mvndaemon.mvnd.common.DaemonConnection.dispatch(DaemonConnection.java:113)
        ... 2 common frames omitted

Do you have any ideas?

DoPaNik avatar Jan 26 '21 11:01 DoPaNik

Any updates?

DoPaNik avatar Feb 22 '21 08:02 DoPaNik

@DoPaNik seems to be like https://github.com/mvndaemon/mvnd/issues/291 Tweaking the props should be right strategy, but the ones you mention are either not properly spelled or have weird values. As step one, I'd recommend increasing mvnd.maxLostKeepAlive from its default 30 to 60, 120, etc.

daemon.keepAliveMs does not exist in mvnd 0.4.1, the correct name is mvnd.keepAlive.

There is a pause of 3+seconds between the following two messages in your log:

11:41:06.545 I Dispatch message: KeepAlive
11:41:11.179 I Dispatch message: KeepAlive

Given your config, tt should be around 100ms.

It would be interesting to figure what's the cause. You may want to attach jconsole or similar to the daemon process to see whether it is GC or something else.

ppalaga avatar Feb 22 '21 08:02 ppalaga

I've also hit the problem a few times under heavy cpu load. I wonder if we should slightly raise a bit the default values.

gnodet avatar Feb 22 '21 08:02 gnodet

I think we should try to figure out first where the bottle neck is. Is that overall CPU contention? Or is that client threads blocking each other? Daemon threads blocking each other? Not all of those potential issues would necessarily be solved by increasing mvnd.maxLostKeepAlive

ppalaga avatar Feb 22 '21 08:02 ppalaga

I think we should try to figure out first where the bottle neck is. Is that overall CPU contention? Or is that client threads blocking each other? Daemon threads blocking each other? Not all of those potential issues would necessarily be solved by increasing mvnd.maxLostKeepAlive

This is most probably something like a big GC pause under heavy CPU usage. I really don't think this comes from blocking threads. The threading model is not that complicated and this would happen more often I think.

In addition, if there's a deadlock, raising the timeout will definitely not help. So the opposite is true: raising the timeout will remove possible problems with GC/high CPU usage why not hiding other problems, so it looks quite safe to do it anyway...

gnodet avatar Feb 22 '21 09:02 gnodet

Thank you for the tips. Now I have set the mvnd.properties to the following values:

mvnd.keepAlive=10000
mvnd.maxLostKeepAlive=120

This looks much better at first glance. But the build breaks off after about 4 minutes with the following error:

[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  04:29 min (Wall Clock)
[INFO] Finished at: 2021-02-22T11:13:00+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project base: Compilation failure
[ERROR] An unknown compilation problem occurred
[ERROR] 
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.1:compile (default-compile) on project base: Compilation failure
An unknown compilation problem occurred

 

        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:215)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
        at org.mvndaemon.mvnd.builder.SmartBuilderImpl.buildProject(SmartBuilderImpl.java:178)
        at org.mvndaemon.mvnd.builder.SmartBuilderImpl$ProjectBuildTask.run(SmartBuilderImpl.java:198)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.maven.plugin.compiler.CompilationFailureException: Compilation failure
An unknown compilation problem occurred

 

        at org.apache.maven.plugin.compiler.AbstractCompilerMojo.execute(AbstractCompilerMojo.java:1220)
        at org.apache.maven.plugin.compiler.CompilerMojo.execute(CompilerMojo.java:187)
        at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
        ... 10 common frames omitted
[ERROR] 
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :base

Looks like a heap size issue to me. Unfortunately, I can't connect to the jvm via jconsole at the moment. I will simply increase the heap size.

Is there a hack to start with the jconsole? I simply have tried it via jconsole UI to connect via the local connection dialog.

DoPaNik avatar Feb 22 '21 10:02 DoPaNik

Looks like a heap size issue to me.

Have you tried increasing the heap via mvnd.maxHeapSize ?

I simply have tried it via jconsole UI to connect via the local connection dialog.

That's what I usually do. You should see the daemon process in the list. You may want to start the daemon eagerly via mvnd --version.

ppalaga avatar Feb 22 '21 10:02 ppalaga

This is most probably something like a big GC pause under heavy CPU usage.

Yes, I also see it as the main suspect :) But it would still be nice to get a witness who sees a GC happening at that time via some monitoring tool.

By increasing the defaults we just move the threshold for the client disconnect, but we do not help much those poor devs whose build is still bigger than that new threshold. Maybe we should try harder on the client side to figure out what is happening with the daemon. If the process is still up, we could perhaps prolong the grace period, but output some sort of a warning that something unusual may be going on. Also we should give a hint which particular config options should be tweaked if the user thinks the situation is normal.

And finally, if somebody really sees GC as the root cause, it would be extremely interesting to see whether using shenandoah (or any other low pause gc) would help.

ppalaga avatar Feb 22 '21 10:02 ppalaga

Have you tried increasing the heap via mvnd.maxHeapSize ?

Yep. I have set the heap to 8G, Which should be enough. Unfortunately still the same error. I just started again with 16 - that should be way too much, but then we can exclude this as problem

DoPaNik avatar Feb 22 '21 11:02 DoPaNik

Yes, I also see it as the main suspect :) But it would still be nice to get a witness who sees a GC happening at that time via some monitoring tool.

Of course. I try to collect data and then post them.

DoPaNik avatar Feb 22 '21 11:02 DoPaNik

Fun fact: With 16G started build is still running ... daemon: 1b1999da threads used/hidden/max: 1/0/11 progress: 6/16 37% time: 82:02

A normal maven build takes about 30 min.

I am trying to provide data via jconsole as soon as possible

DoPaNik avatar Feb 22 '21 12:02 DoPaNik

I also wonder if the messages ordering on the server side isn't problematic: https://github.com/mvndaemon/mvnd/blob/5fa1f3bd2f4a80245e06562078946d2af8703c66/daemon/src/main/java/org/mvndaemon/mvnd/daemon/Server.java#L596-L597 The keep alive messages are low priority, so it could happen that they are almost never sent to the client. The ordering could also be the problem for https://github.com/mvndaemon/mvnd/issues/362

gnodet avatar Feb 22 '21 19:02 gnodet

@ppalaga as promised, albeit a bit late, here is the output from jconsole: jconsole_cpu.txt jconsole_heap.txt jconsole_thread.txt Bildschirmfoto vom 2021-03-16 10-49-00

After I set the heap to 8 gb, the build runs without errors. But apparently mvnd had not done anything for a long time and waited in the middle of the run....

DoPaNik avatar Mar 16 '21 10:03 DoPaNik

Raise the keep-alive timeout seems to no effects now. https://github.com/lanmaoxinqing/mvnd/runs/2902075523?check_suite_focus=true#step:7:1100

lanmaoxinqing avatar Jun 24 '21 06:06 lanmaoxinqing

Looks like a heap size issue to me.

Have you tried increasing the heap via mvnd.maxHeapSize ?

@ppalaga i run into a timeout issue upon today too. In my case it was a memory issue. Would it be possible for mvnd to detect memory pressure and report it? It took me some time to connect timeout reported by the CLI side with mvnd memory settings.

FWIW, in Trino, we use https://github.com/airlift/airlift/blob/7e987a2c7f55c1703b9ea383daa32c657d29b7dd/stats/src/main/java/io/airlift/stats/JmxGcMonitor.java#L77 to report GC pauses.

findepi avatar Jan 18 '22 11:01 findepi

Looks like a heap size issue to me.

Have you tried increasing the heap via mvnd.maxHeapSize ?

@ppalaga i run into a timeout issue upon today too. In my case it was a memory issue. Would it be possible for mvnd to detect memory pressure and report it? It took me some time to connect timeout reported by the CLI side with mvnd memory settings.

It's already supposed to : https://github.com/apache/maven-mvnd/blob/master/daemon/src/main/java/org/mvndaemon/mvnd/daemon/DaemonExpiration.java It may be broken somehow...

FWIW, in Trino, we use https://github.com/airlift/airlift/blob/7e987a2c7f55c1703b9ea383daa32c657d29b7dd/stats/src/main/java/io/airlift/stats/JmxGcMonitor.java#L77 to report GC pauses.

gnodet avatar Jan 18 '22 12:01 gnodet

Is this still an issue?

delanym avatar Feb 13 '23 09:02 delanym