Payara icon indicating copy to clipboard operation
Payara copied to clipboard

Bug Report: payara micro hzinitialjoinwait option has no effect/FISH-7372

Open maxencelaurent opened this issue 2 years ago • 1 comments

Brief Summary

Setting the hazelcast.wait.seconds.before.join property has no effect (either with --hzinitialjoinwait X or -Dhazelcast.wait.seconds.before.join=X)

The value eventually used by hazelcast is always 1.

Expected Outcome

The value used by hazelcast should be the one provided on the command line.

Current Outcome

Hazelcast always use a hardcoded value of 1

Reproducer

Reproduce

Start a single payara-micro instance. Set clusterMode to dns and set hzinitialjoinwait to 30

java -jar ./payara-micro-6.2023.4.jar --clusterMode dns:localhost:6900 --hzinitialjoinwait 30

When the instance tries to join a hazelcast cluster, it should try for 30 seconds before creating a new one. This is not the case: it just tries for 1 second.

In hazelcast, the piece of code that is waiting is DiscoveryJoiner lines 66-82

The 30 seconds delay should be visible in the logs:

[...]

[2023-04-21T10:16:53.399+0200] [] [INFO] [] [com.hazelcast.system.logo] [tid: _ThreadID=25 _ThreadName=RunLevelControllerThread-1682065012770] [timeMillis: 1682065013399] [levelValue: 800] [[
  [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 [development] [5.2.2] 
	+       +  o    o     o     o---o o----o o      o---o     o     o----o o--o--o
	+ +   + +  |    |    / \       /  |      |     /         / \    |         |   
	+ + + + +  o----o   o   o     o   o----o |    o         o   o   o----o    |   
	+ +   + +  |    |  /     \   /    |      |     \       /     \       |    |   
	+       +  o    o o       o o---o o----o o----o o---o o       o o----o    o   ]]

[...]

[2023-04-21T10:16:53.909+0200] [] [INFO] [] [com.hazelcast.core.LifecycleService] [tid: _ThreadID=25 _ThreadName=RunLevelControllerThread-1682065012770] [timeMillis: 1682065013909] [levelValue: 800] [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 [development] [5.2.2] [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 is STARTING


[ !!! expect a 30 seconds delay before next log statement !!! ]


[2023-04-21T10:16:55.058+0200] [] [INFO] [] [com.hazelcast.internal.cluster.ClusterService] [tid: _ThreadID=25 _ThreadName=RunLevelControllerThread-1682065012770] [timeMillis: 1682065015058] [levelValue: 800] [[
  [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 [development] [5.2.2] 

Members {size:1, ver:1} [
	Member [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 - e97b06d8-9ffb-411f-bc6a-aed91ece2f41 this
]
]]

[...]

Details

Payara micro bootstrap

First, fish.payara.micro.impl.PayaraMicroImpl#configureHazelcast ensures the system property hazelcast.wait.seconds.before.join is set. If this is not the case, the property is initialized according to the hzinitialjoinwait option. If no option was provided, property is set to "1". This step looks fine.

Hazelcast bootstrap

Later, fish.payara.nucleus.hazelcast.HazelcastCore#buildConfiguration creates the hazelcast config object. On line 394 WAIT_SECONDS_BEFORE_JOIN property is set to "1", hiding the previously set system property.

Hot fix test

Start and debug a single payara micro instance:

java -Xdebug -Xrunjdwp:transport=dt_socket,address=9009,server=y,suspend=y -jar ./payara-micro-6.2023.4.jar --clusterMode dns:localhost:6900 --hzinitialjoinwait 30

Break on HazelcastCore:396

Revert the previous statement by executing config.getProperties().remove(WAIT_SECONDS_BEFORE_JOIN.getName())

Continue code execution : instance will try to join an existing cluster for 30 seconds.

[...]
[2023-04-21T10:18:34.766+0200] [] [INFO] [] [com.hazelcast.system.logo] [tid: _ThreadID=26 _ThreadName=RunLevelControllerThread-1682065109743] [timeMillis: 1682065114766] [levelValue: 800] [[
  [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 [development] [5.2.2] 
	+       +  o    o     o     o---o o----o o      o---o     o     o----o o--o--o
	+ +   + +  |    |    / \       /  |      |     /         / \    |         |   
	+ + + + +  o----o   o   o     o   o----o |    o         o   o   o----o    |   
	+ +   + +  |    |  /     \   /    |      |     \       /     \       |    |   
	+       +  o    o o       o o---o o----o o----o o---o o       o o----o    o   ]]

[...]

[2023-04-21T10:18:35.301+0200] [] [INFO] [] [com.hazelcast.core.LifecycleService] [tid: _ThreadID=26 _ThreadName=RunLevelControllerThread-1682065109743] [timeMillis: 1682065115301] [levelValue: 800] [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 [development] [5.2.2] [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 is STARTING



[ ! 30 seconds delay is here ! ]



[2023-04-21T10:19:05.474+0200] [] [INFO] [] [com.hazelcast.internal.cluster.ClusterService] [tid: _ThreadID=26 _ThreadName=RunLevelControllerThread-1682065109743] [timeMillis: 1682065145474] [levelValue: 800] [[
  [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 [development] [5.2.2] 

Members {size:1, ver:1} [
	Member [fe80:0:0:0:9829:46ff:fef2:3daa]:6900 - c48b0915-c77a-466a-90d8-77f723c44353 this
]
]]

Solution

It seems the reason the line in HazelcastCore exists is to prevent setting the property to "0".

                // can't quite set it to zero yet because of:
                // https://github.com/hazelcast/hazelcast/issues/17586
                config.setProperty(WAIT_SECONDS_BEFORE_JOIN.getName(), "1");

In the scope of payara-micro, this is useless as payara-micro-bootstrap already defaults the property to "1". I do not know what happen with other payara flavours.

One solution would be to set the property only to prevent the use of "0" with something like

String initialJoinWaitProperty = System.getProperty("hazelcast.wait.seconds.before.join")

if (initialJoinWaitProperty == null || Integer.parseInt(initialJoinWaitProperty) < 1) {
    config.setProperty(WAIT_SECONDS_BEFORE_JOIN.getName(), "1");
}

Operating System

Linux

JDK Version

OpenJDK 11

Payara Distribution

Payara Micro

maxencelaurent avatar Apr 21 '23 09:04 maxencelaurent

Hello @maxencelaurent!

Thank you for reporting this bug and for providing such a detailed reproducer, analysis and possible solution! I have raised an internal issue with the internal code FISH-7372, in which I related the information on this issue. As you already found a possible solution, you could send a PR which is going to be reviewed by a member of our team. This would greatly speed up the resolution of this issue.

felixif avatar May 04 '23 14:05 felixif