Make Cluster Manager correctly recognize newly added nodes
The Helix-based cluster manager does not dynamically recognize newly added nodes. It assumes that InstanceConfig change notifications arrive only for nodes it already knows about. When such a notification arrives for a newly added node, the cluster manager hits an NPE, because it dereferences the value returned by the hash map of existing nodes, which it expects to always be non-null.
At a high level, the way to handle this within onInstanceConfigNotification() is to look up the instance in question in the hash map, and if it is absent, then add it as a new entry in much the same way as instance additions are done during initialization.
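A minimal sketch of that lookup-then-add handling, for illustration only. The class, the `NodeState` type, the shape of the notification payload (instance name mapped to sealed replica names) and the helper methods are all hypothetical and are not taken from the Ambry code base; only the method name `onInstanceConfigNotification` comes from the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; these types and names are assumptions, not Ambry's.
class InstanceConfigHandlerSketch {
  /** Illustrative stand-in for the per-node state the cluster manager tracks. */
  static final class NodeState {
    final String instanceName;
    volatile List<String> sealedReplicas;

    NodeState(String instanceName, List<String> sealedReplicas) {
      this.instanceName = instanceName;
      this.sealedReplicas = sealedReplicas;
    }
  }

  // Map of already known nodes, keyed by instance name.
  private final Map<String, NodeState> knownNodes = new ConcurrentHashMap<>();

  /** Handles a notification that maps each instance name to its sealed replicas. */
  void onInstanceConfigNotification(Map<String, List<String>> sealedReplicasByInstance) {
    for (Map.Entry<String, List<String>> entry : sealedReplicasByInstance.entrySet()) {
      NodeState existing = knownNodes.get(entry.getKey());
      if (existing == null) {
        // Unknown instance: add it as a new entry instead of dereferencing null.
        knownNodes.put(entry.getKey(), new NodeState(entry.getKey(), entry.getValue()));
      } else {
        // Known instance: apply the update in place.
        existing.sealedReplicas = entry.getValue();
      }
    }
  }

  int knownNodeCount() {
    return knownNodes.size();
  }

  List<String> sealedReplicasOf(String instanceName) {
    NodeState node = knownNodes.get(instanceName);
    return node == null ? null : node.sealedReplicas;
  }

  public static void main(String[] args) {
    InstanceConfigHandlerSketch handler = new InstanceConfigHandlerSketch();
    // First notification is for a node the handler has never seen.
    handler.onInstanceConfigNotification(
        Collections.singletonMap("localhost_17088", Collections.<String>emptyList()));
    // Second notification updates the same node.
    handler.onInstanceConfigNotification(
        Collections.singletonMap("localhost_17088", Arrays.asList("partition-1")));
    System.out.println(handler.knownNodeCount() + " node(s) known");
  }
}
```

In the real cluster manager, the add path would presumably reuse whatever routine builds nodes, disks and replicas during initialization, so that both paths stay consistent.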
This is the NPE currently hit when a new node is added:
2017/11/30 21:43:24.903 ERROR [ZKExceptionHandler] [ZkClient-EventThread-56-zk-<host>:<port>] [ambry-frontend] [] exception in handling data-change. path: /Ambry-prod/CONFIGS/PARTICIPANT/<host>_<port>, listener: com.github.ambry.clustermap.HelixClusterManager$ClusterChangeHandler@2aee99b6
java.lang.NullPointerException
I see exceptions after running HelixBootstrapUpgradeTool on new static files.
[2017-12-20 15:31:59,172] ERROR exception in handling child-change. instance: localhost_20088, parentPath: /Ambry-Proto/CONFIGS/PARTICIPANT, listener: com.github.ambry.clustermap.HelixClusterManager$ClusterChangeHandler@6b4a2c09 (org.apache.helix.manager.zk.ZKExceptionHandler)
java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
    at com.github.ambry.clustermap.HelixClusterManager$ClusterChangeHandler.updateSealedStateOfReplicas(HelixClusterManager.java:368)
    at com.github.ambry.clustermap.HelixClusterManager$ClusterChangeHandler.onInstanceConfigChange(HelixClusterManager.java:332)
    at org.apache.helix.manager.zk.CallbackHandler.invoke(CallbackHandler.java:214)
    at org.apache.helix.manager.zk.CallbackHandler.enqueueTask(CallbackHandler.java:177)
    at org.apache.helix.manager.zk.CallbackHandler.handleChildChange(CallbackHandler.java:431)
    at org.I0Itec.zkclient.ZkClient$8.run(ZkClient.java:772)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
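One detail in this trace may help narrow things down: the top frame is `ConcurrentHashMap.get` itself. That method returns null for a missing key but throws an NPE when the key passed in is null, so the failure here is probably a null key reaching the map rather than a null return value being dereferenced afterwards. The sketch below only demonstrates that JDK behavior; the class and method names are hypothetical, and it says nothing about where the null key originates in `updateSealedStateOfReplicas`.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class; it only exercises standard JDK behavior.
class NullKeySketch {
  /** Returns true if ConcurrentHashMap.get(null) throws a NullPointerException. */
  static boolean getThrowsOnNullKey() {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    try {
      map.get(null); // ConcurrentHashMap does not permit null keys
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  /** Returns true if looking up a key that is absent simply yields null. */
  static boolean getReturnsNullForMissingKey() {
    ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
    return map.get("absent") == null;
  }

  public static void main(String[] args) {
    System.out.println("NPE on null key: " + getThrowsOnNullKey());
    System.out.println("null for missing key: " + getReturnsNullForMissingKey());
  }
}
```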
@zzmao is this related to this issue? If not, could you raise it as a separate issue and describe in more detail what you did, including the command that you ran?
It's easy to remove this NPE and add the new node to the map. However, I am not able to add the new node's replicas.
The notification that the other nodes receive contains no replica information:
localhost_17088, {HELIX_HOST=localhost, HELIX_PORT=17088, datacenter=dc1, rackId=1611}{/tmp/c={Replicas=, capacityInBytes=912680550402, diskState=AVAILABLE}}{SEALED=[]}
This is odd, because when I check ZooKeeper, the replica information is there.
I talked with @vgkholla. He said that the Helix integration in Ambry currently supports only nodes going down and coming back up. It doesn't support adding new nodes.
I need to go through the Helix code to propose a way to add new nodes on both the frontend and backend servers.