Broker cannot build the resource map if any server has an invalid config
After a server with an invalid configuration joins the cluster, e.g. one like the following:
{ "id": "<some_id>", "simpleFields": { "HELIX_HOST": "<some_host>", "HELIX_PORT": "" }, "mapFields": {}, "listFields": {} }
the brokers (even those in other tenants) cannot build the brokerResource correctly, and the entire cluster can no longer serve any queries; they fail with a "410 BrokerResourceMissingError".
The following error may appear in the broker log:
```
java.lang.NullPointerException: Cannot invoke "java.util.Set.contains(Object)" because "this._enabledInstances" is null
    at o.a.p.b.r.i.BaseInstanceSelector.getEnabledCandidatesAndAddToServingInstances(BaseInstanceSelector.java:338)
    at o.a.p.b.r.i.BaseInstanceSelector.refreshSegmentStates(BaseInstanceSelector.java:294)
    at o.a.p.b.r.i.BaseInstanceSelector.init(BaseInstanceSelector.java:117)
    at o.a.p.b.r.i.BalancedInstanceSelector.init(BalancedInstanceSelector.java:50)
    at o.a.p.b.r.BrokerRoutingManager.buildRouting(BrokerRoutingManager.java:450)
    at o.a.p.b.b.h.BrokerResourceOnlineOfflineStateModelFactory$BrokerResourceOnlineOfflineStateModel.onBecomeOnlineFromOffline(BrokerResourceOnlineOfflineStateModelFactory.java:80)
    at j.i.r.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.lang.reflect.Method.invoke(Method.java:580)
    at o.a.h.m.h.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:350)
    at o.a.h.m.h.HelixStateTransitionHandler.h...
```
which indicates that `this._enabledInstances` was never initialized in `BaseInstanceSelector`. This Set is initialized in the following method:
```java
@Override
public void init(Set<String> enabledInstances, IdealState idealState, ExternalView externalView,
    Set<String> onlineSegments) {
  _enabledInstances = enabledInstances;
  Map<String, Long> newSegmentCreationTimeMap =
      getNewSegmentCreationTimeMapFromZK(idealState, externalView, onlineSegments);
  updateSegmentMaps(idealState, externalView, onlineSegments, newSegmentCreationTimeMap);
  refreshSegmentStates();
}
```
This method is indirectly called from `BrokerRoutingManager.processInstanceConfigChange()`.
If any exception is raised in that method before `_routableServers = enabledServers;` executes, the exception may be caught upstream, leaving the Set as null.
Taking the server config above as an example, the broker hits the following exception
```
java.lang.NumberFormatException: For input string: ""
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.base/java.lang.Integer.parseInt(Integer.java:662)
    at java.base/java.lang.Integer.parseInt(Integer.java:770)
    at org.apache.pinot.core.transport.ServerInstance.<init>(ServerInstance.java:63)
    at org.apache.pinot.broker.routing.BrokerRoutingManager.processInstanceConfigChange(BrokerRoutingManager.java:245)
    at org.apache.pinot.broker.routing.BrokerRoutingManager.processClusterChange(BrokerRoutingManager.java:133)
    at org.apache.pinot.broker.broker.helix.ClusterChangeMediator.processClusterChange(ClusterChangeMediator.java:134)
    at org.apache.pinot.broker.broker.helix.ClusterChangeMediator.lambda$new$0(ClusterChangeMediator.java:96)
    at java.base/java.lang.Thread.run(Thread.java:829)
```
and returns early, skipping the rest of the config processing.
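To make the failure mode concrete, below is a minimal, self-contained sketch of the pattern (the class and method names are illustrative, not Pinot's actual code): an exception thrown before the field assignment aborts the update, the field stays null, and the next read of it raises exactly this kind of NullPointerException.

```java
import java.util.Set;

// Illustrative sketch of the failure pattern described above; these names
// are hypothetical and do not mirror Pinot's real classes.
public class NullFieldHazard {
  private Set<String> _enabledInstances; // stays null if the update aborts early

  void processConfigChange(Set<String> instances, String helixPort) {
    int port = Integer.parseInt(helixPort); // throws NumberFormatException for ""
    _enabledInstances = instances;          // never reached for the bad config
  }

  boolean isEnabled(String instance) {
    return _enabledInstances.contains(instance); // NPE if the update aborted
  }

  public static void main(String[] args) {
    NullFieldHazard selector = new NullFieldHazard();
    try {
      selector.processConfigChange(Set.of("Server_1"), "");
    } catch (NumberFormatException e) {
      System.out.println("Config change aborted: " + e.getMessage());
    }
    selector.isEnabled("Server_1"); // throws NullPointerException here
  }
}
```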
Such behavior is dangerous for a distributed system: a single misconfigured participant instance brings down the entire cluster. The following items may be required:
- When server configs are updated through the controller API, sanity checks and enforcement should be in place.
- During broker initialization, safely create every object that must not be null in the constructor, and when constructing the mapping across servers, isolate bad configs so the remaining good candidates keep functioning (see the sketch after this list).
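For the second item, here is a minimal sketch of isolating bad configs, assuming the broker iterates over Helix `InstanceConfig` objects roughly the way `processInstanceConfigChange` does (the surrounding class and method are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.helix.model.InstanceConfig;

// Hypothetical sketch: one malformed instance config is skipped and logged
// instead of aborting the whole routing update.
public class SafeInstanceMapBuilder {

  public Map<String, Integer> buildServerPortMap(Iterable<InstanceConfig> instanceConfigs) {
    Map<String, Integer> serverPorts = new HashMap<>();
    for (InstanceConfig instanceConfig : instanceConfigs) {
      try {
        // Integer.parseInt throws NumberFormatException for "" or garbage,
        // mirroring the failure inside ServerInstance's constructor.
        int port = Integer.parseInt(instanceConfig.getPort());
        serverPorts.put(instanceConfig.getInstanceName(), port);
      } catch (RuntimeException e) {
        // Isolate the bad config so the healthy servers keep working.
        System.err.println("Skipping invalid config for instance "
            + instanceConfig.getInstanceName() + ": " + e);
      }
    }
    return serverPorts;
  }
}
```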
Did you run into this problem after updating the server config via the controller API, or by directly modifying Zookeeper? It seems like the ZNode became invalid JSON, which caused the problem. It is very hard to make the cluster handle an invalid ZNode; e.g. if one ideal state is invalid JSON, the whole Helix framework will get stuck.
If the modification went through the regular controller API, then we should add a check to ensure the payload is valid JSON.
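As a sketch of what that check could look like, the payload could be validated with Jackson before the controller persists it (where exactly this hook would live in the controller is an assumption):

```java
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical pre-write validation: reject a payload that does not parse
// as JSON before it ever reaches the ZNode.
public class ZNodePayloadValidator {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static boolean isValidJson(String payload) {
    try {
      MAPPER.readTree(payload);
      return true;
    } catch (Exception e) {
      return false;
    }
  }
}
```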
Thanks for the reply. I think the issue in this case is not invalid JSON but a **NumberFormatException**: the port value is a String that cannot be converted to an int. We can ignore such a badly configured host, but we cannot accept it bringing down the entire cluster. One fix I can initiate is to try-catch the error when the broker builds the instance selector map.
Can you double confirm the content of the bad config? The version from the description,

```json
{
  "id": "<some_id>",
  "simpleFields": {
    "HELIX_HOST": "<some_host>",
    "HELIX_PORT": "",
  },
  "mapFields": {},
  "listFields": {}
}
```

is invalid JSON (note the trailing comma after `"HELIX_PORT": ""`).
Thanks for pointing that out. That was my typo; there is no "," after `"HELIX_PORT": ""` in the actual ZNode (the screenshot at the time showed the real content). It is valid JSON.
The analysis is correct. I think we should add a try-catch when processing each server instance config and skip the config if it is invalid. Can you help make a fix?
Yes, I will have a PR next week.