Disaggregated Coordinator - Multiple discovery servers do not seem to sync discovery information
I have been trying out the multi-coordinator setup described in https://prestodb.io/blog/2022/04/15/disggregated-coordinator. It works fine with a single instance of the RM (discovery server).
However, while testing with multiple RMs (discovery servers), I found that the discovery servers were not replicating state among each other.
Consider the following minimal setup (a fuller config sketch follows these snippets):
RM1
http-server.http.port=8080
discovery.uri=http://localhost:8080
RM2 (announces to discovery server in RM1)
http-server.http.port=8081
discovery.uri=http://localhost:8080
Coordinator1 (announces to discovery server in RM1)
discovery.uri=http://localhost:8080
Coordinator2 (announces to discovery server in RM2)
discovery.uri=http://localhost:8081
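For completeness, here is a rough sketch of what the full config.properties for RM1 and Coordinator1 typically contain in this topology, based on the disaggregated-coordinator blog post linked above. The role flags are consistent with the ServiceDescriptor properties in the logs further down; memory and JVM settings are omitted, so treat this as an illustrative sketch rather than a verified configuration.
# RM1 etc/config.properties (sketch; property names from the blog post, values illustrative)
coordinator=false
resource-manager=true
resource-manager-enabled=true
node-scheduler.include-coordinator=false
discovery-server.enabled=true
http-server.http.port=8080
discovery.uri=http://localhost:8080

# Coordinator1 etc/config.properties (sketch)
coordinator=true
resource-manager-enabled=true
node-scheduler.include-coordinator=false
http-server.http.port=8090
discovery.uri=http://localhost:8080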
I added some logging in DiscoveryNodeManager#refreshNodesInternal() to periodically select all services of type=discovery and type=presto:
@ServiceType("discovery") ServiceSelector discoveryServiceSelector
....
....
log.info("All known nodes to selector type: %s = %s", discoveryServiceSelector.getType(),
discoveryServiceSelector.selectAllServices());
log.info("All known nodes to selector type: %s = %s", serviceSelector.getType(),
serviceSelector.selectAllServices());
In RM1, RM2, and coordinator1 we see the following logs, which show that they can all see RM1, RM2, and coordinator1, since all three query the discovery server running in RM1.
2022-09-19T12:03:16.293+0530 INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager All known nodes to selector type: discovery = [ServiceDescriptor{id=eab812ac-afa5-492d-8f74-6a7d8fbb2d8c, nodeId=presto-rm1, type=discovery, pool=general, location=/presto-rm1, state=null, properties={http=http://192.168.0.133:8080, http-external=http://192.168.0.133:8080}}, ServiceDescriptor{id=abec555a-2b37-4b52-a666-c6f705dbe7f0, nodeId=presto-rm2, type=discovery, pool=general, location=/presto-rm2, state=null, properties={http=http://192.168.0.133:8081, http-external=http://192.168.0.133:8081}}]
2022-09-19T12:03:16.293+0530 INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager All known nodes to selector type: presto = [ServiceDescriptor{id=658764a6-c7cb-4dd9-bdec-c3d7338a6197, nodeId=presto-rm2, type=presto, pool=general, location=/presto-rm2, state=null, properties={node_version=0.276.1-8c89e4f, coordinator=false, resource_manager=true, catalog_server=false, http=http://192.168.0.133:8081, http-external=http://192.168.0.133:8081, connectorIds=system, thriftServerPort=63721}}, ServiceDescriptor{id=7b91fbcc-f742-4d2f-bee2-aedffe34172f, nodeId=presto-rm1, type=presto, pool=general, location=/presto-rm1, state=null, properties={node_version=0.276.1-8c89e4f, coordinator=false, resource_manager=true, catalog_server=false, http=http://192.168.0.133:8080, http-external=http://192.168.0.133:8080, connectorIds=system, thriftServerPort=63688}}, ServiceDescriptor{id=728e847d-5371-46c7-865f-941f40bf1121, nodeId=presto-c1, type=presto, pool=general, location=/presto-c1, state=RUNNING, properties={node_version=0.276.1-8c89e4f, coordinator=true, resource_manager=false, catalog_server=false, http=http://192.168.0.133:8090, http-external=http://192.168.0.133:8090, connectorIds=hive,tpcds,system,jmx, thriftServerPort=63832}}]
However, coordinator2 can only see itself and none of the others, since it is the only node announcing to the discovery server running in RM2.
2022-09-19T12:03:00.819+0530 INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager All known nodes to selector type: discovery = []
2022-09-19T12:03:00.819+0530 INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager All known nodes to selector type: presto = [ServiceDescriptor{id=eb96518f-c7e8-439b-8a64-89a9540503d0, nodeId=presto-c2, type=presto, pool=general, location=/presto-c2, state=RUNNING, properties={node_version=0.276.1-8c89e4f, coordinator=true, resource_manager=false, catalog_server=false, http=http://192.168.0.133:8091, http-external=http://192.168.0.133:8091, connectorIds=hive,tpcds,system,jmx, thriftServerPort=63977}}]
As a result, we also see an exception in coordinator2 (this is just one symptom; there will be others too):
2022-09-19T12:04:06.836+0530 ERROR ResourceGroupManager com.facebook.presto.execution.resourceGroups.InternalResourceGroupManager Error while executing refreshAndStartQueries
com.facebook.drift.client.UncheckedTTransportException: No hosts available
at com.facebook.drift.client.DriftInvocationHandler.invoke(DriftInvocationHandler.java:126)
at com.sun.proxy.$Proxy131.getResourceGroupInfo(Unknown Source)
at com.facebook.presto.resourcemanager.ResourceManagerResourceGroupService.getResourceGroupInfos(ResourceManagerResourceGroupService.java:85)
at com.facebook.presto.resourcemanager.ResourceManagerResourceGroupService.lambda$new$0(ResourceManagerResourceGroupService.java:70)
at com.facebook.presto.resourcemanager.ResourceManagerResourceGroupService.getResourceGroupInfo(ResourceManagerResourceGroupService.java:79)
at com.facebook.presto.execution.resourceGroups.InternalResourceGroupManager.refreshResourceGroupRuntimeInfo(InternalResourceGroupManager.java:263)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.facebook.drift.protocol.TTransportException: No hosts available
at com.facebook.drift.client.DriftMethodInvocation.fail(DriftMethodInvocation.java:331)
at com.facebook.drift.client.DriftMethodInvocation.nextAttempt(DriftMethodInvocation.java:161)
at com.facebook.drift.client.DriftMethodInvocation.createDriftMethodInvocation(DriftMethodInvocation.java:115)
at com.facebook.drift.client.DriftMethodHandler.invoke(DriftMethodHandler.java:97)
at com.facebook.drift.client.DriftInvocationHandler.invoke(DriftInvocationHandler.java:94)
... 12 more
Suppressed: com.facebook.drift.client.RetriesFailedException: No hosts available (invocationAttempts: 0, duration: 68.41us, failedConnections: 0, overloadedRejects: 0, attemptedAddresses: [])
at com.facebook.drift.client.DriftMethodInvocation.fail(DriftMethodInvocation.java:337)
... 16 more
Questions
- These discovery servers are supposed to talk to each other and replicate the information. Is this a correct understanding?
- Do we need to set some extra configs, apart from the minimal ones mentioned here, to get the replication working? I saw a config named service-inventory.uri being used in the replicator code of the discovery server; does it play a role here, and what does it need to point to?
- Even if we set discovery.uri to a load-balancer URL fronting multiple discovery servers, the deployment may still end up inconsistent: the load balancer receives announcement calls from all the nodes and may forward calls from some nodes to one discovery server and from other nodes to another (regardless of the load-balancing algorithm). This is non-deterministic and might leave the cluster in a state where some nodes are known to one discovery server while others are known to a different one.
- Am I missing something here?
@rohanpednekar Could you please tag someone who can help with this?
thanks @agrawalreetika
@swapsmagic could you please check ?
@tdcmeehan do you think you can help here?
Given the example @ShubhamChaurasia gave, where there is no VIP, maybe multi-RM behaves more like a hashing/load-balancing mechanism, effectively acting as two separate groups that share the same coordinator and worker resources?
Forgive me if I'm completely out in left field. I just stumbled on Presto and was interested in the RM as a single point of failure, wondering whether it's better to have multiple RMs or whether a single RM can be ephemeral and just come up somewhere else.
- Yes, that is correct: discovery servers talk to each other to share the updated node information that they receive.
- Yes, you are correct: service-inventory.uri is the missing config that is not documented as part of the minimal configuration (will go ahead and add that). The current implementation in the discovery server supports two options:
  - file based, where you point it to a static JSON file returning the discovery server information (sample here; see the sketch after this list).
  - URL based, where it points to an endpoint returning the same information.
- discovery.uri is supposed to point to the VIP, so if one of the resource managers goes down, another one is picked up. And once you have service-inventory.uri set, all resource managers in the cluster will have all node information. Nothing else is missing.
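For illustration, here is a minimal sketch of what such a static service-inventory JSON file might look like for the two-RM setup above. The field names mirror the ServiceDescriptor fields visible in the logs earlier in this thread (id, nodeId, type, pool, location, properties); the exact schema, the environment value, and the file path below are assumptions rather than a verified reference.
{
  "environment": "production",
  "services": [
    {
      "id": "eab812ac-afa5-492d-8f74-6a7d8fbb2d8c",
      "nodeId": "presto-rm1",
      "type": "discovery",
      "pool": "general",
      "location": "/presto-rm1",
      "properties": {"http": "http://192.168.0.133:8080"}
    },
    {
      "id": "abec555a-2b37-4b52-a666-c6f705dbe7f0",
      "nodeId": "presto-rm2",
      "type": "discovery",
      "pool": "general",
      "location": "/presto-rm2",
      "properties": {"http": "http://192.168.0.133:8081"}
    }
  ]
}
Each resource manager would then point to it with something like service-inventory.uri=file:///etc/presto/service-inventory.json (path illustrative), or to an HTTP endpoint returning the same JSON for the URL-based option.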
Sorry for the late response and for the confusion. Let me know if you need any help with this.
thanks for the response @swapsmagic, will try this out.
@swapsmagic Could you please explain what exactly is missing in the docs about service-inventory.uri? Also, is there any way to set up the URL-based option so that the content is non-static?
I use Helm to install an HA cluster: https://github.com/prestodb/presto-helm-charts/tree/main
Any progress, or has anyone tested this successfully? I have the same problem. I have 5 worker nodes; with only one resource manager everything is normal, but with 2 resource managers the active worker count on the web UI randomly fluctuates between 0 and 5.
The resource managers log many lines like this:
2023-09-12T08:37:25.879Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: f6f5f500-c887-44d3-9104-e24fe5767c4b (last seen at 10.88.29.215)
@ShubhamChaurasia Have you run it successfully? Could you please share your modified configs? Thanks!
Why can't I see split scheduling information in the UI after deploying multiple coordinators?
I fixed a UI bug earlier; not sure if you encountered that issue: https://github.com/prestodb/presto/issues/21081. The fix is already in 0.285: https://github.com/prestodb/presto/pull/21088. Just want to make sure the problem you see in the UI is not related to the issue I fixed.
The version I am currently using is 0.284, and I will try upgrading to 0.285.