
Disaggregated Coordinator - Multiple discovery servers do not seem to sync discovery information

Open ShubhamChaurasia opened this issue 2 years ago • 3 comments

I have been trying out the multi-coordinator setup from https://prestodb.io/blog/2022/04/15/disggregated-coordinator. It works fine with a single instance of the RM (discovery server).

However, while testing with multiple RMs (discovery servers), I observed that the discovery servers were not replicating state to each other.

Consider the following minimal setup:

RM1

http-server.http.port=8080
discovery.uri=http://localhost:8080

RM2 (announces to discovery server in RM1)

http-server.http.port=8081
discovery.uri=http://localhost:8080

Coordinator1 (announces to discovery server in RM1)

discovery.uri=http://localhost:8080

Coordinator2 (announces to discovery server in RM2)

discovery.uri=http://localhost:8081
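For completeness, here is a fuller per-role sketch (the snippets above show only ports and discovery URIs). It assumes the role properties from the disaggregated-coordinator docs, namely coordinator, resource-manager, resource-manager-enabled, discovery-server.enabled, and node-scheduler.include-coordinator; verify the exact names against your Presto version. Coordinator1's port 8090 matches the logs below.

# RM1 config.properties (sketch)
coordinator=false
resource-manager=true
resource-manager-enabled=true
discovery-server.enabled=true
http-server.http.port=8080
discovery.uri=http://localhost:8080

# Coordinator1 config.properties (sketch)
coordinator=true
resource-manager-enabled=true
node-scheduler.include-coordinator=false
http-server.http.port=8090
discovery.uri=http://localhost:8080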

I added some logging in DiscoveryNodeManager#refreshNodesInternal() to periodically select all services of type=discovery and type=presto:

@ServiceType("discovery") ServiceSelector discoveryServiceSelector
// ...
log.info("All known nodes to selector type: %s = %s", discoveryServiceSelector.getType(),
        discoveryServiceSelector.selectAllServices());
log.info("All known nodes to selector type: %s = %s", serviceSelector.getType(),
        serviceSelector.selectAllServices());

In RM1, RM2, and coordinator1 we see the following logs, which show that they can see RM1, RM2, and coordinator1, since they all query the discovery server running in RM1.

2022-09-19T12:03:16.293+0530	INFO	node-state-poller-0	com.facebook.presto.metadata.DiscoveryNodeManager	All known nodes to selector type: discovery = [ServiceDescriptor{id=eab812ac-afa5-492d-8f74-6a7d8fbb2d8c, nodeId=presto-rm1, type=discovery, pool=general, location=/presto-rm1, state=null, properties={http=http://192.168.0.133:8080, http-external=http://192.168.0.133:8080}}, ServiceDescriptor{id=abec555a-2b37-4b52-a666-c6f705dbe7f0, nodeId=presto-rm2, type=discovery, pool=general, location=/presto-rm2, state=null, properties={http=http://192.168.0.133:8081, http-external=http://192.168.0.133:8081}}]
2022-09-19T12:03:16.293+0530	INFO	node-state-poller-0	com.facebook.presto.metadata.DiscoveryNodeManager	All known nodes to selector type: presto = [ServiceDescriptor{id=658764a6-c7cb-4dd9-bdec-c3d7338a6197, nodeId=presto-rm2, type=presto, pool=general, location=/presto-rm2, state=null, properties={node_version=0.276.1-8c89e4f, coordinator=false, resource_manager=true, catalog_server=false, http=http://192.168.0.133:8081, http-external=http://192.168.0.133:8081, connectorIds=system, thriftServerPort=63721}}, ServiceDescriptor{id=7b91fbcc-f742-4d2f-bee2-aedffe34172f, nodeId=presto-rm1, type=presto, pool=general, location=/presto-rm1, state=null, properties={node_version=0.276.1-8c89e4f, coordinator=false, resource_manager=true, catalog_server=false, http=http://192.168.0.133:8080, http-external=http://192.168.0.133:8080, connectorIds=system, thriftServerPort=63688}}, ServiceDescriptor{id=728e847d-5371-46c7-865f-941f40bf1121, nodeId=presto-c1, type=presto, pool=general, location=/presto-c1, state=RUNNING, properties={node_version=0.276.1-8c89e4f, coordinator=true, resource_manager=false, catalog_server=false, http=http://192.168.0.133:8090, http-external=http://192.168.0.133:8090, connectorIds=hive,tpcds,system,jmx, thriftServerPort=63832}}]

However, coordinator2 sees only itself, since it is the only node announcing to the discovery server running in RM2.

2022-09-19T12:03:00.819+0530	INFO	node-state-poller-0	com.facebook.presto.metadata.DiscoveryNodeManager	All known nodes to selector type: discovery = []
2022-09-19T12:03:00.819+0530	INFO	node-state-poller-0	com.facebook.presto.metadata.DiscoveryNodeManager	All known nodes to selector type: presto = [ServiceDescriptor{id=eb96518f-c7e8-439b-8a64-89a9540503d0, nodeId=presto-c2, type=presto, pool=general, location=/presto-c2, state=RUNNING, properties={node_version=0.276.1-8c89e4f, coordinator=true, resource_manager=false, catalog_server=false, http=http://192.168.0.133:8091, http-external=http://192.168.0.133:8091, connectorIds=hive,tpcds,system,jmx, thriftServerPort=63977}}]

As a result, coordinator2 also hits an exception (this is just one symptom; there will be others too):

2022-09-19T12:04:06.836+0530	ERROR	ResourceGroupManager	com.facebook.presto.execution.resourceGroups.InternalResourceGroupManager	Error while executing refreshAndStartQueries
com.facebook.drift.client.UncheckedTTransportException: No hosts available
	at com.facebook.drift.client.DriftInvocationHandler.invoke(DriftInvocationHandler.java:126)
	at com.sun.proxy.$Proxy131.getResourceGroupInfo(Unknown Source)
	at com.facebook.presto.resourcemanager.ResourceManagerResourceGroupService.getResourceGroupInfos(ResourceManagerResourceGroupService.java:85)
	at com.facebook.presto.resourcemanager.ResourceManagerResourceGroupService.lambda$new$0(ResourceManagerResourceGroupService.java:70)
	at com.facebook.presto.resourcemanager.ResourceManagerResourceGroupService.getResourceGroupInfo(ResourceManagerResourceGroupService.java:79)
	at com.facebook.presto.execution.resourceGroups.InternalResourceGroupManager.refreshResourceGroupRuntimeInfo(InternalResourceGroupManager.java:263)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: com.facebook.drift.protocol.TTransportException: No hosts available
	at com.facebook.drift.client.DriftMethodInvocation.fail(DriftMethodInvocation.java:331)
	at com.facebook.drift.client.DriftMethodInvocation.nextAttempt(DriftMethodInvocation.java:161)
	at com.facebook.drift.client.DriftMethodInvocation.createDriftMethodInvocation(DriftMethodInvocation.java:115)
	at com.facebook.drift.client.DriftMethodHandler.invoke(DriftMethodHandler.java:97)
	at com.facebook.drift.client.DriftInvocationHandler.invoke(DriftInvocationHandler.java:94)
	... 12 more
	Suppressed: com.facebook.drift.client.RetriesFailedException: No hosts available (invocationAttempts: 0, duration: 68.41us, failedConnections: 0, overloadedRejects: 0, attemptedAddresses: [])
		at com.facebook.drift.client.DriftMethodInvocation.fail(DriftMethodInvocation.java:337)
		... 16 more

Questions

  1. These discovery servers are supposed to talk to each other and replicate the information; is this understanding correct?
  2. Do we need to set some extra configs, beyond the minimal ones mentioned here, to get replication working? I noticed a config named service-inventory.uri being used by the replicator code in the discovery server; does it play a role here, and what should it point to?
  3. Even if we set discovery.uri to a load balancer URL fronting multiple discovery servers, the deployment may still be inconsistent: the load balancer receives announcement calls from all nodes and may forward calls from some nodes to one discovery server and from other nodes to another, regardless of the load-balancing algorithm. This is non-deterministic and can leave the cluster in a state where some nodes are known to one discovery server while others are known to a different one.
  4. Am I missing something here?

ShubhamChaurasia avatar Sep 19 '22 07:09 ShubhamChaurasia

@rohanpednekar Could you please tag someone who can help with this?

agrawalreetika avatar Sep 20 '22 05:09 agrawalreetika

thanks @agrawalreetika

@swapsmagic could you please check?

ShubhamChaurasia avatar Sep 20 '22 05:09 ShubhamChaurasia

@tdcmeehan do you think you can help here?

rohanpednekar avatar Sep 22 '22 19:09 rohanpednekar

Given the example @ShubhamChaurasia gave, where there is no VIP, maybe multi-RM behaves more like a hashing/load-balancing mechanism, effectively forming two separate groups that share the same coordinator and worker resources?

Forgive me if I'm completely in left field. I just stumbled on Presto and was interested in the RM as a single point of failure, wondering whether it's better to have multiple RMs or whether a single one can be ephemeral and just come up somewhere else.

jackic23 avatar Sep 23 '22 18:09 jackic23

  1. Yes, that is correct: discovery servers talk to each other to share the updated node information they receive.
  2. Yes, you are correct: service-inventory.uri is the missing config that is not documented as part of the minimal configuration (I will go ahead and add that). The current implementation in the discovery server supports two options (see the sketch after this list):
  • file based, where you point it to a static JSON file returning the discovery server information (sample here);
  • URL based, where it points to an endpoint returning the same information.
  3. discovery.uri is supposed to point to the VIP, so if one of the resource managers goes down, another one is picked. And once you have service-inventory.uri set, all resource managers in the cluster will have all node information.
  4. Nothing else is missing.
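To illustrate the file-based option, here is a minimal sketch of a service inventory listing the two RMs from this issue. It assumes the Airlift service-inventory JSON shape (an environment field plus a services array of service descriptors, mirroring the ServiceDescriptor fields in the logs above); the environment value and the file: URI form are assumptions to verify against your discovery-server version.

{
  "environment": "production",
  "services": [
    {
      "id": "eab812ac-afa5-492d-8f74-6a7d8fbb2d8c",
      "nodeId": "presto-rm1",
      "type": "discovery",
      "pool": "general",
      "location": "/presto-rm1",
      "properties": {"http": "http://192.168.0.133:8080"}
    },
    {
      "id": "abec555a-2b37-4b52-a666-c6f705dbe7f0",
      "nodeId": "presto-rm2",
      "type": "discovery",
      "pool": "general",
      "location": "/presto-rm2",
      "properties": {"http": "http://192.168.0.133:8081"}
    }
  ]
}

Each RM would then point at the file (hypothetical path):

service-inventory.uri=file:///etc/presto/service-inventory.json

With the inventory in place, each discovery server knows its peers and the replicator can share announcements between them, rather than relying on every node announcing to the same server.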

Sorry for the late response and for the confusion. Let me know if you need any help with this.

swapsmagic avatar Sep 26 '22 17:09 swapsmagic

thanks for the response @swapsmagic, will try this out.

ShubhamChaurasia avatar Sep 29 '22 12:09 ShubhamChaurasia

@swapsmagic Could you please explain what exactly is missing in the docs about service-inventory.uri? Also, is there a way to set up the URL-based option so that it serves non-static content?

dnskr avatar May 07 '23 11:05 dnskr

I used Helm to install an HA cluster: https://github.com/prestodb/presto-helm-charts/tree/main

Any progress, or has anyone tested this successfully? I have the same problem with 5 worker nodes: with a single resource manager everything is normal, but with 2 resource managers the active worker count on the web UI randomly fluctuates between 0 and 5.

The resource managers log many lines like this:

2023-09-12T08:37:25.879Z INFO node-state-poller-0 com.facebook.presto.metadata.DiscoveryNodeManager Previously active node is missing: f6f5f500-c887-44d3-9104-e24fe5767c4b (last seen at 10.88.29.215)

pycgo avatar Sep 12 '23 08:09 pycgo

@ShubhamChaurasia Have you run it successfully? Could you please share your modified configs? Thanks!

KIRITOLTR avatar Dec 25 '23 02:12 KIRITOLTR

Why can't I see split scheduling information in the UI after deploying the multi-coordinator image?

0ZhangJc0 avatar Jan 12 '24 08:01 0ZhangJc0

I fixed a UI bug earlier; not sure if you encountered it: https://github.com/prestodb/presto/issues/21081

The fix is already in 0.285: https://github.com/prestodb/presto/pull/21088. Just want to make sure the problem you have in the UI is not related to the issue I fixed.

yhwang avatar Jan 16 '24 18:01 yhwang

> I fixed a UI bug earlier; not sure if you encountered it: #21081
>
> The fix is already in 0.285: #21088. Just want to make sure the problem you have in the UI is not related to the issue I fixed.

The version I am currently using is 0.284, and I will try upgrading to 0.285.

0ZhangJc0 avatar Jan 19 '24 09:01 0ZhangJc0