libcluster
DNS Poll - max stable Cluster Size = max DNS Entry Response Count
Steps to reproduce
- Configuration Used
config :libcluster,
  debug: true,
  topologies: [
    dns: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        poll_interval: 10_000,
        query: "appname.something",
        node_basename: "some-container"
      ]
    ]
  ]
- Strategy Used
Cluster.Strategy.DNSPoll
- Errors/Incorrect Behaviour Encountered
The maximum stable cluster size is capped at the number of results returned in a single DNS response.
Description of issue
- What are the expected results? With a DNS query, I would not expect nodes to be removed just because they are missing from the DNS response. I would expect to rely on the disconnect that occurs when a node times out via net_ticktime, rather than having nodes actively removed. For example, if you have 15 nodes and DNS replies with 5 random node IPs, the cluster will become unstable.
- Is the documentation incorrect? The documentation does not mention that nodes will be removed when they are no longer in DNS. It just says: "this strategy will periodically poll DNS and connect all nodes it finds."
- Should we introduce a config flag to turn off removing nodes?
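
To make the 15-node example concrete, here is a minimal sketch of what one poll cycle does to the member list. It is an illustration only, not libcluster's actual code: the `PruneSketch` module, its `diff/3` function, the `:prune` option, and the node names are all hypothetical.

```elixir
defmodule PruneSketch do
  @moduledoc """
  Hypothetical illustration of one poll cycle; not libcluster's implementation.
  """

  # Returns {to_connect, to_disconnect} given the currently connected nodes
  # and the nodes resolved from DNS. The :prune option is an assumption used
  # to show the behaviour asked about in this issue.
  def diff(connected, resolved, opts \\ []) do
    prune? = Keyword.get(opts, :prune, true)
    connected = MapSet.new(connected)
    resolved = MapSet.new(resolved)

    # Nodes seen in DNS that we are not yet connected to.
    to_connect = MapSet.difference(resolved, connected)

    # Nodes we are connected to that DNS did not return this time.
    to_disconnect =
      if prune?, do: MapSet.difference(connected, resolved), else: MapSet.new()

    {Enum.sort(to_connect), Enum.sort(to_disconnect)}
  end
end

# 15 nodes are connected, but this DNS response only contains 5 of them.
connected = for i <- 1..15, do: :"some-container@10.0.0.#{i}"
resolved = Enum.take(connected, 5)

# With pruning, the 10 nodes missing from this response are dropped.
{_, removed} = PruneSketch.diff(connected, resolved)
IO.inspect(length(removed), label: "removed with pruning")

# Without pruning, nothing is dropped; only net_ticktime would remove a node.
{_, kept} = PruneSketch.diff(connected, resolved, prune: false)
IO.inspect(kept, label: "removed without pruning")
```

Because each response can contain a different subset of 5, a pruning cycle like this keeps disconnecting healthy nodes and reconnecting others, which is the instability described above.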
I'd be open to accepting a PR that makes removing nodes in this strategy optional based on a flag, something like prune: false to disable pruning the node list. I believe there was a reason we actively prune nodes when the source of data for the strategy (e.g. DNS in this case, but it could be any system providing service discovery) no longer reports a node as being part of the cluster. I can't recall the specifics at the moment, but it was a deliberate choice. libcluster largely defers to the source registry to tell us which nodes belong in the cluster. In the case of DNS, it is unusual for a node to disappear from DNS unless it is being permanently removed, but I can imagine scenarios where this might happen, such as under k8s or some other orchestrator that uses DNS for service discovery.
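
For reference, this is roughly where such a flag could sit in the reporter's configuration. The `prune` key is only the proposal from this thread and is an assumption here, not a documented libcluster option; the other keys are taken from the config above.

```elixir
import Config

config :libcluster,
  topologies: [
    dns: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        poll_interval: 10_000,
        query: "appname.something",
        node_basename: "some-container",
        # Hypothetical option proposed in this thread: keep nodes that are
        # missing from a DNS response instead of disconnecting them.
        prune: false
      ]
    ]
  ]
```

Defaulting the flag to true would presumably preserve the current behaviour for existing users.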