alloy icon indicating copy to clipboard operation
alloy copied to clipboard

Fix DNS SRV port not used in cluster.join-addresses

Open devsunb opened this issue 1 year ago • 1 comments

PR Description

Fixed an issue where DNS SRV record's port is not used in cluster.join-addresses.

Which issue(s) this PR fixes

Notes to the Reviewer

PR Checklist

  • [x] CHANGELOG.md updated
  • [x] Documentation added

devsunb avatar May 04 '24 12:05 devsunb

Thanks for opening this!

@tpaschalis @wildum can one of you take a look at this one? You know more about this code than I do.

rfratto avatar May 14 '24 16:05 rfratto

Hey @devsunb, how's it going? First off, apologies for the belated response here, it shouldn't have taken so long for me to get back to you here.

Could you explain more about the intentions of this PR? As I said in the previous comment, I feel this would depend on DNS SRV records that are explicitly defined for the usecase of Alloy clustering and may break usecases where a more generic record is used for discovery.

tpaschalis avatar May 27 '24 10:05 tpaschalis

@tpaschalis Thank you for your response. Let me give you some background on this PR.

I am using AWS ECS and AWS Service Discovery for internal DNS between containers. One of the issues with ECS and Service Discovery is that only one Service Discovery can be configured for one ECS Service, so it was impossible to configure SRV records for multiple ports in one Service, or to configure SRV record and A record together. Therefore, I was configuring services using only A record. (For example, for running an etcd cluster, I created an init script that looks up the IP list of running containers based on A record and passes it as option).

In this context, I wanted to cluster Alloy using A record, but when I passed just the domain name with the --cluster.join-addresses option, it didn't work, so I checked the documentation but I couldn't find any information about A record support, so I checked the code.

I found that the appendJoinAddr function in the internal/alloycli/cluster_builder.go file behaves as follows for values of the --cluster.join-addresses option

  1. If port is specified -> Use as is as peer
  2. If port is not specified
    1. If IP -> Use <IP>:<Default Port> as peer
    2. If Domain Name -> Use <SRV Record Targets>:<Default Port> as peer

At first, I thought it was because Alloy didn't define behavior for A record in 2-ii, so I wrote code for A record support in that part and it worked fine, so I was going to create a PR, but when I looked a little further into the code, I found that hashicorp/memberlist (which Alloy uses for peer clustering) already has code for A record support. https://github.com/hashicorp/memberlist/blob/3f82dc10a89f82efe300228752f7077d0d9f87e4/memberlist.go#L417

So I reverted the change and instead specified the port while passing the domain name as --cluster.join-addresses option. Then, Alloy will use the domain name as peer and look up the A record in hashicorp/memberlist, so I expected it to work fine, and it did. I thought this option structure made sense because when using A record, unlike SRV record, the port is not in the record, so it should be specified.

And I decided to change the direction of the PR and make it a PR to document the specifics of the --cluster.join-addresses option's support for A record.

However, there was one question that remained, in 2-ii, since the SRV record contains the port, why is it using the default port instead of using it, so I fixed it and included it in the PR. Now I understand from your response that the port in the SRV record will not always be correct, so it makes sense to keep using the default port.

To summarize, there are the following cases when using the --cluster.join-addresses option for clustering Alloy.

  1. <IP> -> Used as <IP>:<Default Port>
  2. <IP>:<Port> -> Used as <IP>:<Port>
  3. <Domain Name> -> Used as <SRV Record Targets>:<Default Port>
  4. <Domain Name>:<Port> -> Used as <A Record IPs>:<Port>

Since it seems to be documented by https://github.com/grafana/alloy/pull/910 that Alloy does not support case 4, why not include in the documentation that there is a way to use A record for Alloy clustering as well, for people in similar situations to mine?

In any case, I closed this PR because I understood why the main request won't merge.

devsunb avatar May 28 '24 15:05 devsunb