
How to use Prometheus EC2 SD with Multiple AWS EC2 interfaces

Open silazare opened this issue 4 years ago • 15 comments

Need some help with Prometheus EC2 SD use case.

I have multiple VPCs connected by VPC peering, and EC2 instances with several private interfaces and no public interfaces or public access.

Setup example:

  • VPC1 - EC2 instance which needs to be monitored
  • VPC2 - Prometheus server

EC2 IPs (order is preserved):

  • 192.168.1.10 (does not have routing outside VPC1)
  • 192.168.2.10 (does not have outside routing, but is peered with VPC2)

Prometheus IP:

  • 192.168.2.9 (VPC2)

Prometheus simple config with ec2_sd and node_exporter:

    - job_name: 'test'
      ec2_sd_configs:
        - region: <region_name>
          port: 9100
      relabel_configs:
        - source_labels: [__meta_ec2_tag_Name]
          regex: 'test-instance'
          action: keep

As a result, the EC2 instance is discovered, but __address__ and __meta_ec2_private_ip always take the first IP in the order above (192.168.1.10), so Prometheus cannot reach the target and it is reported as down.
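
Until discovery exposes the addresses of the additional interfaces, one possible workaround is to record the reachable IP in an instance tag and set __address__ from that tag. This is only a sketch under assumptions: the tag key `scrape_ip` is hypothetical (it would have to be set on the instance, e.g. to 192.168.2.10), and the region value is a placeholder.

```yaml
scrape_configs:
  - job_name: 'test'
    ec2_sd_configs:
      - region: eu-west-1   # placeholder; use your own region
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        regex: 'test-instance'
        action: keep
      # "scrape_ip" is a hypothetical tag holding the IP reachable from VPC2.
      # Instances without the tag keep the default <private_ip>:9100 address.
      - source_labels: [__meta_ec2_tag_scrape_ip]
        regex: '(.+)'
        target_label: __address__
        replacement: '${1}:9100'
```

The drawback is that the tag has to be kept in sync with the interface address by whatever provisions the instance.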

Proposal

Use case: Monitor EC2 instances with multiple interfaces, spread across peered VPCs, from a single Prometheus server

silazare avatar Jun 16 '20 11:06 silazare

Not sure how we could do it, as the EC2 API only returns one address: https://docs.aws.amazon.com/sdk-for-go/api/service/ec2/#Instance

roidelapluie avatar Jun 16 '20 11:06 roidelapluie

This sounds related to #7086

brian-brazil avatar Jun 16 '20 11:06 brian-brazil

We could use InstanceNetworkInterface at the expense of extra API calls

roidelapluie avatar Jun 16 '20 11:06 roidelapluie

@brian-brazil @roidelapluie Thanks for the feedback. As far as I can see, #7086 is related to the IPv6 label only. Does that mean we need to add an additional API call?

silazare avatar Jun 16 '20 16:06 silazare

It'd seem odd if this needed an extra API call, nothing else has for EC2.

brian-brazil avatar Jun 16 '20 16:06 brian-brazil

Sorry, I didn't completely get it. What if the EC2 instance has no IPv6 address assigned and only has multiple private IPv4 addresses?

silazare avatar Jun 17 '20 08:06 silazare

The place that that PR is getting the v6 addresses from also has v4 addresses. So by figuring out how to do the v6, we can also do the v4 now that we've a use case to figure out the right way to transform that data.

brian-brazil avatar Jun 17 '20 11:06 brian-brazil

This is the case with gce_sd_configs as well; I tested it in GCP. It's a blocker for using Prometheus service discovery in clouds with multiple NICs where the first NIC is not the management NIC. This is probably the same issue with azure_sd_configs too.

Should I raise multiple issues for gce_sd_configs, azure_sd_configs? Is there any tentative plan to get a fix for it?

meghdivya avatar May 12 '21 11:05 meghdivya

Can we have an option to pass in the index of the interface we want to grab the IP for? E.g. https://github.com/prometheus/prometheus/blob/main/discovery/gce/gce.go#L174

jfreeland avatar Jun 15 '21 22:06 jfreeland

@jfreeland Although a name is more intuitive, being able to configure an index seems the easiest way. It just needs error handling for the index and support for all cloud offerings.

meghdivya avatar Jun 16 '21 02:06 meghdivya

Better than passing an index or name, we can include all interfaces as metadata.

__meta_gce_interface_eth0_ip
__meta_gce_interface_eth1_ip

Then you can dynamically select which IP to use based on other metadata, rather than needing to hard-code which interface to use in different discovery configs.
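
As a rough sketch of how that selection could look, assuming the label names proposed in this comment (they are the proposal, not necessarily what any Prometheus release exposes) and node_exporter on port 9100:

```yaml
relabel_configs:
  # Proposed label name from this comment; treat it as hypothetical.
  # If eth1 has an address, scrape it; otherwise the default address is kept.
  - source_labels: [__meta_gce_interface_eth1_ip]
    regex: '(.+)'
    target_label: __address__
    replacement: '${1}:9100'
```

Because the rule only fires when the label is non-empty, the same job can cover instances with and without a second interface.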

SuperQ avatar Jun 16 '21 05:06 SuperQ

I don't want to peer the VPCs, so I restricted the metrics to be available only to the Prometheus instance by using a security group, with this Ansible script: https://github.com/AkashSivakumar/ansible-add-rule-to-all-security-group-aws.

I then edited prometheus.yml to scrape using the public IP instead of the private IP.

- job_name: 'test-ec2-sd'
  relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_public_ip]
        replacement: ${1}:9100
        target_label: __address__

With this config, Prometheus will use the public IP instead of the private IP as the scraping endpoint.
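
For completeness, a fuller version of that job might look like the sketch below. The `ec2_sd_configs` block and its region are assumptions (the snippet above omits them), and the extra `keep` rule is an addition that drops instances without a public IP, which would otherwise end up with the address ":9100".

```yaml
scrape_configs:
  - job_name: 'test-ec2-sd'
    ec2_sd_configs:
      - region: eu-west-1   # placeholder; not part of the original snippet
        port: 9100
    relabel_configs:
      # Keep only instances that actually have a public IP.
      - source_labels: [__meta_ec2_public_ip]
        regex: '.+'
        action: keep
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      # The default regex (.*) captures the whole public IP as ${1}.
      - source_labels: [__meta_ec2_public_ip]
        target_label: __address__
        replacement: '${1}:9100'
```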

AkashSivakumar avatar Dec 26 '21 16:12 AkashSivakumar

@roidelapluie @SuperQ @brian-brazil @silazare @meghdivya Hey guys, I think there is a bug in Prometheus that stops service discovery when filtering on port. Implementation steps:

Install Node Exporter on server1 and tag it with 9100 as per Prometheus service discovery reference.

Tag server2 in the AWS EC2 instance with a key = port and value = 8080. Do not install anything on server2.

Set up Prometheus to understand that there are two servers running, and it needs to scrape metrics 9100 from server1 and 8080 from server2.

I have exhausted all available resources on the internet, including official documentation, AI, Stack Overflow, and GitHub issues, in an attempt to solve our use case. None of the solutions have proven effective. To troubleshoot the issue, I broke down the YAML and tried variations by trial and error. The configuration file I have been working with is the default Prometheus configuration file located at /etc/prometheus/prometheus.yml. Below is the YAML that I arrived at through my experimentation.

Attempt -1

    - job_name: 'myec2'
      ec2_sd_configs:
        - access_key:
          secret_key:
          region: ap-south-1
          port: 9100
          filters:
            - name: tag:Name
              values:
                - Target_server

Bug in the 3 lines below: whenever I filter on port, Prometheus stops service discovery.

            - name: port
              values:
                - 8080

      relabel_configs:
        - source_labels: [__meta_ec2_public_ip]
          regex: "(.*)"
          replacement: "${1}:9100"
          target_label: __address__
        - source_labels: [__meta_ec2_tag_Name]
          target_label: instance

Attempt -2

    global:
      scrape_interval: 15s
      external_labels:
        monitor: 'prometheus'

    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'myec2'
        ec2_sd_configs:
          - access_key:
            secret_key:
            region: ap-south-1
            port: 9100
            port_name: "http"
            #filters:
            # - name: tag:Name
            #   values:
            #     - Target_server
            # - name: tag:port
            #   values:
            #     - 8080
        relabel_configs:
          - source_labels: [__meta_ec2_tag_prometheus]
            regex: true.*
            action: keep
          - source_labels: [__meta_ec2_tag_node_exporter]
            regex: true.*
            action: keep
          - source_labels: [__meta_ec2_tag_port]
            regex: true.*
            action: keep
          - source_labels: [__meta_ec2_instance_id]
            target_label: instance
          - source_labels: [__meta_ec2_public_ip]
            regex: (.*)
            replacement: "${1}:9100"
            target_label: __address__
          # - source_labels: [tag]
          #   regex: 'port=(.*)'
          #   replacement: '${1}'
          #   target_label: port
          # - source_labels: [__meta_ec2_tag_CustomerID]
          #   target_label: CustomerID
          # - source_labels: [__meta_ec2_tag_Environment]
          #   target_label: Environment

Main problem or bug:

    filters:
      - name: tag:Name
        values:
          - Target_server

Bug in the 3 lines below: whenever I filter on port, Prometheus stops service discovery.

      - name: port
        values:
          - 8080
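
A likely explanation, offered as an assumption rather than a confirmed diagnosis: the `filters` entries are passed to the EC2 DescribeInstances API, where tags are addressed as `tag:<key>`. A bare `port` is not a tag filter, so the API call may be rejected and discovery stops. The sketch below filters on the `port` tag instead and builds the scrape address from the tag value; it assumes both servers carry a `port` tag (9100 on server1, 8080 on server2), which goes slightly beyond the steps described above.

```yaml
scrape_configs:
  - job_name: 'myec2'
    ec2_sd_configs:
      - region: ap-south-1
        port: 9100
        filters:
          - name: tag:Name
            values:
              - Target_server
          # Tag filters use the "tag:<key>" form; values are quoted strings.
          - name: tag:port
            values:
              - '9100'
              - '8080'
    relabel_configs:
      # Joins "<public_ip>:<port tag>" and uses it as the scrape address.
      - source_labels: [__meta_ec2_public_ip, __meta_ec2_tag_port]
        separator: ':'
        regex: '(.+):(\d+)'
        target_label: __address__
        replacement: '${1}:${2}'
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
```

If an instance has no `port` tag, the regex does not match and the default address (private IP with port 9100) is left unchanged.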

gamechanger1s avatar May 15 '23 05:05 gamechanger1s

Adding our use case here: we have a bunch of instances with multiple network interfaces, one with a publicly routed IP and one with an internal IP. Prometheus's ec2_sd only detects and provides the public interface details, and that interface is locked down hard. Something that would allow us to select the private IP on the other interface is needed to make this function correctly. These instances also scale horizontally depending on load, so there are no guarantees on what IP addresses they will have. Looking at the EC2 API, DescribeInstances provides a nice list of NetworkInterfaces with all the details for each interface, and the DeviceIndex property orders the interfaces, so we can easily tell which interface we'd be interested in.

Adding something like the below would solve it, I'd expect.

__meta_ec2_network_interface_%{DeviceIndex}_ipv4_addresses = PrivateIpAddresses[].PrivateIpAddress
__meta_ec2_network_interface_%{DeviceIndex}_ipv6_addresses = Ipv6Addresses[]
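
With labels like these, picking the internal address could be a single relabel rule. The sketch below is hypothetical throughout: the label name follows the proposal above with DeviceIndex 1, the value is assumed to be a comma-separated list, and port 9100 is an assumed exporter port.

```yaml
relabel_configs:
  # Proposed (not existing) label; DeviceIndex 1 is assumed to be the internal interface.
  # The regex takes the first entry of a comma-separated list,
  # with or without a leading separator.
  - source_labels: [__meta_ec2_network_interface_1_ipv4_addresses]
    regex: ',?([^,]+).*'
    target_label: __address__
    replacement: '${1}:9100'
```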

gwvandesteeg avatar May 18 '23 00:05 gwvandesteeg

Hello from the bug scrub.

We have a hard time vetting the current relevance of this. @SuperQ @roidelapluie, as you commented before, do you think this should be kept open? If so, could you provide guidance on what has to be done here?

beorn7 avatar Apr 30 '24 11:04 beorn7

I am not the one who opened the issue but we are facing the same issue.

We want to scrape our EC2 instances using their IPv6 addresses, but on some of them we have a software component (aws-vpc-cni) which adds and removes ENIs at runtime. This means that the __meta_ec2_ipv6_addresses label changes as well. This wouldn't be a big problem, as we can cut the first element of the list, but the list ordering does not follow the ENI device index values. Currently we use this relabel configuration for instances where we have only one ENI.

    - action: replace
      sourceLabels: [ __meta_ec2_ipv6_addresses ]
      targetLabel: primary_ipv6_address
      regex: ",(.+?),.*"
    - action: replace
      sourceLabels: [ primary_ipv6_address ]
      targetLabel: __address__
      replacement: "[$1]:9102"
    - action: labeldrop
      regex: ipv6_addresses

This might or might not work for instances where the ENIs are created at runtime, depending on the actual ordering of the __meta_ec2_ipv6_addresses label. For us any of these solutions would do:

  • Order the __meta_ec2_ipv6_addresses value based on the device index. We are not interested in keeping the gaps in the list (when, for example, you have device 0 and 2 but not 1), but someone else might need them.
  • Create per-device __meta_ec2_eni<idx>_* labels. This is a bigger change, and if we don't keep the old labels it is backward incompatible.

I think the first option is much more viable. I might be able to put together a PR for that.
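
For readers using a plain prometheus.yml rather than the camelCase relabeling syntax shown in this comment, the same idea can be sketched as one rule. It assumes, as the regex in the comment implies, that the label value is wrapped in commas, and it reuses port 9102 from the config above.

```yaml
relabel_configs:
  # Takes the first address from a value like ",addr1,addr2," and
  # wraps it in brackets, as required for IPv6 host:port targets.
  - source_labels: [__meta_ec2_ipv6_addresses]
    regex: ',([^,]+),.*'
    target_label: __address__
    replacement: '[${1}]:9102'
```

As noted above, this still depends on the ordering of the list, so it is only reliable for instances with a single ENI.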

akunszt avatar May 24 '24 12:05 akunszt