How to use Prometheus EC2 SD with Multiple AWS EC2 interfaces
Need some help with a Prometheus EC2 SD use case.
I have multiple VPCs set up with VPC peering, and an EC2 instance with several private interfaces and no public interfaces/public access available.
Setup example:
- VPC1 - EC2 instance which needs to be monitored
- VPC2 - Prometheus server
EC2 IPs (order is preserved):
- 192.168.1.10 (does not have routing outside VPC1)
- 192.168.2.10 (no routing outside VPC1, but peered with VPC2)
Prometheus IP:
- 192.168.2.9 (VPC2)
Simple Prometheus config with ec2_sd and node_exporter:
- job_name: 'test'
  ec2_sd_configs:
    - region: <region_name>
      port: 9100
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name]
      regex: 'test-instance'
      action: keep
As a result, the EC2 instance is discovered, but __address__ and __meta_ec2_private_ip always take the first IP in the preserved order (192.168.1.10), so the target cannot communicate with Prometheus and is reported as down.
Proposal
Use case: monitor EC2 instances with multiple interfaces across peered VPCs from a single Prometheus server.
Not sure how we could do it, as the EC2 API only returns one address: https://docs.aws.amazon.com/sdk-for-go/api/service/ec2/#Instance
This sounds related to #7086
We could use InstanceNetworkInterface at the expense of extra API calls
@brian-brazil @roidelapluie Thanks for the feedback. As far as I can see, #7086 relates to the IPv6 label only; does that mean we need to add an additional API call?
It'd seem odd if this needed an extra API call, nothing else has for EC2.
Sorry, I didn't get it completely, but what if the EC2 instance has no IPv6 addresses assigned and only has multiple private IPv4 addresses?
The place that PR is getting the v6 addresses from also has the v4 addresses. So by figuring out how to do v6, we can also do v4, now that we have a use case to figure out the right way to transform that data.
This is the case with gce_sd_configs as well; I tested it in GCP. It's a roadblock to using Prometheus service discovery in the cloud with multiple NICs where the first NIC is not the management NIC. And probably this is the same issue with azure_sd_configs too.
Should I raise separate issues for gce_sd_configs and azure_sd_configs? Is there any tentative plan to get a fix for this?
Can we have an option to pass in the index of the interface we want to grab the IP for? e.g. https://github.com/prometheus/prometheus/blob/main/discovery/gce/gce.go#L174
@jfreeland Although a name is more intuitive, being able to configure an index seems like the easiest way. We'd just need error handling for the index and support for all cloud offerings.
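Something along these lines, purely as a sketch of the idea (network_interface_index is a hypothetical option that does not exist in Prometheus today):

ec2_sd_configs:
  - region: <region_name>
    port: 9100
    # Hypothetical option: which NIC's private IP should populate
    # __address__ and __meta_ec2_private_ip for the discovered target.
    network_interface_index: 1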
Better than passing an index or name, we can include all interfaces as metadata:
__meta_gce_interface_eth0_ip
__meta_gce_interface_eth1_ip
Then you can dynamically select which IP to use based on other metadata, rather than needing to hard-code which interface to use in different discovery configs.
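For example, a relabel rule could then route scrapes to a specific NIC; a minimal sketch, assuming the proposed __meta_gce_interface_eth1_ip label existed (it is not exposed by GCE SD today):

relabel_configs:
  # Hypothetical: scrape the second interface's IP instead of the default one.
  - source_labels: [__meta_gce_interface_eth1_ip]
    regex: '(.+)'
    replacement: '${1}:9100'
    target_label: __address__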
I don't want to peer the VPCs. So I restricted the metrics to be available only to the Prometheus instance, using a security group managed with this Ansible script: https://github.com/AkashSivakumar/ansible-add-rule-to-all-security-group-aws.
I then edited prometheus.yml to scrape using the public IP instead of the private IP.
- job_name: 'test-ec2-sd'
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance
    - source_labels: [__meta_ec2_public_ip]
      replacement: ${1}:9100
      target_label: __address__
With this config Prometheus will use the public IP instead of the private IP as the scraping endpoint.
@roidelapluie @SuperQ @brian-brazil @silazare @meghdivya Hey guys, I think there is a bug in Prometheus that stops service discovery when filtering on port. Implementation steps:
1. Install Node Exporter on server1 and tag it with 9100 as per the Prometheus service discovery reference.
2. Tag server2 in AWS EC2 with key = port and value = 8080. Do not install anything on server2.
3. Set up Prometheus to understand that there are two servers running and that it needs to scrape metrics from port 9100 on server1 and port 8080 on server2.
I have exhausted all available resources on the internet, including official documentation, AI, Stack Overflow, and GitHub issues, in an attempt to solve our use case. Despite my efforts, none of the solutions have proven effective. In order to troubleshoot the issue, I attempted to break down the YAML code and implement it in a trial-and-error fashion. The configuration file I have been working with is the default Prometheus configuration file located at /etc/prometheus/prometheus.yml. Please find below the properly indented YAML code that I have arrived at through my experimentation.
Attempt 1
- job_name: 'myec2'
  ec2_sd_configs:
    - access_key:
      secret_key:
      region: ap-south-1
      port: 9100
      filters:
        - name: tag:Name
          values:
            - Target_server
    - access_key:
      secret_key:
      region: ap-south-1
      port: 9100
      filters:
        # Bug in the three lines below: whenever I filter on port, Prometheus stops service discovery.
        - name: port
          values:
            - 8080
  relabel_configs:
    - source_labels: [__meta_ec2_public_ip]
      regex: "(.*)"
      replacement: "${1}:9100"
      target_label: __address__
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance
Attempt 2
global:
  scrape_interval: 15s
  external_labels:
    monitor: 'prometheus'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'myec2'
    ec2_sd_configs:
      - access_key:
        secret_key:
        region: ap-south-1
        port: 9100
        port_name: "http"
        #filters:
        #  - name: tag:Name
        #    values:
        #      - Target_server
        #  - name: tag:port
        #    values:
        #      - 8080
    relabel_configs:
      - source_labels: [__meta_ec2_tag_prometheus]
        regex: true.*
        action: keep
      - source_labels: [__meta_ec2_tag_node_exporter]
        regex: true.*
        action: keep
      - source_labels: [__meta_ec2_tag_port]
        regex: true.*
        action: keep
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance
      - source_labels: [__meta_ec2_public_ip]
        regex: (.*)
        replacement: "${1}:9100"
        target_label: __address__
      # - source_labels: [tag]
      #   regex: 'port=(.*)'
      #   replacement: '${1}'
      #   target_label: port
      # - source_labels: [__meta_ec2_tag_CustomerID]
      #   target_label: CustomerID
      # - source_labels: [__meta_ec2_tag_Environment]
      #   target_label: Environment
Main problem or bug:

filters:
  - name: tag:Name
    values:
      - Target_server
  # Bug in the three lines below: whenever I filter on port, Prometheus stops service discovery.
  - name: port
    values:
      - 8080
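For what it's worth, the filters in ec2_sd_configs are passed to the EC2 DescribeInstances API, and "port" is not a valid EC2 filter name (an instance tag has to be referenced as tag:port), which would explain discovery stopping as soon as that filter is added. One way to get a per-instance scrape port without filtering at all is to derive it from the tag via relabeling; a minimal sketch, assuming the instances carry a tag with key "port":

relabel_configs:
  # If the instance has a "port" tag, rebuild __address__ as <public ip>:<tag value>;
  # instances without the tag keep the default <public ip>:9100 from the SD config.
  - source_labels: [__meta_ec2_public_ip, __meta_ec2_tag_port]
    regex: '(.+);(.+)'
    replacement: '${1}:${2}'
    target_label: __address__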
Adding our use case here: we have a bunch of instances with multiple network interfaces, one with a publicly routed IP and one with an internal IP. Prometheus's ec2_sd only detects and provides the public interface details, which is locked down hard. Something that would allow us to select the private IP on the other interface is needed to make this work correctly; these instances also scale horizontally depending on load, so there are no guarantees on what IP addresses they will have. Looking at the EC2 API, DescribeInstances provides a nice list of NetworkInterfaces with all the details for each interface, and the DeviceIndex property orders the interfaces so we can easily tell which one we're interested in.
Adding something like the below would solve it, I'd expect:
__meta_ec2_network_interface_%{DeviceIndex}_ipv4_addresses = PrivateIpAddresses[].PrivateIpAddress
__meta_ec2_network_interface_%{DeviceIndex}_ipv6_addresses = Ipv6Addresses[]
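With per-interface labels like these, choosing the interface would become a plain relabel rule; a minimal sketch, assuming the proposed label name above and a comma-separated address list (neither exists in Prometheus today):

relabel_configs:
  # Hypothetical: take the first IPv4 address of the interface at device index 1.
  - source_labels: [__meta_ec2_network_interface_1_ipv4_addresses]
    regex: '([^,]+).*'
    replacement: '${1}:9100'
    target_label: __address__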
Hello from the bug scrub.
We have a hard time vetting the current relevance of this. @SuperQ @roidelapluie, as you commented before, do you think this should be kept open? If so, could you provide guidance on what has to be done here?
I am not the one who opened the issue, but we are facing the same problem.
We want to scrape our EC2 instances using their IPv6 addresses, but on some of them we have a software component (actually aws-vpc-cni) which adds/removes ENIs at runtime. This means that the __meta_ec2_ipv6_addresses label changes as well. This wouldn't be a big problem, as we could cut the first element of the list, but the list ordering does not follow the ENI device index values.
Currently we use this relabel configuration for instances where we have only one ENI.
- action: replace
  sourceLabels: [ __meta_ec2_ipv6_addresses ]
  targetLabel: primary_ipv6_address
  regex: ",(.+?),.*"
- action: replace
  sourceLabels: [ primary_ipv6_address ]
  targetLabel: __address__
  replacement: "[$1]:9102"
- action: labeldrop
  regex: ipv6_addresses
This might or might not work for instances where the ENIs are created at runtime, depending on the actual ordering of the __meta_ec2_ipv6_addresses label.
For us, any of these solutions would do:
- Order the __meta_ec2_ipv6_addresses value based on the device index. We are not interested in keeping the gaps in the list (when, for example, you have device 0 and 2 but not 1), but someone else might need them.
- Create per-device __meta_ec2_eni<idx>_* labels. This is a bigger change, and if we don't keep the old labels it is backward incompatible.

I think the first option is much more viable; I might be able to put together a PR for that.