[inputs.snmp] does not respect new network interfaces
Relevant telegraf.conf
[agent]
debug = false
snmp_translator = "gosmi"
interval = "5s"
flush_interval = "5s"
flush_jitter = "0s"
[[outputs.file]]
files = ["stdout"]
data_format = "influx"
[[inputs.snmp]]
agents = ["10.37.155.80"]
timeout = "1s"
retries = 0
[[inputs.snmp.field]]
oid = "1.3.6.1.2.1.1.3.0"
name = "sysUpTime"
[[inputs.ping]]
urls = ["10.37.155.80"]
method = "exec"
count = 1
timeout = 1.0
fieldinclude = ["result_code", "maximum_response_ms"]
System info
Telegraf 1.30.3, Ubuntu 22.04
Steps to reproduce
- Launch a new EC2 instance with limited network access
- Run telegraf. Observe that both inputs (ping and snmp) are failing
- Attach a privileged ENI to the instance and make it the default route
- Observe that ping is now working, but SNMP is not
- Restart telegraf process and observe that both inputs work
Expected behavior
All input plugins should behave the same way and pick up the new default source IP from the system route table.
Actual behavior
SNMP input plugin continues using the old source IP until the telegraf process is restarted.
Additional info
Context
I run Telegraf on Amazon EC2. The instance launches with a default network interface that is not permitted to reach most targets through the firewall. Shortly after boot, a system daemon attaches an ENI (elastic network interface) in the same subnet; the ENI has a static IP that is allowed through the firewall.
The Problem
After the new ENI is attached and made the default route, Telegraf's SNMP input plugin keeps sending packets with the old source IP (192.168.24.49), even though they now leave through eth1.
Details
Route table before adding ENI:
$ ip route
default via 192.168.24.1 dev eth0 proto dhcp src 192.168.24.49 metric 100
default via 192.168.24.1 dev eth0 metric 999
192.168.24.0/26 dev eth0 proto kernel scope link src 192.168.24.49 metric 100
192.168.24.1 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100
192.168.24.2 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100
Route table after adding ENI:
$ ip route
default via 192.168.24.1 dev eth1 metric 1
default via 192.168.24.1 dev eth0 proto dhcp src 192.168.24.49 metric 100
default via 192.168.24.1 dev eth1 proto dhcp src 192.168.24.51 metric 200
default via 192.168.24.1 dev eth0 metric 999
192.168.24.0/26 dev eth0 proto kernel scope link src 192.168.24.49 metric 100
192.168.24.0/26 dev eth1 proto kernel scope link src 192.168.24.51 metric 200
192.168.24.1 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100
192.168.24.1 dev eth1 proto dhcp scope link src 192.168.24.51 metric 200
192.168.24.2 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100
192.168.24.2 dev eth1 proto dhcp scope link src 192.168.24.51 metric 200
Telegraf log before adding the ENI:
2024-08-25T19:39:46Z E! [inputs.snmp] Error in plugin: agent 10.37.155.80: performing get on field sysUpTime: request timeout (after 0 retries)
2024-08-25T19:39:50Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 5s
Telegraf log after adding the ENI:
ping,host=aragorn-debug-i-0a885b5e0d5edd0a3,url=10.37.155.80 maximum_response_ms=33.831,result_code=0i 1724615055000000000
2024-08-25T19:44:21Z E! [inputs.snmp] Error in plugin: agent 10.37.155.80: performing get on field sysUpTime: request timeout (after 0 retries)
Telegraf log after restarting the process:
ping,host=aragorn-debug-i-0a885b5e0d5edd0a3,url=10.37.155.80 maximum_response_ms=33.721,result_code=0i 1724615265000000000
snmp,agent_host=10.37.155.80,host=aragorn-debug-i-0a885b5e0d5edd0a3 sysUpTime=682354i 1724615265000000000
tcpdump output:
# before adding ENI
19:54:50.001367 eth0 Out IP 192.168.24.49.33705 > 10.37.155.80.161: GetRequest(28) .1.3.6.1.2.1.1.3.0
19:54:50.103908 eth0 Out IP 192.168.24.49 > 10.37.155.80: ICMP echo request, id 291, seq 6, length 24
# after adding ENI
19:55:40.003930 eth1 Out IP 192.168.24.49.33705 > 10.37.155.80.161: GetRequest(28) .1.3.6.1.2.1.1.3.0
19:55:40.005212 eth1 Out IP 192.168.24.51 > 10.37.155.80: ICMP echo request, id 300, seq 1, length 24
19:55:40.038922 eth1 In IP 10.37.155.80 > 192.168.24.51: ICMP echo reply, id 300, seq 1, length 24
# after restarting telegraf
19:55:45.000496 eth1 Out IP 192.168.24.51.63628 > 10.37.155.80.161: GetRequest(28) .1.3.6.1.2.1.1.3.0
19:55:45.004776 eth1 Out IP 192.168.24.51 > 10.37.155.80: ICMP echo request, id 301, seq 1, length 24
19:55:45.038505 eth1 In IP 10.37.155.80 > 192.168.24.51: ICMP echo reply, id 301, seq 1, length 24
19:55:45.069883 eth1 In IP 10.37.155.80.161 > 192.168.24.51.63628: GetResponse(31) .1.3.6.1.2.1.1.3.0=730354
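In the capture above, the SNMP request keeps source 192.168.24.49 and port 33705 after the ENI is attached, and only gets a new source address and port after the restart. That is consistent with a long-lived connected UDP socket, although I have not confirmed that this is how the plugin opens its socket. The sketch below is not Telegraf or gosnmp code; it is a small Python illustration (the helper name connect_udp is made up, and loopback stands in for the SNMP agent) showing that a connected UDP socket fixes its local address at connect time and that only a newly created socket gets a fresh route lookup.

```python
import socket


def connect_udp(target):
    """Return a UDP socket connected to target (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # connect() on a UDP socket sends no packet. The kernel performs a route
    # lookup and records the chosen local IP and port on the socket.
    sock.connect(target)
    return sock


# Loopback listener standing in for the SNMP agent (assumption for the demo).
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))
listener.settimeout(2.0)
target = listener.getsockname()

sock = connect_udp(target)
pinned = sock.getsockname()

# Every datagram sent on the same socket carries the same source address.
sources = []
for _ in range(3):
    sock.send(b"poll")
    _, source = listener.recvfrom(64)
    sources.append(source)

# Only a newly created socket goes through source selection again.
fresh = connect_udp(target)
fresh_addr = fresh.getsockname()

print("pinned source:", pinned)
print("fresh source: ", fresh_addr)
```

On a host where the default route changes between the two connect calls, the fresh socket would be expected to pick the new source IP while the old one keeps its original address, which matches the before and after restart lines in the capture.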
Example commands to detach and re-attach ENI for debugging:
INSTANCE_ID=xxx
ENI_ID=xxx
ATTACHMENT_ID=$(aws --region us-west-2 --output text ec2 describe-network-interfaces --network-interface-ids $ENI_ID --query 'NetworkInterfaces[0].Attachment.AttachmentId')
aws --region us-west-2 ec2 detach-network-interface --attachment-id $ATTACHMENT_ID
aws --region us-west-2 ec2 attach-network-interface --device-index 1 --network-interface-id $ENI_ID --instance-id $INSTANCE_ID
Next steps: Check for errors in the SNMP plugin and reconnect on timeout (or on network errors in general).
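A rough sketch of what "reconnect on timeout or network error" could look like, written in Python for brevity rather than as a patch against the plugin. The class name ReconnectingUDPClient and its methods are hypothetical and not part of Telegraf or gosnmp. The idea is only that any timeout or OS-level network error discards the socket, so the next poll creates a new one and the kernel picks the source address again.

```python
import socket


class ReconnectingUDPClient:
    """Hypothetical request/response client that drops its socket on failure."""

    def __init__(self, target, timeout=1.0):
        self.target = target
        self.timeout = timeout
        self.sock = None

    def _connect(self):
        # A new socket means a new route lookup and a new local address.
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.target)

    def close(self):
        if self.sock is not None:
            self.sock.close()
            self.sock = None

    def request(self, payload):
        """Send payload and return one reply, reconnecting lazily."""
        try:
            if self.sock is None:
                self._connect()
            self.sock.send(payload)
            return self.sock.recv(65535)
        except OSError:
            # Covers timeouts (socket.timeout is an OSError subclass) as well
            # as errors such as "network is unreachable". Discard the socket
            # so the next call starts from scratch, then report the failure.
            self.close()
            raise
```

With this shape, a failed poll is still reported as an error for that interval, but the following poll no longer reuses a socket bound to a stale source address. Reconnecting on every timeout has a cost when the target is simply down, so a real implementation might want to limit how often it happens.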
@srebhan are you asking me to do something? Sorry I didn't understand your comment.
@llamafilm happy to see a PR from your side, but this was more a note to myself as I couldn't work on it immediately. ;-)
FWIW, this issue is still the same in 1.32.1.
What happens if you enable a retry in the snmp input?
That doesn't help. It still uses the old interface until telegraf is restarted.
2024-11-06T17:19:17Z E! [inputs.snmp] Error in plugin: agent 10.91.77.31: performing get on field sysUpTime: request timeout (after 1 retries)
I also noticed the opposite problem. If the interface used by inputs.snmp is removed, it fails like this:
2024-11-21T02:02:50Z E! [inputs.snmp] Error in plugin: agent 10.91.77.132: performing get on field projRunTime: write udp 192.168.24.59:40788->10.91.77.132:161: write: network is unreachable
(Instead, it should automatically switch to the other interface.)