telegraf [inputs.snmp] does not respect new network interfaces

Relevant telegraf.conf

[agent]
  debug = false
  snmp_translator = "gosmi"
  interval = "5s"
  flush_interval = "5s"
  flush_jitter = "0s"

[[outputs.file]]
  files = ["stdout"]
  data_format = "influx"

[[inputs.snmp]]
  agents = ["10.37.155.80"]
  timeout = "1s"
  retries = 0
  [[inputs.snmp.field]]
    oid = "1.3.6.1.2.1.1.3.0"
    name = "sysUpTime"

[[inputs.ping]]
  urls = ["10.37.155.80"]
  method = "exec"
  count = 1
  timeout = 1.0
  fieldinclude = ["result_code", "maximum_response_ms"]

System info

Telegraf 1.30.3, Ubuntu 22.04

Steps to reproduce

1. Launch a new EC2 instance with limited network access.
2. Run telegraf. Observe that both inputs (ping and snmp) are failing.
3. Attach a privileged ENI to the instance and make it the default route.
4. Observe that ping is now working, but SNMP is not.
5. Restart the telegraf process and observe that both inputs work.

Expected behavior

All input plugins should behave in the same way, using the new default source IP according to the system route table.

Actual behavior

SNMP input plugin continues using the old source IP until the telegraf process is restarted.

Additional info

Context

I run telegraf in Amazon EC2. The instance is launched with a default network interface that does not have permission to reach most targets through the firewall. Soon after boot, a system daemon attaches an ENI (elastic network interface) in the same subnet; the ENI has a static IP that is permitted through the firewall.

The Problem

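After adding a new ENI and making it the default route, Telegraf's SNMP input plugin continues sending packets from the old source IP. In the tcpdump output below, the SNMP requests leave through the new interface (eth1) but still carry the old source address 192.168.24.49 instead of 192.168.24.51.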

Details

Route table before adding ENI:

$ ip route
default via 192.168.24.1 dev eth0 proto dhcp src 192.168.24.49 metric 100 
default via 192.168.24.1 dev eth0 metric 999 
192.168.24.0/26 dev eth0 proto kernel scope link src 192.168.24.49 metric 100 
192.168.24.1 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100 
192.168.24.2 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100

Route table after adding ENI:

$ ip route
default via 192.168.24.1 dev eth1 metric 1 
default via 192.168.24.1 dev eth0 proto dhcp src 192.168.24.49 metric 100 
default via 192.168.24.1 dev eth1 proto dhcp src 192.168.24.51 metric 200 
default via 192.168.24.1 dev eth0 metric 999 
192.168.24.0/26 dev eth0 proto kernel scope link src 192.168.24.49 metric 100 
192.168.24.0/26 dev eth1 proto kernel scope link src 192.168.24.51 metric 200 
192.168.24.1 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100 
192.168.24.1 dev eth1 proto dhcp scope link src 192.168.24.51 metric 200 
192.168.24.2 dev eth0 proto dhcp scope link src 192.168.24.49 metric 100 
192.168.24.2 dev eth1 proto dhcp scope link src 192.168.24.51 metric 200

Telegraf log before adding the ENI:

2024-08-25T19:39:46Z E! [inputs.snmp] Error in plugin: agent 10.37.155.80: performing get on field sysUpTime: request timeout (after 0 retries)
2024-08-25T19:39:50Z W! [inputs.ping] Collection took longer than expected; not complete after interval of 5s

Telegraf log after adding the ENI:

ping,host=aragorn-debug-i-0a885b5e0d5edd0a3,url=10.37.155.80 maximum_response_ms=33.831,result_code=0i 1724615055000000000
2024-08-25T19:44:21Z E! [inputs.snmp] Error in plugin: agent 10.37.155.80: performing get on field sysUpTime: request timeout (after 0 retries)

Telegraf log after restarting the process:

ping,host=aragorn-debug-i-0a885b5e0d5edd0a3,url=10.37.155.80 maximum_response_ms=33.721,result_code=0i 1724615265000000000
snmp,agent_host=10.37.155.80,host=aragorn-debug-i-0a885b5e0d5edd0a3 sysUpTime=682354i 1724615265000000000

tcpdump output:

# before adding ENI
19:54:50.001367 eth0  Out IP 192.168.24.49.33705 > 10.37.155.80.161:  GetRequest(28)  .1.3.6.1.2.1.1.3.0
19:54:50.103908 eth0  Out IP 192.168.24.49 > 10.37.155.80: ICMP echo request, id 291, seq 6, length 24
# after adding ENI
19:55:40.003930 eth1  Out IP 192.168.24.49.33705 > 10.37.155.80.161:  GetRequest(28)  .1.3.6.1.2.1.1.3.0
19:55:40.005212 eth1  Out IP 192.168.24.51 > 10.37.155.80: ICMP echo request, id 300, seq 1, length 24
19:55:40.038922 eth1  In  IP 10.37.155.80 > 192.168.24.51: ICMP echo reply, id 300, seq 1, length 24
# after restarting telegraf
19:55:45.000496 eth1  Out IP 192.168.24.51.63628 > 10.37.155.80.161:  GetRequest(28)  .1.3.6.1.2.1.1.3.0
19:55:45.004776 eth1  Out IP 192.168.24.51 > 10.37.155.80: ICMP echo request, id 301, seq 1, length 24
19:55:45.038505 eth1  In  IP 10.37.155.80 > 192.168.24.51: ICMP echo reply, id 301, seq 1, length 24
19:55:45.069883 eth1  In  IP 10.37.155.80.161 > 192.168.24.51.63628:  GetResponse(31)  .1.3.6.1.2.1.1.3.0=730354
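
One possible reading of the capture above: the SNMP request keeps source port 33705 before and after the ENI is attached, which is what a single long-lived connected UDP socket would produce, while ping with method = "exec" starts a new process (and so a new socket) on every interval. The Python sketch below is not Telegraf code. It only illustrates, on loopback, that a connected UDP socket keeps the local address it received at connect time, whereas a socket opened later gets a fresh one. Whether the plugin's SNMP library actually holds its socket this way is an assumption here, and the loopback "agent" is a hypothetical stand-in.

```python
import socket

# Loopback stand-in for the SNMP agent (hypothetical; no real SNMP is spoken).
agent = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
agent.settimeout(2.0)
agent.bind(("127.0.0.1", 0))
target = agent.getsockname()

# Long-lived socket: connect() makes the kernel pick a source IP and port once.
long_lived = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
long_lived.connect(target)
pinned = long_lived.getsockname()

# Every later send reuses that pinned source address.
seen = []
for payload in (b"poll-1", b"poll-2"):
    long_lived.send(payload)
    _, source = agent.recvfrom(1024)
    seen.append(source)

# A socket created afterwards triggers a new lookup and gets its own address.
fresh = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fresh.connect(target)
fresh_addr = fresh.getsockname()

print("pinned:", pinned)
print("seen by agent:", seen)
print("fresh:", fresh_addr)
```

On a host whose routes change between the two connect() calls, the fresh socket would be expected to pick up the new source IP, which matches what the restart achieves in the capture.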

Example commands to detach and re-attach ENI for debugging:

INSTANCE_ID=xxx
ENI_ID=xxx
ATTACHMENT_ID=$(aws --region us-west-2 --output text ec2 describe-network-interfaces --network-interface-ids $ENI_ID --query 'NetworkInterfaces[0].Attachment.AttachmentId')
aws --region us-west-2 ec2 detach-network-interface --attachment-id $ATTACHMENT_ID
aws --region us-west-2 ec2 attach-network-interface --device-index 1 --network-interface-id $ENI_ID --instance-id $INSTANCE_ID

Aug 25 '24 20:08 llamafilm

Next steps: Check error in SNMP and reconnect on timeout (or network errors in general).
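
A rough illustration of the "reconnect on timeout or network error" idea, written in Python rather than against the plugin's actual code. The class name `ReconnectingUDPClient` and its attributes are hypothetical; the point is only that dropping the socket after a failure forces a fresh route and source-address lookup on the next request.

```python
import socket


class ReconnectingUDPClient:
    """Hypothetical wrapper: discard the UDP socket after any failed request."""

    def __init__(self, target, timeout=1.0):
        self.target = target
        self.timeout = timeout
        self.sock = None
        self.resets = 0  # how many times a failed socket was discarded

    def _connect(self):
        # A new socket means the kernel selects the source address again.
        if self.sock is not None:
            self.sock.close()
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.target)

    def request(self, payload):
        if self.sock is None:
            self._connect()
        try:
            self.sock.send(payload)
            return self.sock.recv(65535)
        except OSError:
            # Covers timeouts (socket.timeout is an OSError subclass) and
            # errors such as "network is unreachable".
            self.sock.close()
            self.sock = None
            self.resets += 1
            raise
```

With this shape, the first failed poll after a route change still reports an error, but the next poll opens a new socket instead of reusing the stale one. A real fix would also need to decide whether every timeout justifies a reconnect, since an agent that is merely slow would trigger one too.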

Aug 29 '24 09:08 srebhan

@srebhan are you asking me to do something? Sorry I didn't understand your comment.

Aug 29 '24 20:08 llamafilm

@llamafilm happy to see a PR from your side, but this was more a note to myself as I couldn't work on it immediately. ;-)

Aug 30 '24 06:08 srebhan

FWIW, this issue is still the same in 1.32.1.

Oct 15 '24 08:10 llamafilm

What happens if you enable a retry in the snmp input?

Nov 06 '24 10:11 Hipska

That doesn't help. It still uses the old interface until telegraf is restarted.

2024-11-06T17:19:17Z E! [inputs.snmp] Error in plugin: agent 10.91.77.31: performing get on field sysUpTime: request timeout (after 1 retries)

Nov 06 '24 17:11 llamafilm

I also noticed the opposite problem. If the interface used by inputs.snmp is removed, it fails like this:

2024-11-21T02:02:50Z E! [inputs.snmp] Error in plugin: agent 10.91.77.132: performing get on field projRunTime: write udp 192.168.24.59:40788->10.91.77.132:161: write: network is unreachable

(Instead, it should automatically switch to using the other interface.)
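
For both symptoms, a stale socket could in principle be detected before sending, by asking the kernel which source IP it would choose right now and comparing that with the address the existing socket is bound to. The sketch below is an assumption-laden Python illustration, not something proposed for the plugin in this thread; `current_source_ip` and `is_stale` are hypothetical helper names.

```python
import socket


def current_source_ip(target):
    """Return the source IP the kernel would pick for target at this moment."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends nothing; it only resolves the route.
        probe.connect(target)
        return probe.getsockname()[0]
    finally:
        probe.close()


def is_stale(sock, target):
    """True if sock is bound to a source IP the kernel would no longer choose."""
    try:
        return sock.getsockname()[0] != current_source_ip(target)
    except OSError:
        # For example, no route to the target at all: treat the socket as stale.
        return True
```

The trade-off is one extra socket and route lookup per check, so a caller would probably run it once per collection interval rather than per request.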

Nov 22 '24 21:11 llamafilm