fluent-bit
in_kubernetes_events: support for net.* options including TCP keepalive settings
Support net.* options including TCP keepalive settings in the kubernetes_events plugin
This allows the user to set net.* options in the kubernetes_events input plugin config. This is particularly useful for configuring TCP keepalive settings: kubernetes_events opens a watch on the Kubernetes API, a long-running connection that can sit idle for long periods, during which intermediate networking infrastructure such as proxies might silently drop it. The Go K8s client sends keepalives automatically. For example, kubectl get event -w (which opens a watch on k8s events, much like the kubernetes_events plugin) sends a keepalive every 30s without the user having to configure anything. Those are HTTP/2 pings rather than raw zero-length TCP keepalives, but they serve the same purpose.
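For readers unfamiliar with TCP keepalives, settings of this kind are usually applied per socket. The following is a minimal sketch, assuming Linux and assuming that the net.tcp_keepalive* options correspond to the standard SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT socket options. The helper name enable_tcp_keepalive is hypothetical and this is not fluent-bit's actual code.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of fluent-bit): enable TCP keepalive on fd.
 *   idle_s:     seconds of inactivity before the first probe (TCP_KEEPIDLE)
 *   interval_s: seconds between unanswered probes (TCP_KEEPINTVL)
 *   probes:     unanswered probes before the connection is dropped (TCP_KEEPCNT)
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int enable_tcp_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    /* Turn keepalive on for this socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
        return -1;
    }
    /* Linux-specific tuning of the keepalive timers. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) != 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) != 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) != 0) {
        return -1;
    }
    return 0;
}
```

Under these assumptions, a call such as enable_tcp_keepalive(fd, 20, 20, 2) would correspond to a first probe after 20s of inactivity, further probes every 20s, and a dropped connection after 2 unanswered probes.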
Testing
Before we can approve your change, please submit the following in a comment:
- [x] Example configuration file for the change
- [x] Debug log output from testing the change
Sample config:
[INPUT]
    name kubernetes_events
    tag k8s_events
    kube_url https://kubernetes.default.svc
    interval_sec 120
    net.keepalive on
    # TCP keepalives every 20s, drop connection after 2 failed probes
    net.tcp_keepalive on
    net.tcp_keepalive_time 20
    net.tcp_keepalive_interval 20
    net.tcp_keepalive_probes 2
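As a rough guide to what these three values mean together: with standard Linux keepalive semantics (an assumption here, not something stated in this PR), a peer that has gone completely silent is declared dead after about keepalive_time + keepalive_interval * keepalive_probes seconds. A peer or proxy that answers a probe with a reset is detected at the first probe instead. The helper below is a hypothetical illustration of that arithmetic.

```c
/*
 * Hypothetical helper: worst-case seconds until a silent peer is declared
 * dead, assuming Linux semantics: the first probe is sent after time_s of
 * inactivity, then one probe every interval_s, and the connection is
 * dropped once `probes` probes have gone unanswered.
 */
int keepalive_detect_secs(int time_s, int interval_s, int probes)
{
    return time_s + interval_s * probes;
}
```

For the sample config above (time 20, interval 20, probes 2), this gives 20 + 20 * 2 = 60 seconds.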
tcpdump extract:
# client (fluent-bit): fbit-5948f95f98-4frst.47272
# server (K8s API): kubernetes.default.svc.cluster.local.https
# regular event being reported by the API
15:14:34.051110 eth0 In IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [P.], seq 436095:437316, ack 3239, win 490, options [nop,nop,TS val 2459710268 ecr 707590365], length 1221
15:14:34.051134 eth0 Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707603915 ecr 2459710268], length 0
# TCP keepalives sent by fluent-bit and ACKed by API after 20s of inactivity, and then every 20s afterwards
15:14:54.500436 eth0 Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707624365 ecr 2459710268], length 0
15:14:54.502250 eth0 In IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459730719 ecr 707603915], length 0
15:15:14.980418 eth0 Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707644845 ecr 2459730719], length 0
15:15:14.981103 eth0 In IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459751198 ecr 707603915], length 0
15:15:35.460342 eth0 Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707665325 ecr 2459751198], length 0
15:15:35.462084 eth0 In IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459771679 ecr 707603915], length 0
Other config: Longer timeout (120s) to test reconnect
net.tcp_keepalive on
net.tcp_keepalive_time 120
net.tcp_keepalive_interval 120
net.tcp_keepalive_probes 1
tcpdump:
# regular event being reported by the API
15:30:26.795408 eth0 In IP kubernetes.default.svc.cluster.local.https > fbit-6bf68b4f74-vc2dk.52592: Flags [P.], seq 4202315726:4202316947, ack 1626483173, win 524, options [nop,nop,TS val 1489545531 ecr 2059981423], length 1221
15:30:26.795440 eth0 Out IP fbit-6bf68b4f74-vc2dk.52592 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 1221, win 761, options [nop,nop,TS val 2059994680 ecr 1489545531], length 0
# Somewhere during this period, the proxy that's being used in this test drops the connection silently.
# After 120s of inactivity, fluent-bit sends a keepalive probe, to which the proxy replies with a Reset packet because
# it doesn't know the connection anymore:
15:32:27.172331 eth0 Out IP fbit-6bf68b4f74-vc2dk.52592 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 1221, win 761, options [nop,nop,TS val 2060115057 ecr 1489545531], length 0
15:32:27.174135 eth0 In IP kubernetes.default.svc.cluster.local.https > fbit-6bf68b4f74-vc2dk.52592: Flags [R], seq 4202316947, win 0, length 0
As a consequence, fluent-bit recreates the connection and catches up on the events that might have happened in the meantime:
[2025/06/17 15:32:27] [error] [/src/fluent-bit/src/tls/openssl.c:904 errno=104] Connection reset by peer
[2025/06/17 15:32:27] [error] [tls] syscall error: error:00000005:lib(0)::reason(5)
[2025/06/17 15:32:27] [error] [http_client] broken connection to kubernetes.default.svc:443 ?
[2025/06/17 15:32:27] [ warn] [input:kubernetes_events:kubernetes_events.0] kubernetes chunked stream error.
[2025/06/17 15:32:27] [ info] [input:kubernetes_events:kubernetes_events.0] kubernetes stream disconnected, ret=-1
[2025/06/17 15:33:01] [ info] [input:kubernetes_events:kubernetes_events.0] Requesting /api/v1/events?watch=1&resourceVersion=7869278
k8s_events: [1750174158.000000000, {"metadata":{"name":"myevent-1750174158","namespace":"..