in_kubernetes_events: support for net.* options including TCP keepalive settings

Open multi-io opened this issue 4 months ago • 0 comments

Support net.* options including TCP keepalive settings in the kubernetes_events plugin

This allows the user to set net.* options in the kubernetes_events input plugin config. It is particularly useful for configuring TCP keepalive settings: kubernetes_events opens a watch on the Kubernetes API, which is a long-running connection that can see long periods of inactivity, during which intermediate networking infrastructure such as proxies might silently drop the connection. The Go K8s client sends keepalives automatically. For example, kubectl get event -w (which opens a watch on k8s events similar to the kubernetes_events plugin) sends keepalives every 30s without the user having to configure anything. Those are HTTP/2 pings rather than raw zero-length TCP keepalives, but they serve the same purpose.

Testing

Before we can approve your change, please submit the following in a comment:

  • [x] Example configuration file for the change
  • [x] Debug log output from testing the change

Sample config:

    [INPUT]
        name kubernetes_events
        tag k8s_events
        kube_url https://kubernetes.default.svc
        interval_sec 120
        net.keepalive on
        # TCP keepalives every 20s, drop connection after 2 failed probes
        net.tcp_keepalive on
        net.tcp_keepalive_time 20
        net.tcp_keepalive_interval 20
        net.tcp_keepalive_probes 2
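
For readers less familiar with the mechanism, the three net.tcp_keepalive_* values above correspond conceptually to the standard per-socket keepalive options. The following is a minimal Python sketch of that mapping on Linux, not fluent-bit's actual implementation; the helper name enable_tcp_keepalive is hypothetical, and the TCP_KEEP* constants are only applied where the platform exposes them.

```python
import socket


def enable_tcp_keepalive(sock, idle_s, interval_s, probes):
    """Turn on TCP keepalive and, where supported, tune its timing.

    idle_s     ~ net.tcp_keepalive_time     (idle seconds before the first probe)
    interval_s ~ net.tcp_keepalive_interval (seconds between unanswered probes)
    probes     ~ net.tcp_keepalive_probes   (unanswered probes before giving up)
    The mapping to the fluent-bit option names is an assumption for illustration.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The constants below exist on Linux; other platforms may lack some of them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tcp_keepalive(s, idle_s=20, interval_s=20, probes=2)
    s.close()
```

With these values the kernel sends an empty ACK-only segment after 20s of silence, which matches the zero-length packets in the capture.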

tcpdump extract:

    # client (fluent-bit): fbit-5948f95f98-4frst.47272
    # server (K8s API): kubernetes.default.svc.cluster.local.https

    # regular event being reported by the API
    15:14:34.051110 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [P.], seq 436095:437316, ack 3239, win 490, options [nop,nop,TS val 2459710268 ecr 707590365], length 1221
    15:14:34.051134 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707603915 ecr 2459710268], length 0

    # TCP keepalives sent by fluent-bit and ACKed by API after 20s of inactivity, and then every 20s afterwards
    15:14:54.500436 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707624365 ecr 2459710268], length 0
    15:14:54.502250 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459730719 ecr 707603915], length 0
    15:15:14.980418 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707644845 ecr 2459730719], length 0
    15:15:14.981103 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459751198 ecr 707603915], length 0
    15:15:35.460342 eth0  Out IP fbit-5948f95f98-4frst.47272 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 437316, win 5866, options [nop,nop,TS val 707665325 ecr 2459751198], length 0
    15:15:35.462084 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-5948f95f98-4frst.47272: Flags [.], ack 3239, win 490, options [nop,nop,TS val 2459771679 ecr 707603915], length 0

Other config: Longer timeout (120s) to test reconnect

    net.tcp_keepalive on
    net.tcp_keepalive_time 120
    net.tcp_keepalive_interval 120
    net.tcp_keepalive_probes 1
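
As a rough guide to what these numbers mean in practice, the usual rule of thumb is that a peer which never answers is declared dead after the idle time plus one interval per allowed probe. The sketch below applies that arithmetic to the two sample configs; the function name is hypothetical and the formula is an approximation of typical Linux behaviour, not something measured in this PR.

```python
def worst_case_detection_seconds(keepalive_time, keepalive_interval, keepalive_probes):
    """Approximate upper bound on how long a silently dead peer goes unnoticed.

    The first probe is sent after `keepalive_time` idle seconds; each probe
    then waits `keepalive_interval` seconds for an answer, and the connection
    is dropped once `keepalive_probes` probes have gone unanswered.
    """
    return keepalive_time + keepalive_interval * keepalive_probes


# First sample config: time 20, interval 20, probes 2
print(worst_case_detection_seconds(20, 20, 2))    # 60

# Longer-timeout config: time 120, interval 120, probes 1
print(worst_case_detection_seconds(120, 120, 1))  # 240
```

This bound only applies when probes are dropped without any reply. In the capture that follows, the proxy answers the first probe with a reset, so the broken connection is noticed after roughly 120s.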

tcpdump:

    # regular event being reported by the API
    15:30:26.795408 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-6bf68b4f74-vc2dk.52592: Flags [P.], seq 4202315726:4202316947, ack 1626483173, win 524, options [nop,nop,TS val 1489545531 ecr 2059981423], length 1221
    15:30:26.795440 eth0  Out IP fbit-6bf68b4f74-vc2dk.52592 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 1221, win 761, options [nop,nop,TS val 2059994680 ecr 1489545531], length 0

    # Somewhere during this period, the proxy that's being used in this test drops the connection silently.

    # After 120s of inactivity, fluent-bit sends a keepalive probe, to which the proxy replies with a Reset packet because
    # it doesn't know the connection anymore:
    15:32:27.172331 eth0  Out IP fbit-6bf68b4f74-vc2dk.52592 > kubernetes.default.svc.cluster.local.https: Flags [.], ack 1221, win 761, options [nop,nop,TS val 2060115057 ecr 1489545531], length 0
    15:32:27.174135 eth0  In  IP kubernetes.default.svc.cluster.local.https > fbit-6bf68b4f74-vc2dk.52592: Flags [R], seq 4202316947, win 0, length 0

As a consequence, fluent-bit recreates the connection and catches up on the events that might have happened in the meantime:

    [2025/06/17 15:32:27] [error] [/src/fluent-bit/src/tls/openssl.c:904 errno=104] Connection reset by peer
    [2025/06/17 15:32:27] [error] [tls] syscall error: error:00000005:lib(0)::reason(5)
    [2025/06/17 15:32:27] [error] [http_client] broken connection to kubernetes.default.svc:443 ?
    [2025/06/17 15:32:27] [ warn] [input:kubernetes_events:kubernetes_events.0] kubernetes chunked stream error.
    [2025/06/17 15:32:27] [ info] [input:kubernetes_events:kubernetes_events.0] kubernetes stream disconnected, ret=-1
    [2025/06/17 15:33:01] [ info] [input:kubernetes_events:kubernetes_events.0] Requesting /api/v1/events?watch=1&resourceVersion=7869278
    k8s_events: [1750174158.000000000, {"metadata":{"name":"myevent-1750174158","namespace":"..
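
The catch-up works because the new watch request carries a resourceVersion, so the API can replay events from that point on. A minimal Python sketch of the idea follows. It assumes the standard Kubernetes watch stream format (one JSON object per line, with the version under object.metadata.resourceVersion); the helper names are hypothetical and this is not the plugin's actual code.

```python
import json
from urllib.parse import urlencode


def watch_path(resource_version=None):
    """Build the events watch path, resuming from resource_version if known."""
    query = {"watch": "1"}
    if resource_version:
        query["resourceVersion"] = resource_version
    return "/api/v1/events?" + urlencode(query)


def last_resource_version(stream_lines, current=None):
    """Return the resourceVersion of the most recent event in the stream.

    resourceVersion is treated as an opaque string: the latest value seen is
    remembered, never compared numerically.
    """
    for line in stream_lines:
        if not line.strip():
            continue
        event = json.loads(line)
        rv = event.get("object", {}).get("metadata", {}).get("resourceVersion")
        if rv:
            current = rv
    return current


# Illustrative stream content, not taken from the capture above.
lines = ['{"type": "ADDED", "object": {"metadata": {"resourceVersion": "7869278"}}}']
print(watch_path(last_resource_version(lines)))
```

After a reset like the one logged above, a client following this pattern would reissue the watch with the last version it saw instead of starting from scratch.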

multi-io, Jun 17 '25 16:06