fluent-bit
When the ingestion endpoint is not reachable, the health endpoint should return a 5xx HTTP error.
$ kubectl version
Client Version: v1.30.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.5+IKS
Fluent Bit v3.1.4-ibm
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io/
______ _ _ ______ _ _ _____ __
| ___| | | | | ___ (_) | |____ |/ |
| |_ | |_ _ ___ _ __ | |_ | |_/ /_| |_ __ __ / /`| |
| _| | | | | |/ _ \ '_ \| __| | ___ \ | __| \ \ / / \ \ | |
| | | | |_| | __/ | | | |_ | |_/ / | |_ \ V /.___/ /_| |_
\_| |_|\__,_|\___|_| |_|\__| \____/|_|\__| \_/ \____(_)___/
Registering the logger-agent-plugin CommitSHA: e3e664b3cde6cd9f120036d6767cd0717f546b12
Registering the logger-icl-output-plugin with commitSHA: c257a37dc8119d8906e1be191998ed8d4a4beb3c
[2024/10/14 13:52:52] [ info] [fluent bit] version=3.1.4-ibm, commit=, pid=1
[2024/10/14 13:52:52] [ info] [storage] ver=1.5.2, type=memory+filesystem, sync=normal, checksum=off, max_chunks_up=192
[2024/10/14 13:52:52] [ info] [storage] backlog input plugin: storage_backlog.1
[2024/10/14 13:52:52] [ info] [cmetrics] version=0.9.1
[2024/10/14 13:52:52] [ info] [ctraces ] version=0.5.2
I have the Fluent Bit DaemonSet running in my Kubernetes cluster, and when I enter the logging pod I see:
bash-5.1$ ps -Af
UID PID PPID C STIME TTY TIME CMD
10000 1 0 1 Oct14 ? 00:28:54 /fluent-bit/bin/fluent-bit --config=/fluent-bit/etc/fluent-bit.conf
10000 33 0 0 14:34 pts/0 00:00:00 /bin/bash
10000 44 33 0 14:34 pts/0 00:00:00 ps -Af
K8s pod config:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /api/v1/health/
    port: 8081
    scheme: HTTP
Yet when the configuration is bad or a firewall blocks the ingestion endpoint, the readiness probe fails.
If I open a shell in the pod:
bash-5.1$ curl localhost:8081/api/v1/health
curl: (7) Failed to connect to localhost port 8081: Connection refused
This response is misleading.
If the process is up, the health endpoint should return 500 or similar, not Connection refused.
It would also help to add an HTTP reason header or a log line describing the true nature of the configuration issue.
Connection refused should be reserved for severe cases, where the process fails to start (for example, due to a null pointer exception) or crashes due to OOM.
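To make the distinction above concrete, here is a minimal sketch of a client that separates "the server answered with an error status" from "nothing answered at all". The function name `classify_health` and the three labels are hypothetical, chosen only for illustration; only the Python standard library is used.

```python
import urllib.error
import urllib.request

# Ignore proxy settings from the environment so that localhost checks
# go straight to the target port.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def classify_health(url: str, timeout: float = 1.0) -> str:
    """Return 'healthy', 'unhealthy' or 'unreachable' for a health URL.

    The name and the labels are illustrative, not part of Fluent Bit.
    """
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return "healthy" if 200 <= resp.status < 300 else "unhealthy"
    except urllib.error.HTTPError:
        # The server is up and answered with a 4xx/5xx status.
        return "unhealthy"
    except (urllib.error.URLError, OSError):
        # No listener on the port (connection refused), timeout, or DNS error.
        return "unreachable"
```

Calling `classify_health("http://localhost:8081/api/v1/health")` inside the pod described here would presumably return `"unreachable"`, since curl reports Connection refused, whereas the behaviour requested in this issue would correspond to `"unhealthy"`.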
Please follow the template and provide all the relevant details required, including config, version, environment, etc.
I presume you're using this? https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
Based on the Fluent Bit documentation, the health endpoint should behave as follows:
The health endpoint returns an HTTP status 500 and an error message. Otherwise, the endpoint returns HTTP status 200 and an ok message.
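As far as I understand the linked documentation, the check is threshold-based: with `Health_Check On` in the `[SERVICE]` section, the endpoint reports unhealthy when the output error count exceeds `HC_Errors_Count` or the retry failure count exceeds `HC_Retry_Failure_Count` within `HC_Period`. The sketch below only illustrates that rule. It is not Fluent Bit's implementation; the names `HealthThresholds` and `health_status` are hypothetical, the default of 5 for both thresholds is taken from the documentation as I recall it, and the response bodies are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HealthThresholds:
    errors_count: int = 5         # assumed to mirror HC_Errors_Count
    retry_failure_count: int = 5  # assumed to mirror HC_Retry_Failure_Count


def health_status(
    errors: int,
    retry_failures: int,
    limits: HealthThresholds = HealthThresholds(),
) -> Tuple[int, str]:
    """Return (HTTP status, message) for counters observed in one period."""
    unhealthy = (
        errors > limits.errors_count
        or retry_failures > limits.retry_failure_count
    )
    return (500, "error") if unhealthy else (200, "ok")
```

Under this rule, a blocked ingestion endpoint would only turn the status to 500 once enough flush errors or retry failures accumulate in the period, and only if the HTTP server is actually listening on the probed port.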
DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    version: 1.3.1
  creationTimestamp: "2024-03-17T11:24:21Z"
  generation: 57
  labels:
    app: logger-agent-ds
    version: 1.3.1
  name: logger-agent-ds
  namespace: ibm-observe
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: logger-agent-ds
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2024-09-12T11:13:58Z"
      creationTimestamp: null
      labels:
        app: logger-agent-ds
        name: logger-agent-ds
        version: 1.3.1
    spec:
      containers:
      - args:
        - --config=/fluent-bit/etc/fluent-bit.conf
        command:
        - /fluent-bit/bin/fluent-bit
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: observe/logs-router-agent:1.3.1
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health/
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        name: fluent-bit
        ports:
        - containerPort: 2020
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/v1/health/
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 20
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 701m
            ephemeral-storage: 10Gi
            memory: 3Gi
          requests:
            cpu: 100m
            ephemeral-storage: 2Gi
            memory: 1Gi
        securityContext:
          capabilities:
            add:
            - DAC_READ_SEARCH
          privileged: false
          runAsGroup: 10000
          runAsUser: 10000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/run/secrets/tokens
          name: vault-token
        - mountPath: /var/log
          name: varlog
          readOnly: true
        - mountPath: /var/data
          name: vardata
          readOnly: true
        - mountPath: /var/log/fluent-bit
          name: varlogfluentbit
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /fluent-bit/etc/
          name: logger-agent-config
        - mountPath: /fluent-bit/cache
          name: fluent-bit-cache
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: all-icr-io
      initContainers:
      - command:
        - scripts/make_db_dir.sh
        image: observe/logs-router-agent-init:1.3.1
        imagePullPolicy: Always
        name: create-db-dir
        resources: {}
        securityContext:
          privileged: true
          runAsGroup: 0
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/log
          name: varlog
        - mountPath: /var/data
          name: vardata
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: logger-agent-sa
      serviceAccountName: logger-agent-sa
      terminationGracePeriodSeconds: 10
      tolerations:
      - operator: Exists
      volumes:
      - name: vault-token
        projected:
          defaultMode: 420
          sources:
          - serviceAccountToken:
              audience: iam
              expirationSeconds: 7200
              path: vault-token
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/data
          type: ""
        name: vardata
      - hostPath:
          path: /var/log/fluent-bit
          type: ""
        name: varlogfluentbit
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - configMap:
          defaultMode: 420
          name: logger-agent-config
        name: logger-agent-config
      - emptyDir:
          sizeLimit: 11Gi
        name: fluent-bit-cache
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.