helm-charts
My fluentd pods keep restarting
I am noticing that my fluentd pods keep restarting. They are still collecting logs and sending them to Elasticsearch, so the workflow isn't broken per se, but in the last 13 hours the fluentd pods have restarted 61 times.
Describe the bug
The logs indicate the following:
[warn]: unexpected error while calling stop on input plugin plugin=Fluent::Plugin::MonitorAgentInput plugin_id="monitor_agent" error_class=ThreadError error="killed thread"
[warn]: [elasticsearch] failed to flush the buffer. retry_time=1 next_retry_seconds=XXX chunk="XXXXXX" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster
[warn]: /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/root_agent.rb:291:in `shutdown'
Version of Helm and Kubernetes:
Helm Version:
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}
Kubernetes Version: 1.18.16
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-gke.502", GitCommit:"a2a88ab32201dca596d0cdb116bbba3f765ebd36", GitTreeState:"clean", BuildDate:"2021-03-08T22:06:24Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
Which version of the chart: latest
How to reproduce it (as minimally and precisely as possible):
extraConfigMaps:
  containers.input.conf: |-
    <source>
      @id fluentd-containers.log
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/containers.log.pos
      tag raw.kubernetes.*
      read_from_head true
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%NZ
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
      </parse>
    </source>
    # Detect exceptions in the log output and forward them as one log entry.
    <match raw.kubernetes.**>
      @id raw.kubernetes
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 5
      max_bytes 500000
      max_lines 1000
    </match>
    # Concatenate multi-line logs
    <filter **>
      @id filter_concat
      @type concat
      key message
      multiline_end_regexp /\n$/
      separator ""
    </filter>
    # Enriches records with Kubernetes metadata
    <filter kubernetes.**>
      @id filter_kubernetes_metadata
      @type kubernetes_metadata
    </filter>
    # Fixes json fields in Elasticsearch
    <filter kubernetes.**>
      @id filter_parser
      @type parser
      key_name log
      reserve_data true
      remove_key_name_field true
      <parse>
        @type multi_format
        <pattern>
          format json
        </pattern>
        <pattern>
          format none
        </pattern>
      </parse>
    </filter>
    # Exclude kube-system
    <match kubernetes.var.log.containers.**kube-system**.log>
      @type null
    </match>
    # Filter to only records with label fluentd=true
    <filter kubernetes.**>
      @type grep
      <regexp>
        key $.kubernetes.labels.fluentd
        pattern true
      </regexp>
    </filter>
    <filter kubernetes.**>
      @type grep
      <exclude>
        key $.kubernetes.container_name
        pattern istio-proxy
      </exclude>
    </filter>
helm upgrade --install fluentd fluentd-elasticsearch-11.9.0.tgz --namespace=logging --values=../logging/fluentd-values.yaml
It seems Elasticsearch is not reachable.
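One quick way to check is to query the cluster health endpoint from inside one of the fluentd pods. A minimal sketch, where the pod name and Elasticsearch host are placeholders for your setup (and curl may not be available in every fluentd image):

kubectl exec -n logging -it <fluentd-pod-name> -- sh
# From inside the pod, hit the host fluentd is configured to ship to:
curl -s "http://<elasticsearch-host>:9200/_cluster/health?pretty"

If that returns a green or yellow status, connectivity is fine and the buffer flush warnings point at load or buffer tuning rather than reachability.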
It is reachable in the sense that my logs are being sent to Elasticsearch.
@rileyhun If you're able to view the Kubernetes events in your cluster around the time your fluentd pods are restarting, look for an event whose event.involvedObject.name is the name of your fluentd pod. If you're on the latest version of this helm chart, I recently PRed a change such that the liveness probe now writes an error message when it fails; that message will show up in the Kubernetes event's event.message field. You should see something like "Liveness probe failed: Elasticsearch buffers found stuck longer than 300 seconds."
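For example, assuming the chart is deployed in the logging namespace (the pod name below is a placeholder):

kubectl get events -n logging \
  --field-selector involvedObject.name=<fluentd-pod-name> \
  --sort-by=.lastTimestamp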
Also, in that same PR I fixed an issue where the liveness probe would fail during periods of near-zero log shipping; the latest version of the chart includes that fix.
I am using the default ECK operator with default settings here, apart from setting the connection strings, and no matter how I fine-tune those I can't avoid occasional pod restarts. It looks like the buffer settings influence it. I am using an intra-cluster-only setup but still see the restarts. Could you please advise whether you have tested this with ECK, @Ghazgkull @monotek, and whether you use any fine-tuned settings for it? I am on LTS ECK / EKS 1.20.x.
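If buffer pressure is what trips the liveness probe, the standard fluentd <buffer> parameters on the Elasticsearch output are the place to experiment. A minimal sketch, assuming an intra-cluster ECK endpoint; the host and every value below are illustrative assumptions rather than tested recommendations, and how you inject this depends on your chart version's values:

<match **>
  @type elasticsearch
  # Host is a placeholder; point it at your ECK service (ECK creates a <cluster>-es-http service).
  host <elasticsearch-host>
  port 9200
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.system.buffer
    # Flush small chunks frequently so they don't sit long enough to look "stuck".
    flush_mode interval
    flush_interval 5s
    flush_thread_count 2
    chunk_limit_size 8M
    # Bound the queue and back off retries instead of letting them grow without limit.
    queue_limit_length 32
    retry_type exponential_backoff
    retry_max_interval 30
    # Drop the oldest chunk rather than blocking the tail input when the buffer is full.
    overflow_action drop_oldest_chunk
  </buffer>
</match>

Smaller chunks and a bounded queue trade some throughput for more predictable flush behaviour, which is usually the better trade when the liveness probe is watching for stuck buffers.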