helm-charts
My fluentd pods keep restarting
I am noticing that my fluentd pods keep restarting. They are still collecting logs and sending them to Elasticsearch, so the workflow isn't broken per se, but in the last 13 hours the fluentd pods have restarted 61 times.
Describe the bug
The logs indicate the following:
[warn]: unexpected error while calling stop on input plugin plugin=Fluent::Plugin::MonitorAgentInput plugin_id="monitor_agent" error_class=ThreadError error="killed thread"
[warn]: [elasticsearch] failed to flush the buffer. retry_time=1 next_retry_seconds=XXX chunk="XXXXXX" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster
[warn]: /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/root_agent.rb:291:in `shutdown'
Version of Helm and Kubernetes:
Helm Version:
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}
Kubernetes Version: 1.18.16
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.16-gke.502", GitCommit:"a2a88ab32201dca596d0cdb116bbba3f765ebd36", GitTreeState:"clean", BuildDate:"2021-03-08T22:06:24Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
Which version of the chart: latest
How to reproduce it (as minimally and precisely as possible):
extraConfigMaps:
  containers.input.conf: |-
    <source>
      @id fluentd-containers.log
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/containers.log.pos
      tag raw.kubernetes.*
      read_from_head true
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%NZ
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
      </parse>
    </source>
    # Detect exceptions in the log output and forward them as one log entry.
    <match raw.kubernetes.**>
      @id raw.kubernetes
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 5
      max_bytes 500000
      max_lines 1000
    </match>
    # Concatenate multi-line logs
    <filter **>
      @id filter_concat
      @type concat
      key message
      multiline_end_regexp /\n$/
      separator ""
    </filter>
    # Enriches records with Kubernetes metadata
    <filter kubernetes.**>
      @id filter_kubernetes_metadata
      @type kubernetes_metadata
    </filter>
    # Fixes json fields in Elasticsearch
    <filter kubernetes.**>
      @id filter_parser
      @type parser
      key_name log
      reserve_data true
      remove_key_name_field true
      <parse>
        @type multi_format
        <pattern>
          format json
        </pattern>
        <pattern>
          format none
        </pattern>
      </parse>
    </filter>
    # Exclude kube-system
    <match kubernetes.var.log.containers.**kube-system**.log>
      @type null
    </match>
    # Filter to only records with label fluentd=true
    <filter kubernetes.**>
      @type grep
      <regexp>
        key $.kubernetes.labels.fluentd
        pattern true
      </regexp>
    </filter>
    <filter kubernetes.**>
      @type grep
      <exclude>
        key $.kubernetes.container_name
        pattern istio-proxy
      </exclude>
    </filter>
helm upgrade --install fluentd fluentd-elasticsearch-11.9.0.tgz --namespace=logging --values=../logging/fluentd-values.yaml
It seems Elasticsearch is not reachable.
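One quick way to check is to query the cluster health endpoint from inside one of the fluentd pods. A minimal sketch, where the pod name and Elasticsearch host are placeholders for your setup (and curl may not be available in every fluentd image):

kubectl exec -n logging -it <fluentd-pod-name> -- sh
# From inside the pod, hit the host fluentd is configured to ship to:
curl -s "http://<elasticsearch-host>:9200/_cluster/health?pretty"

If that returns a green or yellow status, connectivity is fine and the buffer flush warnings point at load or buffer tuning rather than reachability.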
It is reachable in the sense that my logs are being sent to Elasticsearch.
@rileyhun If you're able to view the Kubernetes events in your cluster around the time your fluentd pods are restarting, look for an event whose event.involvedObject.name is the name of your fluentd pod. If you're on the latest version of this helm chart, I recently PRed a change such that the liveness probe now writes an error message when it fails; that message will show up in the Kubernetes event's event.message field. You should see something like "Liveness probe failed: Elasticsearch buffers found stuck longer than 300 seconds."
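For example, assuming the chart is deployed in the logging namespace (the pod name below is a placeholder):

kubectl get events -n logging \
  --field-selector involvedObject.name=<fluentd-pod-name> \
  --sort-by=.lastTimestamp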
Also, in that same PR I fixed an issue where the liveness probe would fail during periods of near-zero log shipping; the latest version of the chart includes that fix.
I am using the default ECK operator with default settings here, apart from setting the connection strings, and no matter how I fine-tune those I can't avoid occasional pod restarts. It looks like the buffer settings influence it. I am using an intra-cluster-only setup but still see the restarts. Could you please advise whether you have tested this with ECK, @Ghazgkull @monotek, and whether you use any fine-tuned settings for it? I am on LTS ECK / EKS 1.20.x.
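If buffer pressure is what trips the liveness probe, the standard fluentd <buffer> parameters on the Elasticsearch output are the place to experiment. A minimal sketch, assuming an intra-cluster ECK endpoint; the host and every value below are illustrative assumptions rather than tested recommendations, and how you inject this depends on your chart version's values:

<match **>
  @type elasticsearch
  # Host is a placeholder; point it at your ECK service (ECK creates a <cluster>-es-http service).
  host <elasticsearch-host>
  port 9200
  <buffer>
    @type file
    path /var/log/fluentd-buffers/kubernetes.system.buffer
    # Flush small chunks frequently so they don't sit long enough to look "stuck".
    flush_mode interval
    flush_interval 5s
    flush_thread_count 2
    chunk_limit_size 8M
    # Bound the queue and back off retries instead of letting them grow without limit.
    queue_limit_length 32
    retry_type exponential_backoff
    retry_max_interval 30
    # Drop the oldest chunk rather than blocking the tail input when the buffer is full.
    overflow_action drop_oldest_chunk
  </buffer>
</match>

Smaller chunks and a bounded queue trade some throughput for more predictable flush behaviour, which is usually the better trade when the liveness probe is watching for stuck buffers.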