
There is no metric that represents errors that are not retries

Open amit-mazor opened this issue 1 year ago • 0 comments

Describe the bug

There are two different metrics that I use for monitoring Fluentd output behavior:

* fluentd_status_retry_count
* fluentd_output_status_num_errors

I noticed that they always show exactly the same value, meaning every retry is counted as an error. In my opinion, a retry does not immediately mean an error: the output destination might be under load or have a transient issue that causes Fluentd to retry. There is currently no way to separate a retry, which is fine in my case, from an actual error, for example the destination server being down.
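For reference, below is a sketch of the Prometheus rules I use to watch the two metrics side by side. The rule names and the 5m window are placeholders from my side, and depending on the label sets the exporter attaches, the subtraction may need an on()/ignoring() matcher.

    # Sketch only: metric names are the ones from my setup,
    # rule names and the 5m window are placeholders.
    groups:
      - name: fluentd-output-monitoring
        rules:
          - record: fluentd:retries_per_second
            expr: rate(fluentd_status_retry_count[5m])
          - record: fluentd:errors_per_second
            expr: rate(fluentd_output_status_num_errors[5m])
          # If errors and retries were counted separately, this difference would
          # become non-zero whenever an error happens that is not just a retry.
          # Today it is always zero.
          - record: fluentd:errors_minus_retries
            expr: fluentd_output_status_num_errors - fluentd_status_retry_count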

To test this and check whether I can tell an actual error at the output destination apart from a few retries, I scaled down the deployment of the destination, so no logs could be delivered, and set the retry limit to 5. What I saw was that every retry produced an error, and after 5 retries both metrics stopped increasing. What I expected to see was errors > retries, because the 5 retries had reached their limit while the error condition persisted: the destination was still down and no log could be accepted.

The issue here is that I have no clear way to tell whether my logs are simply being retried or whether there is an ongoing problem that no retry can solve. Currently I get alerts on 'errors' while those errors are only a few retries, and I can see in Fluentd's logs that the chunks were sent successfully after some retries. So this is not an actual error; I want to be able to monitor an actual error that prevents all of the logs from being sent.
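To make the alerting problem concrete, here is the shape of the alert I have today versus the one I would like to be able to write. This is only a sketch: the alert names, thresholds and durations are placeholders, and label matchers are omitted.

    groups:
      - name: fluentd-output-alerts
        rules:
          # What I have today: this fires even when chunks are only being
          # retried and are delivered successfully a moment later.
          - alert: FluentdOutputErrors
            expr: rate(fluentd_output_status_num_errors[5m]) > 0
            for: 10m
          # What I would like: fire only when errors keep happening although
          # retrying no longer helps. With the current behaviour this can never
          # fire, because the two metrics always report the same value.
          - alert: FluentdOutputUnrecoverableErrors
            expr: >
              rate(fluentd_output_status_num_errors[5m])
              > rate(fluentd_status_retry_count[5m])
            for: 10m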

To Reproduce

Configure an output destination that cannot actually be reached, set the retry limit to a finite number such as 5, and then watch the metrics.
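As a concrete example, this is the kind of output block I used for the test. It is a reduced sketch of my real configuration below; unreachable.example.invalid is a placeholder for a host that is down, and the retry limit is set to a finite number.

    outputs: |-
      [OUTPUT]
          Name          http
          Match         kube.*
          # placeholder for a destination that cannot be reached
          Host          unreachable.example.invalid
          Port          443
          URI           /logs/rest/singles
          Format        json_lines
          TLS           On
          # finite limit: after 5 failed retries the chunk is dropped
          Retry_Limit   5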

Expected behavior

I expect to see more errors than retries, not retries == errors, because the error is still happening after the retries have reached their maximum.

Your Environment

- Fluentd version: 1.14.6
- Running an image built from a Dockerfile with the following base image: v1.14.6-debian-forward-1.0
- Running on Kubernetes v1.21.7 with a Fluentd daemonset

Your Configuration

config: 
    service: |-
      [SERVICE]
          Daemon Off
          Flush 1
          Log_Level {{.Values.logLevel}}
          Parsers_File parsers.conf
          Parsers_File custom_parsers.conf
          HTTP_Server On
          HTTP_Listen 0.0.0.0
          HTTP_Port {{.Values.service.port}}
          Health_Check On
          storage.metrics on
    inputs: |-
      [INPUT]
          Name tail
          Path /var/log/containers/*.log
          multiline.parser docker, cri
          Tag kube.*
          Refresh_Interval 5
          Skip_Long_Lines On
          Mem_Buf_Limit 25MB
          DB /var/log/fluentbit-tail.db
      @INCLUDE input-systemd.conf
    filters: |-
      [FILTER]
          Name kubernetes
          Match kube.*
          K8S-Logging.Parser On
          K8S-Logging.Exclude On
          Use_Kubelet On
          Annotations Off
          Labels On
          Buffer_Size 0
          Keep_Log Off
          Merge_Log_Key log_obj
          Merge_Log On
      [FILTER]
          Name            nest
          Match           kube.*
          Operation       lift
          Nested_under    kubernetes
          Add_prefix      kubernetes.
      [FILTER]
          Name    modify
          Match   kube.*
          Copy    ${APP_NAME} applicationName
          Copy    ${SUB_SYSTEM} subsystemName 
      [FILTER]
          Name            nest
          Match           kube.*
          Operation       nest
          Wildcard        kubernetes.*
          Nest_under      kubernetes
          Remove_prefix   kubernetes.
      [FILTER]
          Name        nest
          Match       kube.*
          Operation   nest
          Wildcard    kubernetes
          Wildcard    log
          Wildcard    log_obj
          Wildcard    stream
          Wildcard    time 
          Nest_under  json 
      @INCLUDE filters-systemd.conf
    outputs: |-
      [OUTPUT]
          Name                  http
          Match                 kube.*
          Host                  ${ENDPOINT}
          Port                  443
          URI                   /logs/rest/singles
          Format                json_lines
          TLS                   On
          Header                private_key ${PRIVATE_KEY}
          compress              gzip
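          # Retry_Limit False means no limit: failed chunks are retried indefinitely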
          Retry_Limit           False
      @INCLUDE output-systemd.conf
    
    extraFiles:
      input-systemd.conf: |-    
        [INPUT]
          Name systemd
          Tag host.*
          Systemd_Filter _SYSTEMD_UNIT=kubelet.service
          Read_From_Tail On 
          Mem_Buf_Limit 5MB  
      filters-systemd.conf: |-
        [FILTER]
          Name    modify
          Match   host.*
          Add    applicationName ${APP_NAME_SYSTEMD} 
          Add    subsystemName ${SUB_SYSTEM_SYSTEMD} 
        [FILTER]
          Name        nest
          Match       host.*
          Operation   nest
          Wildcard    _HOSTNAME
          Wildcard    SYSLOG_IDENTIFIER
          Wildcard    _CMDLINE 
          Wildcard    MESSAGE
          Nest_under  json 
      output-systemd.conf: |-
        [OUTPUT]
          Name                  http
          Match                 host.*
          Host                  ${ENDPOINT}
          Port                  443
          URI                   /logs/rest/singles
          Format                json_lines
          TLS                   On
          Header                private_key ${PRIVATE_KEY}
          compress              gzip
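          # finite limit: after 10 failed retries the chunk is discarded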
          Retry_Limit           10

Your Error Log

The metrics
* fluentd_status_retry_count
* fluentd_output_status_num_errors
always show the exact same value.

Additional context

(screenshot attached)

amit-mazor · Sep 11 '22 15:09