fluentd-kubernetes-sumologic

Log lines merged to single message

Open chris-vest opened this issue 6 years ago • 8 comments

On certain occasions the logs coming from our services are not correctly parsed. As a consequence, thousands of log lines are crammed together into one big message sent to Sumo Logic. Because the message is so large, it overflows Sumo Logic's limit for an individual message and the content is split into multiple messages. As a result, it loses its JSON formatting and becomes "invisible" to many queries that assume the log entries are structured JSON.

This seems to be occurring only with Kafka Connect; when certain operations are performed it logs very noisily.

I have noticed that sumologic-fluentd is being OOM killed. I'm wondering if these log lines are being stuffed into a single message in order to 'catch up' after sumologic-fluentd is restarted?

chris-vest avatar Jan 14 '19 12:01 chris-vest

Hi @chris-vest, thanks for reporting this issue. It sounds like the multiline detection is creating it. What is your MULTILINE_START_REGEXP setting, and are there any sample log lines that failed to be parsed?

bin3377 avatar Jan 22 '19 19:01 bin3377

@bin3377

MULTILINE_START_REGEXP = /^\w{3} \d{1,2}, \d{4}/
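
For context, that regex treats any line beginning with a date such as "Jan 30, 2019" as the start of a new record, and every following line that does not match is appended to the record being assembled. Assuming the image wires MULTILINE_START_REGEXP into a concat-style filter, the effective config is roughly the sketch below (not the exact generated config; the filter tag and key are illustrative):

    <filter containers.**>
      @type concat
      key log
      # A new record starts on lines like "Jan 30, 2019 ...";
      # non-matching lines are appended to the current record.
      multiline_start_regexp /^\w{3} \d{1,2}, \d{4}/
    </filter>

If the busy Kafka Connect output stops matching this pattern, the filter keeps appending indefinitely, which would explain a single enormous merged record.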

Example log: https://pastebin.com/SPr2rxFZ

Just to clarify, this only seems to happen after we enable a new connector. That causes Connect to log a huge number of lines very quickly, which seems to overload fluentd somehow, and the massive, squashed log message ends up in Sumo Logic. Restarting the fluentd pods then fixes the issue.

chris-vest avatar Jan 30 '19 15:01 chris-vest

I think something is still not clear. After Connect produces these "huge lines", what do they look like? Is each line valid JSON terminated with \n? On our side, we use fluentd's memory buffer by default when streaming records out to the Sumo backend, so there is an 8 MB size limit per chunk. The Sumo HTTP endpoint also has a suggested payload size of 100 KB to 1 MB, so I would suggest never generating a single payload over 1 MB.
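
For reference, that chunk cap lives in the output plugin's <buffer> section; a minimal sketch (values illustrative, matching the sizes mentioned above):

    <buffer>
      # Default memory-buffer chunk limit is 8 MB; the Sumo HTTP endpoint
      # prefers payloads of roughly 100 KB to 1 MB, so cap chunks lower.
      chunk_limit_size 1m
    </buffer>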

bin3377 avatar Feb 13 '19 19:02 bin3377

So Kafka Connect is logging very frequently after a new Connector is configured, which ends up overloading fluentd. The resulting log in Sumo Logic looks like this: https://pastebin.com/SPr2rxFZ

The initial logs from Kafka Connect are no different from the logs it normally creates; the only difference is their frequency.

chris-vest avatar Feb 14 '19 10:02 chris-vest

I am still seeing this issue, this is what I'm getting from the sumologic-fluentd pods:

sumologic-fluentd-sumologic-fluentd-m8d54 sumologic-fluentd-sumologic-fluentd 2019-05-13 09:42:52 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferChunkOverflowError error="a 13726192bytes record is larger than buffer chunk limit size" location="/var/lib/gems/2.3.0/gems/fluentd-1.2.6/lib/fluent/plugin/buffer.rb:677:in `block in write_step_by_step'" tag="containers.mnt.log.containers.msk-connect-7c74548dcf-jrv7g_strimzi_msk-connect-c2c8a2ddf227310d7f3cf27ac8ff38c7188e025292e99c42a60d8a57c35e1434.log"

chris-vest avatar May 14 '19 07:05 chris-vest

@bin3377 I've upgraded to 2.3.0 to see if this is mitigated at all.

Looking at the buffer settings, I'm a little confused about where to put them when configuring fluentd. I know I need to set chunk_limit_size, but which file should that go in? sumo.out.conf?

chris-vest avatar May 14 '19 09:05 chris-vest

@bin3377 I've tried adding this:

        out.sumo.buffer.conf: |-
          <match **>
            @type sumologic
            log_key "log"
            endpoint xxxxxx
            verify_ssl true
            log_format "json"
            flush_interval 5s
            num_threads 1
            open_timeout 60
            add_timestamp true
            timestamp_key "timestamp"
            proxy_uri ""
            <buffer>
              chunk_limit_size 1m
              flush_thread_count 1
              flush_interval 5s
            </buffer>
          </match>

But I'm still seeing log lines merged into single, large log messages in Sumo Logic; once this happens, fluentd stops sending messages to Sumo Logic.

chris-vest avatar Jun 07 '19 14:06 chris-vest

I think one of your original comments was correct: when I trigger the event in my cluster and the application is logging busily, the log lines are no longer picked up by the regex. However, shouldn't setting chunk_limit_size prevent fluentd from crashing?
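
For reference, chunk_limit_size only caps how large a buffer chunk may grow; a single record that already exceeds the limit is not split but rejected with the BufferChunkOverflowError shown earlier, and the multiline merge can still build such a record. Assuming the merge is done by a concat-style filter, one possible mitigation is to bound how long lines can be appended to one record (a sketch; option names are from fluent-plugin-concat, and the tag, key, and label are illustrative):

    <filter containers.**>
      @type concat
      key log
      multiline_start_regexp /^\w{3} \d{1,2}, \d{4}/
      # Flush a partially assembled record after 5s even if no new
      # "start" line has been seen, instead of appending indefinitely.
      flush_interval 5s
      # Route records flushed by the timeout back into the normal
      # pipeline instead of treating them as errors.
      timeout_label @NORMAL
    </filter>

This bounds how large any one merged record can grow when the regex stops matching.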

chris-vest avatar Jun 07 '19 15:06 chris-vest