
Log file fills up with error: "incoming chunk is broken" after about 24 hours of heavy traffic using 2.2.1 on Linux

scumola opened this issue 8 years ago • 67 comments

We're using the latest td-agent in production. After the agent has been running for about 24 hours, it starts complaining, logging "incoming chunk is broken" to /var/log/td-agent/td-agent.log. td-agent takes up 100% CPU and fills the disk with these log messages very quickly (about 5 GB/hr with our setup). We had to take td-agent out of production over this, as the behavior was unacceptable and affected our production performance. This was on the app-server side; our td-agent configuration is here: http://pastebin.com/gEwKUPX1. We're basically just tailing nginx logs and allowing the app to send Rails logs directly to the td-agent socket, with failover servers downstream.

scumola avatar Aug 19 '15 16:08 scumola

Another data point: we're using the fluent-logger gem (v0.4.3) to send logs to td-agent, if that helps.

scumola avatar Aug 19 '15 16:08 scumola

Messages are:

2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=34
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=104
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=116

scumola avatar Aug 19 '15 18:08 scumola

Very weird.

https://github.com/fluent/fluentd/blob/d6b2c505f8c59661bbeef30fa590530f2dda1fe9/lib/fluent/plugin/in_forward.rb#L129

Here is the logging code that emits that warning. fluent-logger should send each event as an array object like [tag, time, record], but judging from your log, your application is sending integer objects to fluentd. How do you use fluent-logger in your code?
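
For illustration, a minimal sketch (using the msgpack gem, which fluentd itself depends on) of what in_forward expects on the wire, and of why stray bytes decode as integers:

require 'msgpack'

# A well-formed forward-protocol event is a MessagePack-encoded array.
event = ["myapp.access", Time.now.to_i, { "agent" => "foo" }]
MessagePack.unpack(event.to_msgpack)
# => ["myapp.access", <unix time>, {"agent"=>"foo"}]

# If something corrupts the stream (for example, concurrent writers
# interleaving bytes on one socket), the unpacker loses framing and decodes
# stray bytes one at a time. In MessagePack, bytes 0x00-0x7f are positive
# fixints, so the byte for '"' (0x22) decodes as the integer 34, matching
# the msg=34 in the warnings above.
MessagePack.unpack("\x22".b) # => 34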

repeatedly avatar Aug 19 '15 19:08 repeatedly

  # Posts a payload to td-agent through the fluent-logger singleton, guarded
  # by a timeout so a stuck socket can't hang the request.
  def send_data(tag = nil, payload = nil)
    tag = "radish.calls" if tag.blank?
    unless payload.blank?
      begin
        Timeout::timeout(CUSTOM_LOG_TIMEOUT) do
          Fluent::Logger.post(tag, payload)
        end
      rescue Timeout::Error
        Rails.logger.info "FluentD custom log write timed out: #{Time.now.utc}"
      rescue Exception => e # note: this also swallows non-StandardError exceptions
        Rails.logger.info "FluentD write exception: #{e.message}"
      end
    end
  end

scumola avatar Aug 19 '15 20:08 scumola

Ok, our developer who wrote the code says that we're using it like this:

require 'fluent-logger'
log = Fluent::Logger::FluentLogger.new(nil, :host=>'localhost', :port=>24224)
unless log.post("myapp.access", {"agent"=>"foo"})
  p log.last_error # You can get last error object via last_error method
end

but he says the error is from td-agent on our app servers being unable to talk downstream to the td-agent server that aggregates all of the logs, even though the message is logged on the client app servers. Could this be the case? If so, is there a setting in td-agent.conf that will keep the agent quiet on the app-server side if communication to the downstream td-agent server is lost? When we restart the agent on the app servers, they stay happy for an hour or two; then, once they hit this error, they fill the td-agent.log file with several GB in under an hour, and that can't happen in production on our app servers. Is there a setting to make it better behaved in this situation?

scumola avatar Aug 26 '15 15:08 scumola
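
(An aside on the settings question, since it isn't answered directly in the thread: fluentd supports a per-plugin log_level parameter since v0.10.43, so a sketch like the following would silence the warn-level lines from one input. Note that this only hides the symptom; it does not fix the broken chunks.)

<source>
  type forward
  port 24224
  # warn-level messages from this input (including "incoming chunk is
  # broken") are suppressed; only error and above are logged
  log_level error
</source>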

Since turning off the app->td-agent logging, the log messages have disappeared and td-agent is stable again, just forwarding the tailed logs that remain. So we believe this issue is something inside td-agent that breaks after a couple of hours at high volume, and that it has something to do with the receiving side of the TCP socket in td-agent on the app server, not with td-agent->td-agent forwarding.

scumola avatar Aug 26 '15 17:08 scumola

the log messages have disappeared and td-agent is stable again, just forwarding the tailed logs that remain

It means you use two input plugins, tail and forward, right?

a couple of hours at high volume, and that it has something to do with the receiving side of the TCP socket in td-agent on the app server

So the problem happens only at app (fluent-logger) -> td-agent (in_forward), not at td-agent (out_forward) -> td-agent (in_forward)? Could you tell me your log volume (requests/sec and the size of one log) and your machine spec? I want to reproduce this problem in our environment.

repeatedly avatar Aug 27 '15 22:08 repeatedly

Correct. We were sending 50-100 lines per hit in the app, using Ruby/Passenger on a very busy site. td-agent would flip out more quickly when traffic was higher (at peak times it would freak out sooner). The shortest time between app restarts was about 3-4 hours; the longest was around 12 hours. Let me know if I can help out some other way.

scumola avatar Aug 28 '15 03:08 scumola

https://www.evernote.com/l/AClXQKXeFKNIj64x2h1Jh2cvk4a0tHo2VFY We were logging approximately 10k log lines per minute (spread across roughly 40 servers) when this was enabled and working.

scumola avatar Aug 28 '15 15:08 scumola

Thanks! I will try to reproduce this problem with similar log volume.

repeatedly avatar Sep 01 '15 02:09 repeatedly

I have been testing the fluent-logger-ruby and in_forward combination with 300 logs/sec of traffic since yesterday, but there is no error. Does your app send logs to in_forward concurrently? N loggers -> 1 in_forward, or 1 logger -> 1 in_forward?

repeatedly avatar Sep 03 '15 08:09 repeatedly

Multiple threads on the same machine to a single td-agent process.

scumola avatar Sep 03 '15 13:09 scumola

Multiple threads on the same machine

Are you sharing one fluent-logger between threads, or does each thread have its own fluent-logger?

repeatedly avatar Sep 04 '15 09:09 repeatedly

Sharing one.

scumola avatar Sep 04 '15 12:09 scumola

I have been testing fluent-logger using 6 threads at 300 records/sec for 2 days, but the above error doesn't happen. I'm still investigating this issue...

repeatedly avatar Sep 09 '15 08:09 repeatedly
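
(For reference, a driver matching the setup described above, with N threads sharing one fluent-logger, might look like the following sketch; the tag, rate, and payload are illustrative only.)

require 'fluent-logger'

# One logger instance shared by all threads, as in the reported setup.
logger = Fluent::Logger::FluentLogger.new(nil, host: 'localhost', port: 24224)

threads = 6.times.map do |i|
  Thread.new do
    loop do
      logger.post("myapp.repro", "thread" => i, "payload" => "x" * 100)
      sleep 0.02 # ~50 posts/sec per thread, ~300 records/sec overall
    end
  end
end
threads.each(&:join)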

Same issue here. We're running an application with Phusion Passenger, using fluent-logger-ruby as the normal (request/Rails) logger for the application. We also use it to write logs with another tag; that path uses a class variable in a controller, so it is shared between requests (though not between threads, if I'm correct). Load is about 200 records/sec, and we're getting errors about broken chunks.

AKoetsier avatar Sep 09 '15 14:09 AKoetsier

We have 16 Passenger app processes writing to the same single td-agent socket on the same machine:

[deploy@myapp01 ~]$ ps auxw | grep RubyApp
deploy    1349  0.0  1.9 680216 147240 ?       Sl   18:59   0:01 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    1659  0.1  3.9 960148 303968 ?       Sl   18:42   0:11 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    5218  0.0  1.6 677072 126328 ?       Sl   19:55   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    5456  0.0  1.6 748452 129212 ?       Sl   20:10   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    8044  0.0  1.6 677112 122672 ?       Sl   20:26   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy    9929  0.0  1.7 748452 132584 ?       Sl   19:39   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   12707  0.0  1.7 682880 134304 ?       Sl   19:24   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   17563  0.0  1.7 748456 137524 ?       Sl   19:07   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   18803  0.0  2.1 752568 165676 ?       Sl   18:52   0:02 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   21917  0.3  3.2 884676 246792 ?       Sl   18:34   0:26 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   23135  0.0  1.6 682920 128408 ?       Sl   20:18   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   24635  0.0  1.6 678052 125280 ?       Sl   20:02   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   24710  0.0  1.6 678164 127584 ?       Sl   19:48   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   27315  0.0  1.7 682136 136072 ?       Sl   19:15   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   29626  0.3  1.6 681920 126520 ?       Sl   20:33   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   30958  0.0  1.7 682920 133560 ?       Sl   19:31   0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)                                 
deploy   31651  0.0  0.0 103252   836 pts/0    S+   20:34   0:00 grep RubyApp

scumola avatar Sep 09 '15 20:09 scumola

Hmm... I need to set up a Passenger environment. Could you show me your Passenger configuration and version? I'm not familiar with Passenger, so I want to see an actual example.

repeatedly avatar Sep 10 '15 08:09 repeatedly

https://www.phusionpassenger.com/documentation/Users%20guide%20Nginx.html

scumola avatar Sep 10 '15 16:09 scumola

Thanks! I will try it.

repeatedly avatar Sep 14 '15 12:09 repeatedly

Any follow-up on this issue?

scumola avatar Sep 21 '15 19:09 scumola

Sorry, I'm busy with other tasks until this weekend. I will try this next week.

repeatedly avatar Sep 24 '15 11:09 repeatedly

I set up nginx with Passenger in my EC2 environment, using the official demo app https://github.com/phusion/passenger-ruby-rails-demo . Now I am sending 250 requests/sec to this app. I hope to reproduce this issue...

repeatedly avatar Oct 07 '15 11:10 repeatedly

You also need to be sending logs to fluentd from within the app using that gem. Is your app sending logs?

scumola avatar Oct 07 '15 13:10 scumola

Yes. The Rails app sends request.params with additional fields to td-agent's in_forward using fluent-logger 0.5.0. The number of workers is 4.

repeatedly avatar Oct 07 '15 14:10 repeatedly

Hmm... I can't reproduce this problem after two days. In Passenger, how do you launch multiple Passenger apps? By setting up multiple server blocks in nginx.conf?

repeatedly avatar Oct 09 '15 08:10 repeatedly

We only use a single instance. Maybe the trick is that we use a class variable to cache the connection between requests. So in a model I have:

class TestClass
  # Cache the connection in a class-level instance variable so it survives
  # across requests, which also means it is shared by every thread in the
  # process.
  def self.connection
    return @connection if defined? @connection
    fluent_config = YAML.load(ERB.new(Rails.root.join("config", "fluent-logger.yml").read).result)
    settings = {
      host: fluent_config['fluent_host'],
      port: fluent_config['fluent_port'],
      messages_type: 'string'
    }
    @connection = Fluent::Logger::FluentLogger.new(tag_prefix, settings)
  end

  def save
    self.class.connection.post(self.characteristic, self.serializable_hash)
  end

  ...
end

AKoetsier avatar Oct 12 '15 07:10 AKoetsier
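
(Side note, not a fix proposed in the thread: when one FluentLogger is cached like this and shared across threads, a defensive option is to serialize posts through a Mutex so that two threads cannot interleave their writes on the shared connection. A minimal sketch; the wrapper name is hypothetical:)

require 'fluent-logger'

# Hypothetical wrapper that funnels all posts through one Mutex.
class SafeFluentLogger
  def initialize(tag_prefix, settings)
    @logger = Fluent::Logger::FluentLogger.new(tag_prefix, settings)
    @mutex  = Mutex.new
  end

  def post(tag, record)
    @mutex.synchronize { @logger.post(tag, record) }
  end
end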

We have the same problem. I sent 3,340,095 messages in 10 minutes (~6,000/s). After a few minutes I found a very large number of these logs:

2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=50
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=49
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=48
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=44
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=123
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=99
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=105
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=100
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=58
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=50
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=44
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=115
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=105
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=100
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=58
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=49
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=51
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=57

breath-co2 avatar Oct 13 '15 09:10 breath-co2
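
(Worth decoding: the msg values above are the ASCII byte values of a JSON-like string; 123 is '{', 34 is '"', 58 is ':'. That suggests the unpacker is reading raw bytes one at a time after losing MessagePack framing. A quick check in Ruby:)

[55, 50, 55, 49, 55, 48, 44, 123, 34, 99, 105, 100, 34, 58].pack("c*")
# => "727170,{\"cid\":"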

@breath-co2 Do you use Passenger or a plain Ruby app?

repeatedly avatar Oct 13 '15 09:10 repeatedly

@repeatedly I use a PHP td-agent client. The td-agent version is 0.12.15. The td-agent config is like this:

<source>
  type forward
  port 12400
</source>

<match **>
  type copy
  <store>
    type file
    ......
  </store>
  <store>
    type aws_s3
    ......
  </store>
  <store>
    type webhdfs
    ......
  </store>
</match>

breath-co2 avatar Oct 13 '15 09:10 breath-co2