fluentd
Log file fills up with error: "incoming chunk is broken" after about 24 hours of heavy traffic using td-agent 2.2.1 on Linux
We are using the latest td-agent in production. After the agent has been running for about 24 hours, it starts logging "incoming chunk is broken" to /var/log/td-agent/td-agent.log. td-agent takes up 100% CPU and fills the disk with these log messages very quickly (about 5 GB/hr with our setup). We had to take td-agent out of production because this was unacceptable behavior and affected production performance. This is on the app-server side, and our td-agent configuration is here: http://pastebin.com/gEwKUPX1 We're basically just tailing nginx logs and letting the app send Rails logs directly to the td-agent socket, with failover servers downstream.
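The pastebin config is not reproduced here, but based on the description it presumably looks something like the sketch below (paths, tags, and hostnames are assumptions, not taken from the actual config):

```
<source>
  type tail
  path /var/log/nginx/access.log
  pos_file /var/lib/td-agent/nginx-access.pos
  format nginx
  tag nginx.access
</source>

<source>
  type forward        # the Rails app posts here via fluent-logger
  port 24224
</source>

<match **>
  type forward
  <server>
    host aggregator-1.example.com
    port 24224
  </server>
  <server>
    host aggregator-2.example.com
    port 24224
    standby           # failover server
  </server>
</match>
```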
Another data point: we're using the fluent-logger gem (v0.4.3) to send logs to td-agent, if that helps.
The messages are:

```
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=34
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=104
2015-08-19 18:56:37 +0000 [warn]: incoming chunk is broken: source="host: 127.0.0.1, addr: 127.0.0.1, port: 50157" msg=116
```
Very weird.
https://github.com/fluent/fluentd/blob/d6b2c505f8c59661bbeef30fa590530f2dda1fe9/lib/fluent/plugin/in_forward.rb#L129
Here is the receiving code. fluent-logger should send each event as an array object like `[tag, time, record]`. But from your log, your application is sending integer objects to fluentd.
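For reference, a minimal sketch of the event shape in_forward expects (values here are illustrative, not from the reporter's app):

```ruby
# Each event on the forward protocol is a ["tag", time, record] array,
# which fluent-logger serializes with MessagePack before writing to the socket.
tag    = "myapp.access"
time   = Time.now.to_i            # unix timestamp as an integer
record = { "agent" => "foo" }     # the record must be a Hash
event  = [tag, time, record]
# If in_forward unpacks an object that is not shaped like this
# (e.g. a bare integer), it logs "incoming chunk is broken ... msg=<object>".
```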
How do you use fluent-logger in your code?
```ruby
def send_data(tag=nil, payload=nil)
  tag = "radish.calls" if tag.blank?
  unless payload.blank?
    begin
      Timeout::timeout(CUSTOM_LOG_TIMEOUT) do
        response = Fluent::Logger.post(tag, payload)
      end
    rescue Timeout::Error
      Rails.logger.info "FluentD Custom log Write timed out : #{Time.now.utc}"
    rescue Exception => e
      Rails.logger.info "FluentD write exception :#{e.message}"
    end
  end
end
```
Ok, our developer who wrote the code says that we're using it like this:
```ruby
require 'fluent-logger'

log = Fluent::Logger::FluentLogger.new(nil, :host=>'localhost', :port=>24224)
unless log.post("myapp.access", {"agent"=>"foo"})
  p log.last_error # You can get last error object via last_error method
end
```
But he is saying that the error comes from td-agent on our app servers being unable to talk downstream to the td-agent server that aggregates all of the logs, even though the message is logged on the client app servers. Could this be the case? If so, is there a setting in td-agent.conf that will keep the agent quiet on the app-server side if communication to the downstream td-agent server is lost? When we restart the agent on the app servers, they stay happy for an hour or two; then, once they hit this error, they write several GB to td-agent.log in less than an hour, which can't happen in production on our app servers. Is there a setting to make it better behaved in this situation?
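I'm not aware of a td-agent.conf setting that throttles this specific warning. As a stopgap against filling the disk, the log file itself can be capped with logrotate; a sketch (the path matches the td-agent default, the thresholds are arbitrary):

```
/var/log/td-agent/td-agent.log {
    size 256M        # rotate as soon as the file exceeds 256 MB
    rotate 3         # keep at most 3 old copies
    compress
    missingok
    copytruncate     # td-agent keeps the fd open, so truncate in place
}
```

This only limits the damage; it does not address the underlying warning.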
Since turning off the app->td-agent logging, the log messages have disappeared and td-agent is stable again, just forwarding the log-tailing that remains. So we believe this issue is something inside td-agent that breaks after a couple of hours at high volume, and that it has to do with the receiving side of the TCP socket in td-agent on the app server, not td-agent->td-agent forwarding.
> the log messages have disappeared and td-agent is stable again just forwarding the log-tailing that remains
That means you use two input plugins, tail and forward, right?
> a couple of hours time at high volume and has something to do with the receiving-side of the tcp socket part of td-agent on the app server.
So the problem happens only at app (fluent-logger) -> td-agent (in_forward), not at td-agent (out_forward) -> td-agent (in_forward)? Could you tell me your log volume, requests/sec, the size of one log, and your machine spec? I want to reproduce this problem in our environment.
Correct. We were sending 50-100 lines per hit in the app, using Ruby/Passenger on a very busy site. td-agent would flip out more quickly when traffic was higher (at peak times it broke sooner). The shortest time between restarts was about 3-4 hours; the longest was around 12 hours. Let me know if I can help out some other way.
https://www.evernote.com/l/AClXQKXeFKNIj64x2h1Jh2cvk4a0tHo2VFY We were logging approx 10k log lines per minute (but spread across 40-ish servers) when this was enabled and working.
Thanks! I will try to reproduce this problem with similar log volume.
I have been testing the fluent-logger-ruby and in_forward combination with 300 logs/sec of traffic since yesterday, but there is no error. Does your app send logs to in_forward concurrently? N loggers -> 1 in_forward, or 1 logger -> 1 in_forward?
Multiple threads on the same machine to a single td-agent process.
> Multiple threads on the same machine
Are you sharing one fluent-logger between threads, or does each thread have its own fluent-logger?
Sharing one.
I have been testing fluent-logger using 6 threads at 300 records/sec for 2 days, but the above error doesn't happen. I'm still investigating this issue...
Same issue here. We run an application with Phusion Passenger, using fluent-logger-ruby as the normal (request/Rails) logger for the application, and also using it to write logs with another tag; that one uses a class variable in a controller, so it is shared between requests (and not threads, if I'm correct). Load is about 200 records/sec, and we're getting errors about broken chunks.
We have 16 Passenger Apps writing to the same single socket on the same machine.
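If the shared logger is the trigger, one workaround would be to keep one logger instance, and thus one socket, per thread. This is a hypothetical helper, not part of fluent-logger; the factory block would build the actual `Fluent::Logger::FluentLogger`:

```ruby
require 'thread'  # Mutex (bundled with Ruby; explicit require for older versions)

# Hypothetical helper: builds one logger per thread via the supplied
# factory block, so threads never interleave writes on a shared socket.
class PerThreadLogger
  def initialize(&factory)
    @factory = factory   # e.g. { Fluent::Logger::FluentLogger.new(nil, host: 'localhost', port: 24224) }
    @loggers = {}
    @mutex   = Mutex.new
  end

  # Returns the calling thread's own logger, creating it on first use.
  def current
    @mutex.synchronize { @loggers[Thread.current] ||= @factory.call }
  end
end
```

Usage would then be something like `POOL = PerThreadLogger.new { Fluent::Logger::FluentLogger.new(nil, host: 'localhost', port: 24224) }` and `POOL.current.post(tag, record)` inside requests.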
```
[deploy@myapp01 ~]$ ps auxw | grep RubyApp
deploy 1349 0.0 1.9 680216 147240 ? Sl 18:59 0:01 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 1659 0.1 3.9 960148 303968 ? Sl 18:42 0:11 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 5218 0.0 1.6 677072 126328 ? Sl 19:55 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 5456 0.0 1.6 748452 129212 ? Sl 20:10 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 8044 0.0 1.6 677112 122672 ? Sl 20:26 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 9929 0.0 1.7 748452 132584 ? Sl 19:39 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 12707 0.0 1.7 682880 134304 ? Sl 19:24 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 17563 0.0 1.7 748456 137524 ? Sl 19:07 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 18803 0.0 2.1 752568 165676 ? Sl 18:52 0:02 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 21917 0.3 3.2 884676 246792 ? Sl 18:34 0:26 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 23135 0.0 1.6 682920 128408 ? Sl 20:18 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 24635 0.0 1.6 678052 125280 ? Sl 20:02 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 24710 0.0 1.6 678164 127584 ? Sl 19:48 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 27315 0.0 1.7 682136 136072 ? Sl 19:15 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 29626 0.3 1.6 681920 126520 ? Sl 20:33 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 30958 0.0 1.7 682920 133560 ? Sl 19:31 0:00 Passenger RubyApp: /mnt/srv/rails/radish/current/public (production)
deploy 31651 0.0 0.0 103252 836 pts/0 S+ 20:34 0:00 grep RubyApp
```
Hmm... I need to set up a Passenger environment. Could you show me your Passenger configuration and version? I'm not familiar with Passenger, so I want to see an actual example.
https://www.phusionpassenger.com/documentation/Users%20guide%20Nginx.html
Thanks! I will try it.
Any followup on this issue?
Sorry, I'm busy with other tasks until this weekend. I will try this next week.
I set up nginx with Passenger in my EC2 environment, using the official demo app https://github.com/phusion/passenger-ruby-rails-demo . I am now sending 250 requests/sec to this app. I hope to reproduce this issue...
You also need to be sending logs to fluentd from within the app using that gem. Is your app sending logs?
Yes. The Rails app sends request.params with additional fields to td-agent's in_forward using fluent-logger 0.5.0. The number of workers is 4.
Hmm... I can't reproduce this problem after two days. In Passenger, how do you launch multiple Passenger apps? By setting up multiple server blocks in nginx.conf?
We only use a single instance. Maybe the trick is that we use a class variable to cache the connection between requests. So in a model I have:
```ruby
class TestClass
  def self.connection
    return @connection if defined? @connection
    fluent_config = YAML.load(ERB.new(Rails.root.join("config", "fluent-logger.yml").read).result)
    settings = {
      host: fluent_config['fluent_host'],
      port: fluent_config['fluent_port'],
      messages_type: 'string'
    }
    @connection = Fluent::Logger::FluentLogger.new(tag_prefix, settings)
  end

  def save
    self.class.connection.post(self.characteristic, self.serializable_hash)
  end

  ...
end
```
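One thing worth checking (an assumption on our side, not confirmed): Passenger forks worker processes, and a connection cached at class level before the fork leaves parent and children sharing one TCP socket; concurrent writes from multiple processes can interleave mid-message, which is exactly the kind of corruption in_forward reports as a broken chunk. A hypothetical per-process cache that rebuilds the logger after a fork:

```ruby
# Hypothetical helper: caches a connection per process id, so a logger
# created before a Passenger fork is rebuilt in each child process
# rather than shared across processes on one socket.
class PerProcessConnection
  def initialize(&factory)
    @factory = factory   # e.g. { Fluent::Logger::FluentLogger.new(tag_prefix, settings) }
  end

  def get
    if @conn.nil? || @pid != Process.pid
      @pid  = Process.pid   # remember which process built this connection
      @conn = @factory.call
    end
    @conn
  end
end
```

The `defined? @connection` check in the snippet above could be replaced with something along these lines so each worker gets its own socket.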
We have the same problem. I sent 3,340,095 records in 10 minutes (~6000/sec). After a few minutes, I found a lot of these logs:
```
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=50
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=49
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=55
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=48
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=44
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=123
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=99
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=105
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=100
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=58
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=50
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=44
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=115
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=105
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=100
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=34
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=58
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=49
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=51
2015-10-13 17:06:18 +0800 [warn]: incoming chunk is broken: source="host: 10.1.9.179, addr: 10.1.9.179, port: 24224" msg=57
```
@breath-co2 Do you use Passenger or a plain Ruby app?
@repeatedly I use the PHP td-agent client. The td-agent version is 0.12.15. The td-agent config looks like:
```
<source>
  type forward
  port 12400
</source>

<match **>
  type copy
  <store>
    type file
    ......
  </store>
  <store>
    type aws_s3
    ......
  </store>
  <store>
    type webhdfs
    ......
  </store>
</match>
```