fluent-plugin-scribe
Severe performance issues
We deployed fluentd to production using this plugin along with the out_redshift plugin.
Even during our initial benchmarks we saw that in_scribe gives far worse results than other input methods (in_forward, for example, was giving 18k msg/sec vs. 1k msg/sec with in_scribe). But when we pushed real production traffic with all the plugins set up (during the benchmark we used only in_scribe and out_file), it just couldn't handle the load (we're talking about ~300 msg/sec).
It looks like the culprit is that all message handling happens on the same thread that receives the Scribe messages, and Cool.io is not actually used. Very often, when processing gets delayed for some reason, the Scribe server gets a timeout and stops sending data until the retry period ends. But even then, after a minute or so it dies again.
We worked around this issue by having in_scribe enqueue all the messages into a Queue and having another thread call Engine.emit on the messages in the queue. But this is suboptimal and far from being "production ready".
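A minimal sketch of the workaround described above, not the original code (which was never published). `StubEngine` and `QueuedEmitter` are hypothetical names; `StubEngine` stands in for Fluentd's `Engine` so the example runs on its own. The receiving thread only enqueues, and a separate worker thread does the emitting:

```ruby
require 'thread'

# Hypothetical stand-in for Fluentd's Engine; it only records emit calls.
class StubEngine
  attr_reader :emitted

  def initialize
    @emitted = []
  end

  def emit(tag, time, record)
    @emitted << [tag, time, record]
  end
end

# Hypothetical helper: the Scribe/Thrift handler thread calls #push,
# and a single worker thread drains the queue and calls Engine.emit.
class QueuedEmitter
  def initialize(engine)
    @engine = engine
    @queue = Queue.new
    @worker = Thread.new { drain }
  end

  # Called from the receiving thread; returns without waiting for emit.
  def push(tag, time, record)
    @queue << [tag, time, record]
  end

  # Signal the worker to stop once everything queued so far is emitted.
  def shutdown
    @queue << :stop
    @worker.join
  end

  private

  def drain
    loop do
      item = @queue.pop          # blocks until something is queued
      break if item == :stop
      @engine.emit(*item)
    end
  end
end

engine = StubEngine.new
emitter = QueuedEmitter.new(engine)
3.times do |i|
  emitter.push("scribe.app", 1_700_000_000 + i, { "message" => "line #{i}" })
end
emitter.shutdown
```

Anything still sitting in the in-memory queue is lost if the process dies, which is why this is only a stopgap.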
Queuing objects in an input plugin outside of Fluentd's buffering is not crash-safe, so the fix you mentioned is hard to merge.
We may be able to fix it like this, to reduce the number of calls to Engine.emit() and also the processing time in the Thrift event handler:
# FluentScribeHandler
def Log(msgs)
  bucket = {} # tag -> events(array of [time,record])
  time_now = Engine.now
  begin
    msgs.each { |msg|
      record = create_record(msg)
      tag = @add_prefix ? @add_prefix + '.' + msg.category : msg.category
      bucket[tag] ||= []
      bucket[tag].push([time_now,record])
    }
    bucket.each { |tag,events|
      Engine.emit_array(tag, events)
    }
    return ResultCode::OK
  rescue => e
    $log.error "unexpected error", :error=>$!.to_s
    $log.error_backtrace
    return ResultCode::TRY_LATER
  end
end
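To illustrate the effect of grouping by tag, here is a small standalone sketch. `CountingEngine`, `Msg` and `emit_grouped` are hypothetical stand-ins (for Fluentd's `Engine`, a Scribe log entry, and the handler above), so it runs without Fluentd installed. Five messages across two categories result in two `emit_array` calls instead of five `emit` calls:

```ruby
# Hypothetical stand-in for Fluentd's Engine; records (tag, batch size) per call.
class CountingEngine
  attr_reader :calls

  def initialize
    @calls = []
  end

  def emit_array(tag, events)
    @calls << [tag, events.size]
  end
end

# Hypothetical stand-in for a Scribe log entry (category + message).
Msg = Struct.new(:category, :message)

# Group events by tag first, then emit once per tag.
def emit_grouped(engine, msgs, now)
  bucket = Hash.new { |hash, tag| hash[tag] = [] } # tag -> [[time, record], ...]
  msgs.each do |msg|
    bucket[msg.category] << [now, { "message" => msg.message }]
  end
  bucket.each { |tag, events| engine.emit_array(tag, events) }
end

counting_engine = CountingEngine.new
msgs = [
  Msg.new("web", "a"), Msg.new("db", "b"), Msg.new("web", "c"),
  Msg.new("web", "d"), Msg.new("db", "e")
]
emit_grouped(counting_engine, msgs, Time.now.to_i)
# counting_engine.calls is now [["web", 3], ["db", 2]]
```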
Thoughts?
As mentioned, what we did was only a workaround and not something that should be the solution.
From what I've seen, unless you make the Scribe/Thrift server work with Cool.io, any solution will be suboptimal.
@arikfr would you mind open sourcing your non-production-ready code? We've been running into similar issues.
We switched back to running scribe for input and are using fluentd tail to then move stuff across until we are done transitioning off scribe.
@hfwang we are no longer using Fluent and unfortunately I didn't keep that code.
@hfwang @arikfr so both of you continue to use Scribe? Any reason for not totally switching from Scribe to Fluentd? That would obviate the need for in_scribe altogether.
Sounds like arikfr is no longer using fluentd.
We have numerous legacy systems that continue to emit scribe logs. We don't have the engineering capacity to update everything at once, and as long as our servers don't fall over, it isn't a priority. New development uses fluentd though.
Our situation is pretty much the same as @hfwang described.
I can fix in_scribe with the code I mentioned in https://github.com/fluent/fluent-plugin-scribe/issues/6#issuecomment-23640557. But I'm not using in_scribe now, so I cannot test its effects.
@hfwang Can you test the fixed code if I push a branch?
Pushed https://github.com/fluent/fluent-plugin-scribe/tree/reduce_emit_times @hfwang Could you build, install and test this code?
git clone https://github.com/fluent/fluent-plugin-scribe.git
cd fluent-plugin-scribe
git checkout reduce_emit_times
bundle install
bundle exec rake build
gem install pkg/fluent-plugin-scribe-0.10.13.gem
# or fluent-gem install ...
# or td-agent-gem install ...
I'll take a look at this probably next week... but will do and thanks!