fluent-plugin-forest

Buffered files not flushed when fluentd is restarted after a failure

Open garthgoodson opened this issue 10 years ago • 16 comments

Forest creates a new output plugin for each tag / path it sees. When the file buffer is used and fluentd restarts after a failure, the buffered data is not flushed to the output plugin on startup, because the output plugins are not initialized until an event matching that tag arrives.

To solve this problem, forest should: 1) scan through all the configured buffer_paths, 2) list all the stored buffers in those paths, 3) regenerate the list of output plugins based on the tags found in the filename of those buffered files (e.g., by calling plant(tag)).

garthgoodson avatar Jun 12 '14 23:06 garthgoodson
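
The three steps proposed above (scan the buffer paths, list the stored buffers, regenerate plugins from the tags in the filenames) might be sketched roughly as follows. This is a minimal illustration, not forest's code: the chunk naming pattern `<prefix>.<tag>.<b|q><hex id>.log` is an assumption and real Fluentd buffer filenames may differ between versions, `leftover_tags` and `replant_leftovers` are hypothetical helpers, and the block passed to `replant_leftovers` stands in for forest's `plant(tag)`.

```ruby
require "set"

# Assumed (hypothetical) chunk naming: "<prefix>.<tag>.<b|q><hex id>.log".
# Real Fluentd file-buffer names may differ between versions.
CHUNK_NAME = /\A(?<prefix>[^.]+)\.(?<tag>.+)\.(?<state>[bq])(?<id>\h+)\.log\z/

# Steps 1 and 2: walk every configured buffer directory and collect the
# distinct tags that still have chunk files on disk.
def leftover_tags(buffer_dirs)
  tags = Set.new
  buffer_dirs.each do |dir|
    next unless Dir.exist?(dir)          # skip paths that no longer exist
    Dir.children(dir).each do |name|
      m = CHUNK_NAME.match(name)
      tags << m[:tag] if m               # ignore files that are not chunks
    end
  end
  tags.to_a.sort
end

# Step 3: re-create one sub-plugin per leftover tag. The block stands in
# for forest's plant(tag), which is not reproduced here.
def replant_leftovers(buffer_dirs)
  leftover_tags(buffer_dirs).each { |tag| yield tag }
end
```

Only directory listings are read here, so the cost grows with the number of chunk files rather than with their size.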

I've observed this behavior with this plugin and it's causing a tremendous amount of log loss. I'm actually using it on an aggregator host and I've watched files be left behind because of the td-agent subprocess dying.

eredding-rmn avatar Jul 07 '14 03:07 eredding-rmn

@garthgoodson interesting solution, but how would forest know where the buffer paths of its child plants are?

pitr avatar Aug 06 '14 14:08 pitr

I'm curious @pitr @tagomoris - what is it about the fluent-plugin-forest startup procedure that makes it skip the initialization process it normally carries out? I would presume this could easily be solved by having the plugin initiate startup in exactly the same fashion.

eredding-rmn avatar Aug 06 '14 14:08 eredding-rmn

Forest "plants" sub-plugins as it sees new records, so during the initialization phase it doesn't know what records it will see (or, in the case of this bug, NOT see).

pitr avatar Aug 06 '14 14:08 pitr

This is definitely a difficult problem, but it is important to flush buffers that were left behind by a previous process (and whose plugins have not been instantiated yet). I have some ideas for solving this problem (none of them easy):

  • Fluentd core provides initialization steps that we can use to scan leftover buffer files
    • this costs too much to pay in the normal start process of the forest plugin
      • What should we do when flushes in the start-up sequence fail after a long time?
    • Fluentd now has a --without-source option to flush leftover buffers
    • If we can add hook points to scan and flush buffers in the plugin's start-up sequence, we can afford even a heavy cost to scan buffer_path for buffers to flush
  • We keep some storage that holds metadata about the buffers to be flushed
    • This requires many more dependencies, which is not acceptable for usual cases...
  • Fluentd provides a KVS for plugins that is persisted across the process lifecycle
    • Of course, that is very difficult to implement

I'm wondering which way is best. What do you think about these solutions?

tagomoris avatar Aug 06 '14 16:08 tagomoris

It seems the second solution is the only one that doesn't require a change to fluentd. Also, that's the one I'd prefer.

One concern I'd like to bring up is, during initialization, forest plugin should only initialize sub-plugins that did NOT flush properly, not all sub-plugins it ever saw.

Also, we need to be careful with situations where there was a config change and a sub-plugin that didn't have its buffer flushed is no longer defined.

pitr avatar Aug 06 '14 16:08 pitr

We solved this in our code by adding a buffer path to the configuration that we scan on startup. Without this we could not use the plugin. It would be unacceptable to effectively drop buffered data. I'm not sure why this cost would be high.

garthgoodson avatar Aug 06 '14 16:08 garthgoodson

I'm guessing that while the plugin is scanning the buffer_path, it cannot receive events. Is this correct? If so, say you had 50000 8 MB files left over, either from a plugin that crashed and was re-planted by forest or from td-agent as a whole crashing: it could take a long time to drain these files, all the while blocking the handoff from forest to the planted output.

I think this, as an option, could be acceptable with a big warning sign on it:

  • If we can add hook points to scan-and-flush buffers on start-up sequence of plugin, we can pay any heavy cost to scan buffer_path to be flushed

eredding-rmn avatar Aug 06 '14 17:08 eredding-rmn

@garthgoodson would you mind sharing a diff?

eredding-rmn avatar Aug 06 '14 17:08 eredding-rmn

I think one can minimize the time. Basically, the scan should just cache the filenames locally (it should be pretty fast since no data is read, just file/dir metadata); once that is done the data transfer for those files can begin, and new data can be collected.

garthgoodson avatar Aug 06 '14 17:08 garthgoodson
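
The scan-then-transfer idea in the comment above could look something like the sketch below: the scan touches only file metadata and returns a cached list of paths, and the actual transfer then runs on a background thread so that new events are not blocked. This assumes one directory of chunk files; `scan_chunks`, `drain_in_background`, and the `flush` callable are hypothetical names, not part of Fluentd or forest.

```ruby
# Phase 1: collect chunk paths using directory metadata only; no file
# contents are read, so this stays fast even with many large chunks.
def scan_chunks(buffer_dir)
  Dir.children(buffer_dir)
     .map { |name| File.join(buffer_dir, name) }
     .select { |path| File.file?(path) }
     .sort_by { |path| [File.mtime(path), path] }   # oldest chunks first
end

# Phase 2: drain the cached list on a background thread so the caller can
# return immediately and keep accepting new events.
# `flush` is a hypothetical callable standing in for the real output write.
def drain_in_background(paths, flush)
  queue = Queue.new
  paths.each { |path| queue << path }
  queue.close                            # pop returns nil once the queue is empty
  Thread.new do
    while (path = queue.pop)
      flush.call(File.binread(path))
      File.delete(path)                  # delete only after flush returned
    end
  end
end
```

If `flush` raises, the thread stops and the remaining files stay on disk, so a later restart could pick them up again; retry handling is left out of the sketch.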

I am running into this problem as well. Has anyone come up with a good solution?

jmoseley avatar Jan 09 '15 22:01 jmoseley

Hmm, I've got an idea: add a buffer_directory_path parameter to fluent-plugin-forest, and have the forest plugin generate the buffer_path parameter automatically when it starts plugin instances. This feature would let us find and recover all buffer files at the time fluentd starts.

But I think this feature may seem too magical to many users. What do you all think?

tagomoris avatar Jan 11 '15 04:01 tagomoris
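
One way to read the buffer_directory_path proposal is as a reversible mapping between tags and buffer paths under a single directory. The sketch below is only an illustration of that reading: buffer_directory_path is the parameter name proposed in the thread and not an existing option, the one-subdirectory-per-tag layout is an assumption, and `buffer_path_for` and `tags_with_buffers` are hypothetical helpers.

```ruby
require "uri"

# Derive the buffer_path for one planted sub-plugin. The tag is URL-encoded
# so that characters such as "/" cannot escape the per-tag subdirectory.
# (Assumed layout: <buffer_directory_path>/<encoded tag>/buffer*)
def buffer_path_for(buffer_directory_path, tag)
  File.join(buffer_directory_path, URI.encode_www_form_component(tag), "buffer")
end

# At startup, recover the tags whose subdirectories still contain files.
def tags_with_buffers(buffer_directory_path)
  return [] unless Dir.exist?(buffer_directory_path)
  Dir.children(buffer_directory_path)
     .select { |name| File.directory?(File.join(buffer_directory_path, name)) }
     .reject { |name| Dir.empty?(File.join(buffer_directory_path, name)) }
     .map { |name| URI.decode_www_form_component(name) }
     .sort
end
```

Because the path is generated rather than configured, forest would know where every child's buffer lives without inspecting the child's configuration. Degenerate tags such as "." are not handled in this sketch.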

@tagomoris: another possibility: the 'forest' plugin could implement one disk-based buffer for all events. When any "child" blocks, forest blocks all events to all children.

LanceNorskog avatar Jan 19 '15 20:01 LanceNorskog

Is this being worked on currently?

BDuelz avatar Oct 12 '15 21:10 BDuelz

This problem is too hard to solve because of the design of Fluentd's buffer API. I'm trying to solve it in the brand-new buffer API design of Fluentd v0.14.

tagomoris avatar Oct 13 '15 00:10 tagomoris

How about this? Have Forest create its own status file which records every tree it plants. At startup, it looks for this file and, if it finds it, recreates every tree listed there. That way, detecting and flushing any buffers becomes the responsibility of the other plugins.

Downside is, of course, that your number of trees now only ever goes up.

macdjord avatar Aug 18 '16 15:08 macdjord
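
The status-file suggestion above can be sketched as a small append-only registry. This is a hedged illustration rather than anything forest implements: `PlantRegistry` is a hypothetical class, the one-tag-per-line file format is an assumption, and tags are assumed not to contain newlines.

```ruby
require "set"

# Append-only record of every tag forest has planted (hypothetical class).
class PlantRegistry
  def initialize(path)
    @path = path
    lines = File.exist?(path) ? File.readlines(path, chomp: true) : []
    @tags = Set.new(lines.reject(&:empty?))
  end

  # Record a tag the first time it is planted. Returns true if the tag was
  # new and written to disk, false if it was already known.
  def record(tag)
    return false unless @tags.add?(tag)
    File.open(@path, "a") do |f|
      f.puts(tag)
      f.fsync                            # make the entry survive a crash
    end
    true
  end

  # Tags to re-plant at startup.
  def tags
    @tags.to_a
  end
end
```

At startup, forest would iterate over `tags` and plant each one before accepting events. As the comment notes, nothing in this scheme ever removes an entry, so the file only grows.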