fluentd Provide certain procedures for restore of useless backup chunk files due to unrecoverable errors

Open bysnupy opened this issue 2 years ago • 0 comments

Describe the bug

Description:

This is primarily a documentation bug, but it can also lead to unexpected results from the chunk backup feature that handles unrecoverable errors.

First of all, it can make a host unstable by filling tmpfs with chunk backup files that were moved there after unrecoverable errors. The following scenario is possible; I have actually experienced it.

1. Chunk backup files keep filling the tmpfs or the backup directory.[0]
2. The host runs into problems because the backup directory runs out of space or, when it is on tmpfs, virtual memory is exhausted.

To mitigate this issue, we need to periodically restore the chunk backup files and then remove them. However, the chunk backup files are binary, so they are difficult to restore as they are. In addition, the official Fluentd docs do not describe any procedure for restoring a chunk backup. As a result, the chunk backup files are useless: they only consume host resources, which can cause further trouble.

Q. How can backup chunk files created by unrecoverable errors be restored?
   Could you please provide a documented procedure for restoring the chunk backup files?

[0] Handling Unrecoverable Errors

If these kinds of fatal errors occur, Fluentd will abort the chunk immediately and move it into secondary or the backup directory.

To Reproduce

For example,

If you use the cloudwatch_logs output plugin (v0.14.2+) and keep generating log events larger than 256 KB, which is the CloudWatch hard limit on message size, the storage that holds the backup directory eventually runs out. You then see an out-of-space problem (or out-of-memory, if the backup directory is on tmpfs) on the host running the Fluentd agent.
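To find out which records would trip this limit before they reach the output plugin, a size check along the following lines may help. This is only a minimal Python sketch, not the plugin's own logic: `MAX_EVENT_BYTES` is assumed to be 256 KiB based on the limit mentioned above, and `exceeds_limit` is a hypothetical helper name.

```python
# Assumption: the limit applies to the UTF-8 encoded message and is 256 KiB,
# as described in this issue. The real plugin may also count per-event
# overhead, which this sketch ignores.
MAX_EVENT_BYTES = 256 * 1024


def exceeds_limit(message: str, limit: int = MAX_EVENT_BYTES) -> bool:
    """Return True if the UTF-8 encoded message is larger than the limit."""
    return len(message.encode("utf-8")) > limit


if __name__ == "__main__":
    small = "x" * 1024
    large = "x" * (MAX_EVENT_BYTES + 1)
    print("small exceeds limit:", exceeds_limit(small))
    print("large exceeds limit:", exceeds_limit(large))
```

Note that the check counts bytes, not characters, so a message with multi-byte characters can exceed the limit with fewer characters than an ASCII-only one.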

Expected behavior

To mitigate the above problem without losing any logs, we need a way to restore the chunk backup files before removing them.

1. Check whether there are enough resources to store the chunk backup files in the backup directory.
2. If free space is running out, restore the chunk backup files following a documented procedure in the Fluentd docs before removing them.
3. Remove the chunk backup files to reclaim storage/memory.

Currently, we need the "2." solution.
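For reference, the first and third steps could be scripted roughly as below. This is a hedged Python sketch under several assumptions: the backup directory path is passed in by the caller, backup chunk files are assumed to end in `.log`, and `needs_cleanup`, `list_backup_chunks`, and `remove_chunks` are hypothetical helper names. The restore step is deliberately left out, because no documented procedure for it exists yet.

```python
import shutil
from pathlib import Path


def needs_cleanup(backup_dir, min_free_ratio=0.2):
    """Step 1: return True if the filesystem holding backup_dir is low on space."""
    usage = shutil.disk_usage(backup_dir)
    if usage.total == 0:
        return True
    return (usage.free / usage.total) < min_free_ratio


def list_backup_chunks(backup_dir, pattern="*.log"):
    """List backup chunk files under backup_dir, oldest first.

    Assumption: backup chunk files match "*.log"; adjust for your setup.
    """
    files = [p for p in Path(backup_dir).rglob(pattern) if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def remove_chunks(paths):
    """Step 3: delete chunk files that have already been restored elsewhere."""
    removed = []
    for path in paths:
        Path(path).unlink()
        removed.append(Path(path))
    return removed
```

A periodic job could call `needs_cleanup`, restore the files returned by `list_backup_chunks` by whatever means is available, and only then pass them to `remove_chunks`.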

Your Environment

- Fluentd version:
  1.14.6
- TD Agent version:
  N/A
- Operating system:
  RHEL8
- Kernel version:
  4.18.0-348

Your Configuration

This issue does not depend on a particular fluentd.conf. It depends on the Fluentd behavior described in [0] instead.

[0] Handling Unrecoverable Errors

If these kinds of fatal errors occur, Fluentd will abort the chunk immediately and move it into secondary or the backup directory.

Your Error Log

Any Fluent::UnrecoverableError log is related to this issue.

https://github.com/fluent-plugins-nursery/fluent-plugin-cloudwatch-logs/blob/7287d1ae78b24e3fb74aee8d3830a65ecd89f65d/lib/fluent/plugin/out_cloudwatch_logs.rb#L382

For example, when cloudwatch_logs is used as an output plugin, the following error message is logged.

"Log event in #{group_name} is discarded because it is too large: #{event_bytesize} bytes exceeds limit of #{MAX_EVENT_SIZE}"

Additional context

No response

Jul 09 '22 11:07 bysnupy
