Fluentd blocked by 'Too many open files' error, triggered by prolonged downstream failure
Describe the bug
The bug is that Fluentd appears to try to open all existing buffer files simultaneously before attempting to flush existing logs or accept new logs (i.e. create new buffer files).
This is a problem because, if the downstream component that Fluentd is sending logs to stays down for long enough, the number of local buffer files keeps increasing until it exceeds the maximum number of files Fluentd is allowed to have open simultaneously (i.e. LimitNOFILE). At that point, Fluentd throws 'Too many open files' errors, e.g.:
Connection error Errno::EMFILE: Too many open files - getaddrinfo
Connection error Errno::EMFILE: Too many open files - socket(2)
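One way to confirm this failure mode is to compare the worker process's file-descriptor usage against its limit. The commands below are just a sketch -- the PID and the buffer path are placeholders that depend on how Fluentd/td-agent is installed and configured:

pgrep -af fluentd                               # td-agent runs a supervisor plus worker process(es)
grep 'open files' /proc/<PID>/limits            # effective soft/hard limit (LimitNOFILE) of the worker
ls /proc/<PID>/fd | wc -l                       # file descriptors currently open by the worker
find /var/log/td-agent/buffer -type f | wc -l   # buffer files on disk (example path)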
When that occurs, it looks like this is what happens:
- Even if the downstream component eventually recovers, Fluentd will never be able to flush its existing buffered logs! This is because, before it can even get to the flushing stage, it keeps trying, and failing, to open every single buffer file at the same time.
- Similarly, all new logs sent to Fluentd will be lost, even if Fluentd has enough disk space for new buffer files! This is because, before it can even get to the 'create new buffer file' stage, it keeps trying, and failing, to open every single existing buffer file at the same time.
To Reproduce
I have a Fluentd agent that sends logs to Kafka. To prevent log loss in case Kafka goes down, I assigned it 500 GB of disk space for its buffer. Recently, that downstream Kafka went down for a long time (roughly 12 hours). Despite having 500 GB available, Fluentd stopped creating new buffer files after using only 168 GB, and it no longer appeared to be attempting to flush the existing buffer files. In other words, even though Fluentd had plenty of buffer disk space left, it neither created new buffer files nor flushed the existing ones -- the buffered logs were effectively trapped in Fluentd.
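(For context, a size-capped file buffer like the one described here is typically configured along the following lines. This is an illustrative sketch rather than my exact configuration -- the output plugin, path, and chunk size are examples:)

<match kafka.**>
  @type kafka2
  # ... broker settings ...
  <buffer>
    @type file
    path /var/log/td-agent/buffer/kafka   # example path
    total_limit_size 500GB                # overall cap on buffered data
    chunk_limit_size 8MB                  # example; each chunk is stored as a .log plus a .log.meta file
  </buffer>
</match>

With such a configuration, the number of chunk files on disk is governed by total_limit_size and chunk_limit_size, not by LimitNOFILE, so nothing prevents it from growing past the file-descriptor limit.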
Based on the 'Too many open files' error messages, I eventually realized that this was because the number of existing local buffer files (~392,700) was far higher than Fluentd's LimitNOFILE value at the time (65,536).
Thankfully, my system's hard limit (ulimit -Hn) was even greater than the number of buffer files, so I was able to 'fix' the issue by manually raising Fluentd's LimitNOFILE value.
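For reference, on a systemd-managed td-agent install, that kind of manual raise is usually done with a drop-in override along these lines (the limit value is illustrative -- it just needs to exceed the worst-case number of buffer files):

sudo systemctl edit td-agent
# in the editor that opens, add:
[Service]
LimitNOFILE=500000
# then apply it:
sudo systemctl daemon-reload
sudo systemctl restart td-agent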
Other, less fortunate people have apparently had to give up and throw away their buffered logs to make Fluentd usable again (see #1612), which of course defeats the whole point of having a buffering mechanism in the first place: the buffer, which is meant to provide resiliency and prevent log loss, ends up acting as a liability and effectively causing log loss by blocking Fluentd.
Expected behavior
Fluentd should not naively assume that the number of local buffer files will always be below LimitNOFILE and insist on opening every single buffer file at the same time. Instead, for safety, it should open them in batches. E.g. in my case, Fluentd could have opened the first 65,536 buffer files, processed them, closed them, and then opened the next 65,536, instead of trying to open all 392,700 buffer files in one shot.
Your Environment
- td-agent version: 3.7.0
- Operating system: Ubuntu 18.04.4 LTS
- Kernel version: 4.15.0-88-generic
This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days.
FYI, I believe this is similar to issue #3341 from the Fluent Bit repo.
Hmm, though it varies depending on the configuration, it may be useful if we can estimate the number of file descriptors to some extent from the given parameters.
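As a rough back-of-the-envelope version of that estimate (my own math, assuming a file buffer keeps one .log plus one .log.meta file per chunk and that all of them can end up open at once during resume):

files_on_disk  ~=  2 * buffered_bytes / chunk_size
392,700 files  ->  ~196,350 chunks; 168 GB / 196,350  ~=  0.9 MB average per chunk
so with that chunk size, anything beyond roughly (65,536 / 2) * 0.9 MB  ~=  29 GB of buffered data can exhaust a 65,536 LimitNOFILE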
I think this is related to https://github.com/fluent/fluentd/issues/3610. Seeing when that issue was created gives me very little hope that this will be fixed soon...