opentelemetry-collector-contrib
[receiver/filelogreceiver] Add maxSurgeSize
Description:
Adding a maximum surge size parameter to the stanza/fileconsumer reader. The primary goal is to prevent lagging behind a log file whose most recent entries have grown beyond the maxSurgeSize. For example, if 5 MB of data has been written to the log since the last read and maxSurgeSize is set to 5 MB, we skip ahead and set the offset to the end of the file. See the image below for reference.
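A minimal sketch of the intended skip-ahead behavior (the type, field, and method names below are illustrative, not the actual stanza/fileconsumer API):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// reader is an illustrative stand-in for the stanza/fileconsumer reader;
// the real type has different fields and names.
type reader struct {
	Offset       int64 // current read offset into the file
	maxSurgeSize int64 // 0 disables the surge check
}

// maybeSkipSurge moves the offset to the end of the file when the unread
// backlog since the last poll reaches maxSurgeSize, dropping the surge.
func (r *reader) maybeSkipSurge(file *os.File) error {
	info, err := file.Stat()
	if err != nil {
		return err
	}
	if backlog := info.Size() - r.Offset; r.maxSurgeSize > 0 && backlog >= r.maxSurgeSize {
		r.Offset = info.Size() // skip ahead; the next read starts at the end
	}
	return nil
}

func main() {
	f, err := os.CreateTemp("", "surge-*.log")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	f.WriteString(strings.Repeat("x", 2048)) // simulate 2 KB written since the last poll

	r := &reader{Offset: 0, maxSurgeSize: 1024} // 1 KB limit for the demo
	if err := r.maybeSkipSurge(f); err != nil {
		panic(err)
	}
	fmt.Println("offset after check:", r.Offset) // prints 2048: the backlog was skipped
}
```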
Link to tracking Issue: <Issue number if applicable>
Testing:
Two unit tests in the stanza/fileconsumer reader
Documentation:
Extended the README of the filelogreceiver
Questions:
- Would this affect the reading mechanism when the file is compressed?
- Should I add this behind a feature gate?
If I'm understanding correctly, the idea is to provide a mechanism which would trigger skipping over logs to reach the end of the file immediately. Can you give me more details about this use case where this is needed?
@djaglowski yes correct.
A use case is as follows:
Given a filelogreceiver with a poll_interval of 10s and a max_surge_size of 10 MB, watching myservice.log:
- t1 = 10s: 2 KB were written // ok
- t2 = 10s: 5 KB were written // ok
- t3 = 10s: 20 MB were written // this is too much log data and we want to skip it
- t4 = 10s: 1 MB were written // ok
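For illustration, a minimal sketch of the per-poll decision this scenario implies, assuming the backlog is simply the number of bytes written since the last poll:

```go
package main

import "fmt"

func main() {
	const maxSurgeSize = 10 * 1024 * 1024 // 10 MB, as in the scenario above

	// Bytes written between polls at t1..t4.
	backlogs := []int64{2 * 1024, 5 * 1024, 20 * 1024 * 1024, 1024 * 1024}
	for i, backlog := range backlogs {
		if backlog >= maxSurgeSize {
			fmt.Printf("t%d: %d bytes pending -> exceeds max_surge_size, skip to end of file\n", i+1, backlog)
		} else {
			fmt.Printf("t%d: %d bytes pending -> read normally\n", i+1, backlog)
		}
	}
}
```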
I would like to understand the reason why there is too much log data. Typically a burst of data could indicate a problem, which I would think users want to understand fully. Alternatively, they can tune down the amount of logs generated, or right-size the collector to consume more data and then filter out what isn't interesting.
Hello @djaglowski, this is a protection mechanism. In the past we have seen specific log files grow by hundreds of MB in a few seconds because of wrongly configured loggers or badly implemented retry mechanisms, and we know this will happen again. This optional feature aims at protecting the collector from that scenario in terms of CPU, memory, and latency, as we target near-real-time monitoring.
@overmeulen, thanks for explaining further. I can appreciate the motivation but I'm not convinced this is the approach we should provide.
Here are a few reasons I don't like the idea of skipping directly to the end of a file:
- Skipping directly to the end of a file provides no guarantee that the new starting point aligns with the start of a log entry. This feels quite crude and may have unintended impacts downstream.
- This approach also feels crude in the sense that even if we were to skip ahead, wouldn't the user still want to read some of the most recent logs?
- File rotation settings can be used to constrain the amount of data preserved. Even if they are not used specifically for this purpose, skipping to the end of a file can easily be ineffective because of file rotation settings. For example, if your skip ahead limit is 5 MB, then any file rotation config which rotates at less than 5 MB can still easily bypass the protective benefit of this design. e.g. 100 x 1 MB files will not be limited by this mechanism.
- Often when there is a surge of logs it is the result of a problem occurring, in which case this is exactly the scenario where I would most want to ensure all logs are delivered. Constraining the load this places on a system is an orthogonal concern.
- I can't find any other examples of logging agents which rely on this mechanism. I've checked Fluent Bit, Vector, and Elastic Beats, but please let me know if I missed an example.
Ultimately, I would be much more receptive to placing limits on the number or total size of logs read during a poll interval. This should effectively limit the load placed on the system when processing bursts while also avoiding loss of data. If there is a situation where this collector generally cannot keep up with the amount of logs written, that is a sizing issue and I don't believe we should try to solve it here.
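For illustration, a minimal sketch of such a per-poll cap (the function and parameter names are hypothetical, not part of the fileconsumer API); entries beyond the cap simply wait for the next poll instead of being dropped:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// readPoll caps how many bytes of log entries are consumed during a single
// poll. Whatever is not read stays at the saved offset and is picked up on
// the next poll, so no entries are lost.
func readPoll(s *bufio.Scanner, maxBytesPerPoll int, emit func(string)) {
	read := 0
	for s.Scan() {
		emit(s.Text())
		read += len(s.Bytes())
		if maxBytesPerPoll > 0 && read >= maxBytesPerPoll {
			break // stop for this poll; resume from here on the next one
		}
	}
}

func main() {
	logs := strings.NewReader("entry-1\nentry-2\nentry-3\n")
	s := bufio.NewScanner(logs)
	// With a 10-byte budget only the first two entries are emitted this poll;
	// entry-3 is left for the next poll.
	readPoll(s, 10, func(line string) { fmt.Println(line) })
}
```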
> Ultimately, I would be much more receptive to placing limits on the number or total size of logs read during a poll interval.
Setting a limit on the total entries to be read in one poll will just leave the remaining entries for the next poll. Hence, you'll lag ever further behind. Would you accept just a warning when the limit is reached?
```go
// Accumulate the bytes scanned during this poll and warn when the
// configured limit is exceeded.
entriesSize = entriesSize + len(s.Bytes())
if r.maxSurgeSize > 0 && entriesSize >= r.maxSurgeSize {
	r.set.Logger.Warn("Log entries' size exceeded max surge size")
}
```
> Hence, you'll lag ever further behind.
Right, but this is an indication of an ongoing sizing issue, not a one-off exception that we should handle.
> Would you accept just a warning when the limit is reached?
How about a warning that prints when a "max logs per poll" limit is reached?
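A minimal sketch of that idea, assuming a hypothetical max_logs_per_poll setting:

```go
package main

import "log"

// warnIfCapped counts the entries emitted during one poll and warns when the
// hypothetical max_logs_per_poll cap is reached; reading simply stops until
// the next poll rather than skipping ahead in the file.
func warnIfCapped(emitted, maxLogsPerPoll int) bool {
	if maxLogsPerPoll > 0 && emitted >= maxLogsPerPoll {
		log.Printf("max_logs_per_poll (%d) reached; remaining entries will be read on the next poll", maxLogsPerPoll)
		return true
	}
	return false
}

func main() {
	for emitted := 1; emitted <= 10000; emitted++ {
		if warnIfCapped(emitted, 10000) {
			break
		}
	}
}
```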
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.