
Vector could simulate back pressure when preconfigured memory limits are near

jerome-kleinen-kbc-be opened this issue 2 years ago • 4 comments

Current Vector Version

0.15.2

Use-cases

I run Vector log pipelines within pods hosted on an OpenShift environment. These pods have predefined memory limits. When Vector tries to allocate more memory than the pod is allowed to use, the OOM killer kicks in and hard-kills the pod, leading to data loss since this is not a clean shutdown. Within each pod, two files are available: one that defines the maximum memory usage and one that shows the current memory usage. When a predefined percentage is reached, Vector could apply backpressure to the sources, process the in-flight data, and release those buffers.
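On cgroup v2, the two files described above are typically `memory.max` and `memory.current` (cgroup v1 uses different file names). A minimal sketch in Rust of reading them; the function names are hypothetical, not Vector code:

```rust
use std::fs;

/// Parse a cgroup v2 memory value; the literal "max" means no limit is set.
fn parse_cgroup_bytes(s: &str) -> Option<u64> {
    let s = s.trim();
    if s == "max" { None } else { s.parse().ok() }
}

/// Fraction of the cgroup memory limit currently in use, if a limit is set.
/// Paths assume cgroup v2 mounted at the default location.
fn memory_usage_fraction() -> Option<f64> {
    let current: u64 = fs::read_to_string("/sys/fs/cgroup/memory.current")
        .ok()?
        .trim()
        .parse()
        .ok()?;
    let max = parse_cgroup_bytes(&fs::read_to_string("/sys/fs/cgroup/memory.max").ok()?)?;
    Some(current as f64 / max as f64)
}
```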

Attempted Solutions

Using tools like https://github.com/grosser/preoomkiller, it is possible to perform a clean shutdown of Vector when a given percentage of memory use is reached. The downside is that this restarts the pod.

Proposal

Before allocating new buffers, Vector could check how much memory is already in use and whether allocating an additional buffer would push memory usage over a predefined percentage of total memory. If so, Vector could simulate backpressure to the source as if the sink were unable to process logs fast enough.
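The check described in the proposal could look roughly like this (a sketch with hypothetical names, not actual Vector internals):

```rust
/// Hypothetical guard consulted before a source buffers more data.
struct MemoryGuard {
    limit_bytes: u64,
    /// Fraction of the limit we allow Vector to consume, e.g. 0.8.
    threshold: f64,
}

impl MemoryGuard {
    /// Returns true if accepting `additional` more bytes keeps usage at or
    /// under the threshold; false tells the source to apply backpressure.
    fn can_allocate(&self, used_bytes: u64, additional: u64) -> bool {
        (used_bytes + additional) as f64 <= self.limit_bytes as f64 * self.threshold
    }
}
```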

References

jerome-kleinen-kbc-be avatar Aug 19 '21 13:08 jerome-kleinen-kbc-be

Also mentioned in discord again here: https://discord.com/channels/742820443487993987/746070604192415834/879421822922260492

jszwedko avatar Aug 23 '21 18:08 jszwedko

Any thoughts about being able to set memory limit through config file?

nhlushak avatar Oct 23 '23 16:10 nhlushak

> Any thoughts about being able to set memory limit through config file?

This would be nice. It is just a difficult feature to implement as all memory allocations now become "fallible".

jszwedko avatar Oct 23 '23 21:10 jszwedko

> This would be nice. It is just a difficult feature to implement as all memory allocations now become "fallible".

The thing is, as we're finding here, all memory allocations already *are* fallible, since the OOM killer can kick in at any time :-) There are no guarantees in a world of limited resources and fluctuating workloads. This is why many such systems (e.g. databases like MySQL) put strict hard caps on the size of every single buffer. That's a pain for users to configure, but it also means extremely predictable resource usage.

It seems like you've anticipated this kind of problem by allowing sink buffers to fill up and applying backpressure in that case. But you're now finding that this backpressure isn't enough to prevent memory overconsumption caused by a pileup in earlier stages of the pipeline, which must have their own internal/implicit buffers that aren't configurable in a similar way. In our case it's the logfile source that consumes the vast majority of the memory before we OOM.

That said, I understand the difficulty of transitioning the system to work that way, and the user-experience impact. Vector already instruments the memory usage of its components for the excellent `vector top` tool, so a compromise could be a worker that wakes up periodically and, if total usage exceeds x% of the threshold, gradually applies backpressure to the pipeline component that is the top consumer of memory until usage starts to drop. It's best effort, but it should be good enough for most use cases and should approach optimal, and users can tune their safety level by adjusting the percentage.
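The per-wakeup decision of such a worker could be sketched as a pure function, with a lower "relax" bound so throttling doesn't oscillate right at the threshold (hypothetical sketch, not Vector code):

```rust
#[derive(Debug, PartialEq)]
enum Action {
    /// Apply more backpressure to the top memory consumer.
    Tighten,
    /// Keep the current amount of backpressure.
    Hold,
    /// Ease off backpressure again.
    Relax,
}

/// One watchdog tick: compare the observed usage fraction against the
/// tighten threshold and a lower relax bound (hysteresis).
fn watchdog_tick(usage: f64, threshold: f64, relax_below: f64) -> Action {
    if usage > threshold {
        Action::Tighten
    } else if usage < relax_below {
        Action::Relax
    } else {
        Action::Hold
    }
}
```

A driving loop would call this every few seconds with the latest `memory.current`/`memory.max` ratio and route `Tighten` to whichever component `vector top`-style instrumentation reports as the top consumer.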

makmanalp avatar Apr 23 '24 21:04 makmanalp