logstash-codec-netflow
Memory leak in Netflow::TemplateRegistry when cache_save_path parameter is not provided
Logstash information:
Version: 7.16.3
JVM (e.g. java -version):
openjdk version "1.8.0_342" OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~18.04-b07) OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)
OS version (uname -a if on a Unix-like system):
Description of the problem including expected versus actual behavior:
TL;DR: if cache_save_path is not provided, Netflow::TemplateRegistry never calls do_cleanup, which is responsible for purging expired entries from the Vash memory caches.
In our testing, Logstash heap memory usage increased continually until the process crashed with an out-of-memory exception. This happened roughly every four hours in our environment.
Comparing heap dumps taken within those four hours, we noticed that the memory usage of one object grew by more than 4x. (The right side is the baseline; the left is the dump from the OOM crash.)

Opening up the object, we can determine the class name from the metadata.

We trace this back to the corresponding source code:
https://github.com/logstash-plugins/logstash-codec-netflow/blob/b7df239e97883c1fb3bc1f13f1b0eba6aa4c0fed/lib/logstash/codecs/netflow.rb#L537-L553
In the heap dump screenshot, var2 and var5 correspond to the two instances of Vash used in the TemplateRegistry. In our testing, the memory usage of these two objects grew continuously.
Looking at the Vash implementation, we can see that it requires a manual cleanup call in order to release memory.
https://gist.github.com/joshaven/184837
The Vash object will forget any answer that is requested after the specified TTL. It is a good idea to manually clean things up from time to time because it is possible that you'll cache data but never again access it and therefor it will stay in memory after the TTL has expired. To clean up the Vash object, call the method: cleanup!
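To illustrate why lazy expiry alone does not release memory, here is a minimal sketch of a TTL-backed cache. TinyTtlCache and its methods are hypothetical stand-ins written for this issue; this is not the Vash source, and it only assumes the behaviour described in the quote above (entries are evicted on read, or when cleanup! is called explicitly).

```ruby
# Hypothetical stand-in for a TTL-based hash. This is NOT the Vash source;
# it only models the behaviour described in the gist's documentation.
class TinyTtlCache
  def initialize
    @data = {}
    @expires_at = {}
  end

  # Store a value with a time-to-live in seconds.
  def store(key, value, ttl)
    @data[key] = value
    @expires_at[key] = Time.now + ttl
    value
  end

  # Reading an expired key evicts it and returns nil.
  def [](key)
    expire(key) if expired?(key)
    @data[key]
  end

  # Sweep every expired entry, including ones that are never read again.
  def cleanup!
    @expires_at.keys.each { |key| expire(key) if expired?(key) }
    self
  end

  # Number of entries physically held in memory, expired or not.
  def raw_size
    @data.size
  end

  private

  def expired?(key)
    @expires_at.key?(key) && Time.now > @expires_at[key]
  end

  def expire(key)
    @data.delete(key)
    @expires_at.delete(key)
  end
end

cache = TinyTtlCache.new
1000.times { |i| cache.store("template-#{i}", "x" * 64, 0.01) }
sleep 0.05                    # let every entry pass its TTL
size_before = cache.raw_size  # expired entries are still held in memory
cache.cleanup!
size_after = cache.raw_size   # the sweep has released them
```

In this sketch, entries that are written once and never read again stay in memory until cleanup! runs, which matches the growth pattern we saw in the heap dumps.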
In TemplateRegistry, the cleanup calls for both Vash objects are made in the TemplateRegistry::do_cleanup method.
https://github.com/logstash-plugins/logstash-codec-netflow/blob/b7df239e97883c1fb3bc1f13f1b0eba6aa4c0fed/lib/logstash/codecs/netflow.rb#L661-L667
do_cleanup, in turn, is only ever called from do_persist:
https://github.com/logstash-plugins/logstash-codec-netflow/blob/b7df239e97883c1fb3bc1f13f1b0eba6aa4c0fed/lib/logstash/codecs/netflow.rb#L643-L659
However, note that on line 644, if file_path is not provided, then the do_persist function exits early, hence skipping the call to do_cleanup.
file_path can then be traced back to the cache_save_path setting in the initialization of the TemplateRegistry.
https://github.com/logstash-plugins/logstash-codec-netflow/blob/b7df239e97883c1fb3bc1f13f1b0eba6aa4c0fed/lib/logstash/codecs/netflow.rb#L67-L68
Thus, the leak occurs whenever no value is provided for cache_save_path: file_path defaults to nil, so do_persist always returns early and do_cleanup is never called.
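The control flow traced above can be sketched as follows. This is a simplified, hypothetical model rather than the plugin's code: RegistrySketch, cleanup_calls and do_persist_with_cleanup are names invented for illustration, and the do_cleanup body is a counter standing in for the cleanup calls on the two Vash caches. The second method shows one possible direction for a fix, assuming it is safe to run the cleanup before checking file_path.

```ruby
# Simplified, hypothetical model of the control flow described in this issue.
# Method names mirror the issue (do_persist, do_cleanup); bodies are stand-ins.
class RegistrySketch
  attr_reader :cleanup_calls

  def initialize(file_path = nil)
    @file_path = file_path # nil when cache_save_path is not configured
    @cleanup_calls = 0
  end

  # Mirrors the reported behaviour: the early return skips the cleanup.
  def do_persist
    return if @file_path.nil?
    do_cleanup
    # ... writing the templates to @file_path would happen here ...
  end

  # One possible change (an assumption, not the maintainers' fix):
  # clean up first, then decide whether there is anything to persist.
  def do_persist_with_cleanup
    do_cleanup
    return if @file_path.nil?
    # ... writing the templates to @file_path would happen here ...
  end

  private

  # Stand-in for calling cleanup! on both Vash caches.
  def do_cleanup
    @cleanup_calls += 1
  end
end

leaky = RegistrySketch.new            # no cache_save_path
3.times { leaky.do_persist }
leaky_calls = leaky.cleanup_calls     # cleanup never ran

patched = RegistrySketch.new          # still no cache_save_path
3.times { patched.do_persist_with_cleanup }
patched_calls = patched.cleanup_calls # cleanup ran on every cycle
```

Until the plugin changes, setting cache_save_path to a writable directory should avoid the early return, since file_path is then no longer nil.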
Steps to reproduce:
Provide logs (if relevant):