BufferedTokenizer may silently drop data when oversize input has no delimiters
Logstash information:

- Logstash version (e.g. `bin/logstash --version`): 9.2.0-SNAPSHOT
- Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker): N/A
- How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc., via command line, docker/kubernetes): N/A
Description of the problem including expected versus actual behavior:
In https://github.com/elastic/logstash/pull/17229, changes to the buffered tokenizer introduced an edge case where an oversized sequence of bytes that contains no delimiter can be silently dropped.
This occurs because size validation happens only after a token is fully accumulated, which never happens if no trailing separator is detected.
Steps to reproduce:
Plugins that use the json_lines codec:
- stdin
- tcp input
- http input
- logstash input integration
- logstash input elastic_serverless_forwarder
Of these, the logstash integration and elastic_serverless_forwarder inputs don't support configuring the decode size limit, which defaults to 512MB.
Example configuration for stdin, tcp and http inputs:
```
# http input
input { http { port => 3333 codec => json_lines { decode_size_limit_bytes => 40 target => wtf } } }

# tcp input
input { tcp { port => 3333 codec => json_lines { decode_size_limit_bytes => 20 } } }

# stdin input
input { stdin { codec => json_lines { decode_size_limit_bytes => 20 } } }
```
A couple of thoughts on this:
> In https://github.com/elastic/logstash/pull/17229 changes to the buffered tokenizer introduce an edge-case where an oversized sequence of bytes that do not include a delimiter can be silently dropped.
That's true: it silently drops the remaining fragments of a token that hasn't yet reached a delimiter. Suppose the delimiter is `\n` and the size limit is 10, and the tokenizer receives 2 fragments:

- `01234567890` (11 chars)
- `AAAAA`

The accumulator stores `01234567890` but silently drops `AAAAA`. The next invocation doesn't throw an exception for the `01234567890` fragment either, because the accumulator contains no delimiter.
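The failure mode above can be sketched as follows. This is a simplified stand-in, not the actual BufferedTokenizer code from PR 17229: the class name, fields, and drop logic are illustrative, and only mirror the observed symptom (the size check fires solely when a delimiter completes a token, so fragments arriving once the accumulator is over the limit vanish without a log line or exception).

```ruby
# Simplified sketch of the reported failure mode (illustrative, not the
# real Logstash implementation).
class SketchTokenizer
  def initialize(delimiter = "\n", size_limit = 10)
    @delimiter  = delimiter
    @size_limit = size_limit
    @buffer     = +""
    @dropped    = 0
  end

  # Returns completed tokens. Oversize validation only runs on tokens
  # completed by a delimiter, never on the partial accumulator.
  def extract(data)
    if @buffer.bytesize >= @size_limit && !data.include?(@delimiter)
      @dropped += data.bytesize         # fragment silently lost, no log line
      return []
    end
    @buffer << data
    return [] unless @buffer.include?(@delimiter)  # no delimiter => no size check

    *tokens, @buffer = @buffer.split(@delimiter, -1)
    tokens.each { |t| raise "input buffer full" if t.bytesize > @size_limit }
    tokens
  end

  attr_reader :buffer, :dropped
end

t = SketchTokenizer.new
t.extract("01234567890")  # 11 bytes, over the limit, but no delimiter: no error
t.extract("AAAAA")        # silently dropped; no exception is ever raised
```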
A solution could be to artificially add a delimiter for partial tokens whose fragments already sum up over the size limit, and to log when we drop data, so that we don't stay silent in this case.
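One way to realize that suggestion is sketched below. Again the names are illustrative, not the actual BufferedTokenizer API: the accumulator size is validated on every `extract` call, so an oversize partial token fails fast (as if a delimiter had closed it) and the dropped byte count is logged instead of disappearing.

```ruby
# Hedged sketch of the proposed mitigation: validate the partial token's
# size eagerly, log the drop, and surface the overflow as an exception.
class EagerLimitTokenizer
  def initialize(delimiter = "\n", size_limit = 10)
    @delimiter  = delimiter
    @size_limit = size_limit
    @buffer     = +""
  end

  def extract(data)
    @buffer << data
    if @buffer.bytesize > @size_limit && !@buffer.include?(@delimiter)
      dropped = @buffer.bytesize
      @buffer = +""
      warn "dropping #{dropped} bytes of oversize partial token"  # not silent
      raise "input buffer full"
    end
    return [] unless @buffer.include?(@delimiter)

    *tokens, @buffer = @buffer.split(@delimiter, -1)
    tokens.each { |t| raise "input buffer full" if t.bytesize > @size_limit }
    tokens
  end
end
```

With the eager check, the second fragment from the example above triggers the exception (and a log line) instead of being discarded, and the tokenizer recovers cleanly on the next well-formed input.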