
BufferedTokenizer may silently drop data when oversize input has no delimiters

Open yaauie opened this issue 4 months ago • 2 comments

Logstash information:

Please include the following information:

  1. Logstash version (e.g. bin/logstash --version): 9.2.0-SNAPSHOT
  2. Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker) N/A
  3. How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc. Via command line, docker/kubernetes) N/A

Description of the problem including expected versus actual behavior:

In https://github.com/elastic/logstash/pull/17229, changes to the buffered tokenizer introduced an edge case in which an oversized sequence of bytes that does not include a delimiter can be silently dropped.

This occurs because size validation happens only after a token has been fully accumulated, which never happens if no trailing delimiter is detected.
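To make the failure mode concrete, here is a minimal Ruby sketch of the described logic. It is not the actual `BufferedTokenizer` implementation; the class and method names are illustrative assumptions.

```ruby
# A minimal sketch (NOT the actual Logstash implementation) of a tokenizer
# whose size check runs only on tokens completed by a delimiter.
class SketchTokenizer
  def initialize(delimiter = "\n", size_limit = 10)
    @delimiter = delimiter
    @size_limit = size_limit
    @buffer = +""
  end

  def extract(data)
    # Reported bug, modeled here: once the accumulator already exceeds the
    # limit, delimiter-less fragments are discarded with no error and no log.
    return [] if @buffer.size > @size_limit && !data.include?(@delimiter)

    @buffer << data
    parts = @buffer.split(@delimiter, -1)
    @buffer = parts.pop # trailing element is the incomplete remainder

    # Size validation happens only here, on fully accumulated tokens:
    parts.each do |token|
      raise "token exceeds #{@size_limit} bytes" if token.size > @size_limit
    end
    parts
  end
end

tok = SketchTokenizer.new("\n", 10)
tok.extract("01234567890") # 11 bytes, no delimiter: buffered, no error raised
tok.extract("AAAAA")       # silently dropped: returns [] with no exception
```

Because the delimiter never arrives, the oversized-token exception path is never reached, so the caller sees neither tokens nor errors.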

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) pipeline definition(s), settings, locale, etc. The easier you make it for us to reproduce it, the more likely that somebody will take the time to look at it.

Provide logs (if relevant):

yaauie avatar Oct 20 '25 22:10 yaauie

Plugins that use the json_lines codec:

  1. stdin
  2. tcp input
  3. http input
  4. logstash input integration
  5. logstash input elastic_serverless_forwarder

Of these, the logstash integration and elastic_serverless_forwarder inputs don't support configuring the decode size limit, which defaults to 512MB.

Example configuration for stdin, tcp and http inputs:

# http input
input { http { port => 3333 codec => json_lines { decode_size_limit_bytes => 40 target => wtf } } }
# tcp input
input { tcp { port => 3333 codec => json_lines { decode_size_limit_bytes => 20 } } }
# stdin input
input { stdin { codec => json_lines { decode_size_limit_bytes => 20 } } }

jsvd avatar Oct 21 '25 13:10 jsvd

A couple of thoughts on this:

In https://github.com/elastic/logstash/pull/17229, changes to the buffered tokenizer introduced an edge case in which an oversized sequence of bytes that does not include a delimiter can be silently dropped.

That's true: it silently drops the remaining fragments of a token that hasn't yet reached a delimiter. Suppose the delimiter is \n and size_limit is 10, and the tokenizer receives 2 fragments:

  1. 01234567890 (11 chars)
  2. AAAAA

The accumulator stores 01234567890 but silently drops AAAAA. The next invocation also doesn't throw an exception for the 01234567890 fragment, because there is no delimiter in the accumulator.

A solution could be to artificially add a delimiter for partial tokens whose fragments already sum up past the size_limit, and to log whenever we drop data, so we avoid staying silent in this case.
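One way this could look, sketched under the assumption that "artificially adding a delimiter" amounts to terminating and validating the oversized partial token as soon as the accumulator overflows. The class name and log message are hypothetical, not the actual patch.

```ruby
require "logger"

# Sketch of the proposed behavior: instead of silently discarding data,
# fail the oversized partial token as soon as the accumulator overflows,
# and log how much data was dropped. Not the actual Logstash patch.
class FixedSketchTokenizer
  def initialize(delimiter = "\n", size_limit = 10, logger = Logger.new($stderr))
    @delimiter = delimiter
    @size_limit = size_limit
    @logger = logger
    @buffer = +""
  end

  def extract(data)
    @buffer << data
    parts = @buffer.split(@delimiter, -1)
    @buffer = parts.pop # trailing element is the incomplete remainder

    if @buffer.size > @size_limit
      dropped = @buffer.size
      @buffer.clear
      # Log the drop instead of staying silent, then surface an error so
      # the caller (e.g. a codec) can tag or reject the oversized input.
      @logger.warn("dropping #{dropped}-byte partial token over size_limit #{@size_limit}")
      raise "partial token exceeds #{@size_limit} bytes"
    end

    parts.each do |token|
      raise "token exceeds #{@size_limit} bytes" if token.size > @size_limit
    end
    parts
  end
end
```

With this shape, the 01234567890 fragment from the example above fails immediately with a logged warning, and the tokenizer is left with an empty buffer, so the subsequent AAAAA fragment is processed normally rather than dropped.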

andsel avatar Dec 03 '25 13:12 andsel