
BufferedTokenizer may silently drop data when oversize input has no delimiters

Open yaauie opened this issue 4 months ago • 2 comments

Logstash information:

Please include the following information:

  1. Logstash version (e.g. bin/logstash --version): 9.2.0-SNAPSHOT
  2. Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker) N/A
  3. How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc. Via command line, docker/kubernetes) N/A

Description of the problem including expected versus actual behavior:

In https://github.com/elastic/logstash/pull/17229, changes to the buffered tokenizer introduced an edge case in which an oversized sequence of bytes that does not include a delimiter can be silently dropped.

This occurs because size validation happens only after a token has been fully accumulated, which never happens if no trailing delimiter is detected.
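To make the failure mode concrete, here is a minimal Ruby sketch of the described logic. It is not the actual `BufferedTokenizer` implementation; the class and method names are illustrative assumptions.

```ruby
# A minimal sketch (NOT the actual Logstash implementation) of a tokenizer
# whose size check runs only on tokens completed by a delimiter.
class SketchTokenizer
  def initialize(delimiter = "\n", size_limit = 10)
    @delimiter = delimiter
    @size_limit = size_limit
    @buffer = +""
  end

  def extract(data)
    # Reported bug, modeled here: once the accumulator already exceeds the
    # limit, delimiter-less fragments are discarded with no error and no log.
    return [] if @buffer.size > @size_limit && !data.include?(@delimiter)

    @buffer << data
    parts = @buffer.split(@delimiter, -1)
    @buffer = parts.pop # trailing element is the incomplete remainder

    # Size validation happens only here, on fully accumulated tokens:
    parts.each do |token|
      raise "token exceeds #{@size_limit} bytes" if token.size > @size_limit
    end
    parts
  end
end

tok = SketchTokenizer.new("\n", 10)
tok.extract("01234567890") # 11 bytes, no delimiter: buffered, no error raised
tok.extract("AAAAA")       # silently dropped: returns [] with no exception
```

Because the delimiter never arrives, the oversized-token exception path is never reached, so the caller sees neither tokens nor errors.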

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) pipeline definition(s), settings, locale, etc. The easier you make it for us to reproduce it, the more likely that somebody will take the time to look at it.

Provide logs (if relevant):

yaauie avatar Oct 20 '25 22:10 yaauie

Plugins that use the json_lines codec:

  1. stdin
  2. tcp input
  3. http input
  4. logstash input integration
  5. logstash input elastic_serverless_forwarder

Of these, the logstash integration and elastic_serverless_forwarder inputs don't support configuring the decode size limit, which defaults to 512MB.

Example configuration for stdin, tcp and http inputs:

# http input
input { http { port => 3333 codec => json_lines { decode_size_limit_bytes => 40 target => wtf } } }
# tcp input
input { tcp { port => 3333 codec => json_lines { decode_size_limit_bytes => 20 } } }
# stdin input
input { stdin { codec => json_lines { decode_size_limit_bytes => 20 } } }

jsvd avatar Oct 21 '25 13:10 jsvd

A couple of thoughts on this:

In https://github.com/elastic/logstash/pull/17229, changes to the buffered tokenizer introduced an edge case in which an oversized sequence of bytes that does not include a delimiter can be silently dropped.

That's true: it silently drops the remaining fragments of a token that hasn't yet reached a delimiter. Suppose the delimiter is \n and size_limit is 10, and the tokenizer receives 2 fragments:

  1. 01234567890 (11 chars)
  2. AAAAA

The accumulator stores 01234567890 but silently drops AAAAA. The next invocation also doesn't throw an exception for the 01234567890 fragment, because there is no delimiter in the accumulator.

A solution could be to artificially add a delimiter for partial tokens whose fragments already sum up past the size_limit, and to log whenever we drop data, so we avoid staying silent in this case.
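One way this could look, sketched under the assumption that "artificially adding a delimiter" amounts to terminating and validating the oversized partial token as soon as the accumulator overflows. The class name and log message are hypothetical, not the actual patch.

```ruby
require "logger"

# Sketch of the proposed behavior: instead of silently discarding data,
# fail the oversized partial token as soon as the accumulator overflows,
# and log how much data was dropped. Not the actual Logstash patch.
class FixedSketchTokenizer
  def initialize(delimiter = "\n", size_limit = 10, logger = Logger.new($stderr))
    @delimiter = delimiter
    @size_limit = size_limit
    @logger = logger
    @buffer = +""
  end

  def extract(data)
    @buffer << data
    parts = @buffer.split(@delimiter, -1)
    @buffer = parts.pop # trailing element is the incomplete remainder

    if @buffer.size > @size_limit
      dropped = @buffer.size
      @buffer.clear
      # Log the drop instead of staying silent, then surface an error so
      # the caller (e.g. a codec) can tag or reject the oversized input.
      @logger.warn("dropping #{dropped}-byte partial token over size_limit #{@size_limit}")
      raise "partial token exceeds #{@size_limit} bytes"
    end

    parts.each do |token|
      raise "token exceeds #{@size_limit} bytes" if token.size > @size_limit
    end
    parts
  end
end
```

With this shape, the 01234567890 fragment from the example above fails immediately with a logged warning, and the tokenizer is left with an empty buffer, so the subsequent AAAAA fragment is processed normally rather than dropped.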

andsel avatar Dec 03 '25 13:12 andsel