
Implement BufferedTokenizer to return an iterable that can verify size limit for every token emitted

Open andsel opened this issue 9 months ago • 7 comments

Release notes

Reimplements BufferedTokenizer to leverage pure Java classes instead of JRuby runtime classes.

What does this PR do?

Reimplements the BufferedTokenizer in pure Java using an iterable, while BufferedTokenizerExt becomes a thin shell around this new class.

The principal method, extract, which splits the data by separator, now returns an Iterable instead of a RubyArray. The Iterable wraps a chain of two iterators: the first (DataSplitter) accumulates data in a StringBuilder and splits it by separator; the second, applied in cascade to the first, validates that each returned token respects the sizeLimit parameter.
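The two-stage chain described above can be sketched roughly as follows. This is an illustrative sketch, not the actual Logstash source; class names mirror the description, but the real implementation differs in details.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// First stage: accumulate fragments in a StringBuilder and emit a token
// each time the separator is found.
class DataSplitter implements Iterator<String> {
    private final StringBuilder accumulator = new StringBuilder();
    private final String separator;

    DataSplitter(String separator) { this.separator = separator; }

    void append(String fragment) { accumulator.append(fragment); }

    @Override
    public boolean hasNext() { return accumulator.indexOf(separator) >= 0; }

    @Override
    public String next() {
        int idx = accumulator.indexOf(separator);
        if (idx < 0) throw new NoSuchElementException();
        String token = accumulator.substring(0, idx);
        accumulator.delete(0, idx + separator.length());
        return token;
    }
}

// Second stage, applied in cascade: verify each emitted token against the
// sizeLimit before handing it to the consumer.
class SizeLimitedIterator implements Iterator<String> {
    private final Iterator<String> tokens;
    private final int sizeLimit;

    SizeLimitedIterator(Iterator<String> tokens, int sizeLimit) {
        this.tokens = tokens;
        this.sizeLimit = sizeLimit;
    }

    @Override
    public boolean hasNext() { return tokens.hasNext(); }

    @Override
    public String next() {
        String token = tokens.next();
        if (token.length() > sizeLimit) {
            throw new IllegalStateException("input buffer full");
        }
        return token;
    }
}
```

Because the size check happens lazily as tokens are consumed, the chain never materializes the full token list up front, unlike the previous RubyArray-returning implementation.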

To remain compliant with usage patterns that expect an empty? method on the object returned from extract, like this, the extract method of BufferedTokenizerExt returns a custom Iterable adapter class that provides such a method.
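A minimal sketch of such an adapter is shown below. The class and method names are illustrative, not the PR's actual code; the point is that an Iterable can expose an emptiness check without consuming any tokens.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical adapter: an Iterable that also answers "is there anything
// to iterate?", mirroring Ruby callers that expect an `empty?` method on
// the object returned by extract.
class EmptyAwareIterable implements Iterable<String> {
    private final Iterable<String> delegate;

    EmptyAwareIterable(Iterable<String> delegate) { this.delegate = delegate; }

    @Override
    public Iterator<String> iterator() { return delegate.iterator(); }

    // Backs a Ruby-visible `empty?` check; obtains a fresh iterator so no
    // token is consumed from the one handed to the caller.
    boolean isEmpty() { return !delegate.iterator().hasNext(); }
}
```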

On the test side, the code that tested BufferedTokenizerExt was moved to test the new BufferedTokenizer, so some test classes were renamed:

  • BufferedTokenizerExtTest mostly becomes BufferedTokenizerTest, but a small BufferedTokenizerExtTest remains to cover charset conversion use cases.
  • BufferedTokenizerExtWithDelimiterTest -> BufferedTokenizerWithDelimiterTest
  • BufferedTokenizerExtWithSizeLimitTest -> BufferedTokenizerWithSizeLimitTest
  • a test used to verify the overflow condition, givenTooLongInputExtractDoesntOverflow (code ref), was removed because it is no longer applicable.

On the benchmarking side:

  • BufferedTokenizerExtBenchmark -> BufferedTokenizerBenchmark with the adaptation to the new tokenizer class.

As can be seen from the benchmark reports in the Logs section, this PR provides roughly a 6x throughput improvement over the previous implementation.

Why is it important/What is the impact to the user?

As a developer, I want the BufferedTokenizer implementation to be simpler than the existing one.

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • ~~[ ] I have made corresponding changes to the documentation~~
  • ~~[ ] I have made corresponding change to the default configuration files (and/or docker env variables)~~
  • [x] I have added tests that prove my fix is effective or that my feature works

Author's Checklist

  • [x] verify as in #16483
  • [x] verify as in https://github.com/logstash-plugins/logstash-codec-json_lines/pull/43
  • [x] verify charset encoding is respected
  • [x] exhaustive test https://buildkite.com/elastic/logstash-exhaustive-tests-pipeline/builds/1650

How to test this PR locally

Run the same tests as in:

  • #16483
  • https://github.com/logstash-plugins/logstash-codec-json_lines/pull/43; in this case, after a couple of minutes Logstash goes OOM, which is a change in behaviour that I want to discuss in #17275
  • Verify character encoding doesn't change:
    • run the following pipeline
bin/logstash -e "input { tcp { port => 1234 codec => line { charset => 'ISO8859-1' } } } output { stdout { codec => rubydebug } }"
  • use the following Ruby script to send the £ sign in Latin-1
require 'socket'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

text = "\xA3" # the £ symbol in ISO-8859-1 aka Latin-1
text.force_encoding("ISO-8859-1")
socket.puts(text)

socket.close

Related issues

  • Closes #17017

Logs

Benchmarks

The benchmarks were updated to run for 3 seconds (instead of 100 ms) and to report in milliseconds (instead of nanoseconds).

baseline

Ran with:

./gradlew jmh -Pinclude="org.logstash.benchmark.BufferedTokenizerExtBenchmark.*"
Benchmark                                                               Mode  Cnt     Score   Error   Units
BufferedTokenizerExtBenchmark.multipleTokenPerFragment                 thrpt   10   553.913 ± 6.223  ops/ms
BufferedTokenizerExtBenchmark.multipleTokensCrossingMultipleFragments  thrpt   10   222.815 ± 4.411  ops/ms
BufferedTokenizerExtBenchmark.onlyOneTokenPerFragment                  thrpt   10  1549.777 ± 9.237  ops/ms

this PR

Ran with:

./gradlew jmh -Pinclude="org.logstash.benchmark.BufferedTokenizerBenchmark.*"
Benchmark                                                            Mode  Cnt     Score     Error   Units
BufferedTokenizerBenchmark.multipleTokenPerFragment                 thrpt   10  3308.716 ± 167.549  ops/ms
BufferedTokenizerBenchmark.multipleTokensCrossingMultipleFragments  thrpt   10  1245.505 ±  52.843  ops/ms
BufferedTokenizerBenchmark.onlyOneTokenPerFragment                  thrpt   10  9468.777 ± 182.184  ops/ms

andsel avatar Mar 05 '25 11:03 andsel

This pull request does not have a backport label. Could you fix it @andsel? 🙏 To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.

mergify[bot] avatar Mar 05 '25 11:03 mergify[bot]

backport-8.x has been added to help with the transition to the new branch 8.x. If you don't need it please use backport-skip label.

mergify[bot] avatar Mar 05 '25 11:03 mergify[bot]


Hi @yaauie, with the latest changes to this PR, the append method now actively checks whether the accumulator is accumulating fragments past the sizeLimit (if sizeLimit is set). However, the sizeLimit check and the consequent exception throwing are still on the read side, in the iterator returned to consume the tokens. If you can, please 🙏 review it.
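The split of responsibilities described in this comment can be sketched roughly as follows. Names are illustrative, not the PR's actual code: the write side only flags the overflow as fragments are appended, while the exception is raised later on the read side.

```java
// Rough sketch: append() eagerly tracks whether the accumulator has grown
// past sizeLimit; the exception itself would be thrown by the reading
// iterator that consumes the tokens, not here.
class OverflowTrackingAccumulator {
    private final StringBuilder buffer = new StringBuilder();
    private final int sizeLimit;
    private boolean overflowed = false;

    OverflowTrackingAccumulator(int sizeLimit) { this.sizeLimit = sizeLimit; }

    void append(String fragment) {
        buffer.append(fragment);
        if (buffer.length() > sizeLimit) {
            overflowed = true; // flagged on the write side
        }
    }

    // The reading iterator would consult this flag and throw the
    // "input buffer full" exception on the consumer's thread.
    boolean isOverflowed() { return overflowed; }
}
```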

andsel avatar Jun 19 '25 11:06 andsel

:green_heart: Build Succeeded

History

  • :green_heart: Build #3026 succeeded 4f4093428cc6fcd0319bab86a04ffdf5157ac094
  • :green_heart: Build #3025 succeeded a05b6c4cf8b78febc4aecc6a2825e5e46d52264a
  • :green_heart: Build #3021 succeeded c9598d9aef51c0f1f9fd48ff6e2d00b851706adb
  • :broken_heart: Build #3020 failed 199fd7eaedc0149c7e3725b6bfaec5e9a217524c
  • :green_heart: Build #3016 succeeded c0342f6b2c823252b701a7153ddb15a8e05afda1
  • :green_heart: Build #2848 succeeded c25cbc4f014538416f16617060803140c748cb96

cc @andsel

elasticmachine avatar Jul 02 '25 13:07 elasticmachine

This pull request does not have a backport label. Could you fix it @andsel? 🙏 To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • If no backport is necessary, please add the backport-skip label

mergify[bot] avatar Jul 02 '25 13:07 mergify[bot]