Implement BufferedTokenizer to return an iterable that can verify size limit for every token emitted
Release notes
Reimplements BufferedTokenizer to leverage pure Java classes instead of use JRuby runtime's classes.
What does this PR do?
Reimplement the BufferedTokenizerExt in pure Java using an iterable, while the BufferedTokenizerExt is a shell around this new class.
The principal method extract, which dice the data by separator, instead of returning a RubyArray now return an Iterable which is a wrapper on a chain of a couple of Iterators.
The first iterator (DataSplitter) accumulates data in a StringBuilder and then dice the data by separator.
The second iterator, used in cascade to the first, validate that each returned token size respect the sizeLimit parameter.
To be compliant with some usage patterns which expect an empty? method to be present in the returned object from extract, like this, the extract method of the BufferedTokenizerExt return an Iterable adapter custom class with such method.
On the test side the code that tested BufferedTokenizerExt is moved to test the new BufferedTokenizer, so some test classes was renamed:
BufferedTokenizerExtTestmostly becomesBufferedTokenizerTest, but there is still a smallBufferedTokenizerExtTestremained to test charset conversion use cases.BufferedTokenizerExtWithDelimiterTest->BufferedTokenizerWithDelimiterTestBufferedTokenizerExtWithSizeLimitTest->BufferedTokenizerWithSizeLimitTest- a test used to verify overflow condition
givenTooLongInputExtractDoesntOverflow(code ref) was removed because not anymore applicable.
On the benchmarking side:
BufferedTokenizerExtBenchmark->BufferedTokenizerBenchmarkwith the adaptation to the new tokenizer class.
As can be seen by benchmark reports in Logs section, this PR provide almost an improvement of 6x against the previous implementation.
Why is it important/What is the impact to the user?
As a developer I want the BufferedTokenizer implementation is simpler than the existing.
Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- ~~[ ] I have made corresponding changes to the documentation~~
- ~~[ ] I have made corresponding change to the default configuration files (and/or docker env variables)~~
- [x] I have added tests that prove my fix is effective or that my feature works
Author's Checklist
- [x] verify as in #16483
- [x] verify as in https://github.com/logstash-plugins/logstash-codec-json_lines/pull/43
- [x] verify charset encoding is respected
- [x] exhaustive test https://buildkite.com/elastic/logstash-exhaustive-tests-pipeline/builds/1650
How to test this PR locally
Run same tests as in:
- #16483
- https://github.com/logstash-plugins/logstash-codec-json_lines/pull/43 in this case after a couple of minutes Logstash goes OOM, which is a change in behaviour that I want to discuss in #17275
- Verify character encoding doesn't change:
- run the following pipeline
bin/logstash -e "input { tcp { port => 1234 codec => line { charset => 'ISO8859-1' } } } output { stdout { codec => rubydebug } }"
- using the following script to send
£sign in latin1
require 'socket'
hostname = 'localhost'
port = 1234
socket = TCPSocket.open(hostname, port)
text = "\xA3" # the £ symbol in ISO-8859-1 aka Latin-1
text.force_encoding("ISO-8859-1")
socket.puts(text)
socket.close
Related issues
- Closes #17017
Logs
Benchmarks
The benchmarks was updated to run for 3 seconds (instead of 100 ms) and to report in milliseconds (instead of nanoseconds).
baseline
Ran with:
./gradlew jmh -Pinclude="org.logstash.benchmark.BufferedTokenizerExtBenchmark.*"
Benchmark Mode Cnt Score Error Units
BufferedTokenizerExtBenchmark.multipleTokenPerFragment thrpt 10 553.913 ± 6.223 ops/ms
BufferedTokenizerExtBenchmark.multipleTokensCrossingMultipleFragments thrpt 10 222.815 ± 4.411 ops/ms
BufferedTokenizerExtBenchmark.onlyOneTokenPerFragment thrpt 10 1549.777 ± 9.237 ops/ms
this PR
Ran with:
./gradlew jmh -Pinclude="org.logstash.benchmark.BufferedTokenizerBenchmark.*"
Benchmark Mode Cnt Score Error Units
BufferedTokenizerBenchmark.multipleTokenPerFragment thrpt 10 3308.716 ± 167.549 ops/ms
BufferedTokenizerBenchmark.multipleTokensCrossingMultipleFragments thrpt 10 1245.505 ± 52.843 ops/ms
BufferedTokenizerBenchmark.onlyOneTokenPerFragment thrpt 10 9468.777 ± 182.184 ops/ms
This pull request does not have a backport label. Could you fix it @andsel? 🙏 To fixup this pull request, you need to add the backport labels for the needed branches, such as:
backport-8./dis the label to automatically backport to the8./dbranch./dis the digit.backport-8.xis the label to automatically backport to the8.xbranch.
backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.
This pull request does not have a backport label. Could you fix it @andsel? 🙏 To fixup this pull request, you need to add the backport labels for the needed branches, such as:
backport-8./dis the label to automatically backport to the8./dbranch./dis the digit.backport-8.xis the label to automatically backport to the8.xbranch.
backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.
Quality Gate passed
Issues
0 New issues
0 Fixed issues
0 Accepted issues
Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code
Hi @yaauie with the latest changes to this PR, now the append method actively checks if the accumulator is accumulating fragments passed the sizeLimit(if sizeLimit is set). However, the sizeLimit check and consequent exception throwing is still on the read side, in the iterator returned to consume the tokens.
If you can , please 🙏 review it.
Quality Gate passed
Issues
0 New issues
0 Fixed issues
0 Accepted issues
Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code
:green_heart: Build Succeeded
- Buildkite Build
- Commit: 116288225b928916c94a00be291826559897e4ab
History
- :green_heart: Build #3026 succeeded 4f4093428cc6fcd0319bab86a04ffdf5157ac094
- :green_heart: Build #3025 succeeded a05b6c4cf8b78febc4aecc6a2825e5e46d52264a
- :green_heart: Build #3021 succeeded c9598d9aef51c0f1f9fd48ff6e2d00b851706adb
- :broken_heart: Build #3020 failed 199fd7eaedc0149c7e3725b6bfaec5e9a217524c
- :green_heart: Build #3016 succeeded c0342f6b2c823252b701a7153ddb15a8e05afda1
- :green_heart: Build #2848 succeeded c25cbc4f014538416f16617060803140c748cb96
cc @andsel
This pull request does not have a backport label. Could you fix it @andsel? 🙏 To fixup this pull request, you need to add the backport labels for the needed branches, such as:
backport-8./dis the label to automatically backport to the8./dbranch./dis the digit.- If no backport is necessary, please add the
backport-skiplabel