Allow automatic canonicalization (possibly `String.intern()`) of `JsonToken.VALUE_STRING` values
Version
2.12.4
Feature request
Existing scenario
I think the INTERN_FIELD_NAMES flag, which is turned on by default, is a great thing to have. It saves considerably on memory footprint, especially when the input is huge: imagine ~3.2 million position messages, all with the same keys such as 'portfolio' and 'book'.
Instead of keeping ~3.2 million copies of 'portfolio' on the heap while parsing each batch of messages, INTERN_FIELD_NAMES leaves only one 'portfolio' in the string pool, whether the batch holds ~3.2 million messages or more.
Changes proposed
A similar feature flag could be provided for parsing values, possibly also turned on by default.
Going back to the ~3.2 million record example: instead of ~3.2 million portfolio name strings on the heap, such a flag would leave only around 200 distinct portfolio names (like Jason, Jackson) in the string pool.
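To make the expected saving concrete, here is a minimal stand-alone sketch that uses only the JDK, not Jackson. The class `InternDemo` and its methods are made up for illustration; `new String(char[])` stands in for a parser decoding each value from its text buffer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; it does not use or mirror any Jackson API.
final class InternDemo {

    /** Counts distinct String instances by reference, not by equals(). */
    static int distinctInstances(List<String> values) {
        Set<String> byIdentity = Collections.newSetFromMap(new IdentityHashMap<>());
        byIdentity.addAll(values);
        return byIdentity.size();
    }

    /** Simulates a parser that allocates a fresh String for every value it reads. */
    static List<String> parseValues(int records, String[] names, boolean intern) {
        List<String> out = new ArrayList<>(records);
        for (int i = 0; i < records; i++) {
            // new String(char[]) always allocates, like decoding from a buffer
            String s = new String(names[i % names.length].toCharArray());
            out.add(intern ? s.intern() : s);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"PM_JASON", "PM_JACKSON"};
        // Without interning: one instance per record
        System.out.println(distinctInstances(parseValues(10_000, names, false)));
        // With interning: one instance per distinct name
        System.out.println(distinctInstances(parseValues(10_000, names, true)));
    }
}
```

Without interning, the number of instances grows with the number of records; with interning it is bounded by the number of distinct values.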
Possible changes
The feature flag could be consulted here, and the value interned if the flag is on: https://github.com/FasterXML/jackson-core/blob/2.14/src/main/java/com/fasterxml/jackson/core/util/TextBuffer.java#L797
public String setCurrentAndReturn(int len) {
    _currentSize = len;
    // We can simplify handling here compared to full `contentsAsString()`:
    if (_segmentSize > 0) { // longer text; call main method
        return contentsAsString();
    }
    // more common case: single segment
    int currLen = _currentSize;
    String str = (currLen == 0) ? "" : new String(_currentSegment, 0, currLen);
    // Proposed: INTERN_FIELD_VALUES does not exist yet, and this assumes
    // the factory feature flags are made available to TextBuffer as `_flags`
    if (JsonFactory.Feature.INTERN_FIELD_VALUES.enabledIn(_flags)) {
        str = InternCache.instance.intern(str);
    }
    _resultString = str;
    return str;
}
To add some context, here is a peek at the memory addresses of the String keys and values with the existing implementation:
18:15:58.102 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6c9fe73e0
18:15:58.102 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6c9fef010
18:15:58.102 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6c9ffbe00
18:15:58.103 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca003ad0
18:15:58.103 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca010fc0
18:15:58.103 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca018d90
18:15:58.103 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca025910
18:15:58.103 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca02d4f0
18:15:58.103 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca037e88
18:15:58.103 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca03f800
18:15:58.103 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca04a250
18:15:58.104 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca051ce8
18:15:58.104 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca05d0d8
18:15:58.104 [main] INFO c.l.zg.TestConsumerString - portfolio:PM_JASON@: 0x6c37d99a8:: 0x6ca0649b8
The keys all point to the same string in the pool, while the values, even though they are equal, are created anew on the heap for each record encountered.
I am open to this idea, but it probably requires some sort of handler to allow customizing which String values are intern()ed and/or how canonicalization is handled (possibly using other mechanisms); most likely there would be a limit on the length of Strings to intern().
Not sure what kind of interface should be used, PRs welcome.
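As a starting point for discussion, one possible shape for such a handler is sketched below. Everything here is hypothetical: `StringValueCanonicalizer` and `BoundedCanonicalizer` are not part of jackson-core, and the sketch uses a bounded LRU map rather than `String.intern()` so that both the maximum String length and the number of cached entries can be capped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical extension point; not an existing jackson-core interface. */
interface StringValueCanonicalizer {
    /** Returns a canonical instance equal to {@code value}, or {@code value} itself. */
    String canonicalize(String value);
}

/** Sketch of a handler with a length limit and a bounded, LRU-evicting cache. */
final class BoundedCanonicalizer implements StringValueCanonicalizer {
    private final int maxLength;
    private final Map<String, String> cache;

    BoundedCanonicalizer(int maxLength, final int maxEntries) {
        this.maxLength = maxLength;
        // Access-ordered LinkedHashMap evicts the least recently used entry
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // synchronized because get() reorders an access-ordered map
    @Override
    public synchronized String canonicalize(String value) {
        if (value == null || value.length() > maxLength) {
            return value; // long values are unlikely to repeat; skip them
        }
        String existing = cache.get(value);
        if (existing != null) {
            return existing; // reuse the first instance seen
        }
        cache.put(value, value);
        return value;
    }
}
```

A parser could call `canonicalize()` on each `VALUE_STRING` it produces when a handler is configured. Whether that hook belongs in `TextBuffer`, in the parser, or elsewhere is left open here.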