lucene
lucene copied to clipboard
Add a method allowing canonical strings to be returned from DataInput
Use a shared buffer for decoding short strings
I don't know what this string interning is here, but I am strongly opposed to it
This PR does not use String.intern
which was the previous concern. So what's wrong here?
It does essentially the same thing. Leaking memory on purpose into static finals.
The map isn't static.
Even if there was a static map, if it was expressly used for known static strings, then it wouldn't be a leak but just re-use of constants.
it absolutely is. Look at the fucking diff dude. You won't sneak this one past me.
-1 to this interning stuff. I remember what things were like when we made this mistake before and had to fix it.
Could you comment on the pertinent line of code please? canonicalStrings
is not static. Maybe you are thinking of the ThreadLocal? That's some temporary byte buffer.
both are an issue.
The threadlocal will leak badly, especially for cases that "churn" lots of threads (e.g. jetty defaults, e.g. java webapps). It will cause more harm than good.
the private map just leaks for the caller by unnecessarily saving strings. it's a lot of overhead to datainput to add this map. keep in mind datainput isn't closeable and has other implementations such as ByteArrayDataInput.
Sorry, I don't see advantages to this stuff.
The ThreadLocal aspect was a needless addition here; it could be reverted.
the private map just leaks for the caller by unnecessarily saving strings. it's a lot of overhead to datainput to add this map. keep in mind datainput isn't closeable and has other implementations such as ByteArrayDataInput.
Let's imagine a small improvement to the patch to lazily create the map on first use. Some usages of ByteArrayDataInput and perhaps others would never use it (e.g. SynonymGraphFilter). Are there cases you see that would lead to undesirable use of these canonical strings? That it's not closable doesn't seem to be an issue to me since DataInput should not be outliving the things it was used to read; quite the reverse, no?
I am currently reviewing the impact of these changes on performance, GC, and memory usage to lucene & elasticsearch. If this does show an improvement, I'll set this PR to ready to review. If not, I'll close it.
Sorry dsmiley, clearly you want this change, but I don't have to justify my hate for memory leaks.
For reference, this is aimed at #11711 and #11712
This is ready to review. I've removed the threadlocal byte buffer.
With testing in elasticsearch, this reduced the duplicated strings in these fields by 90%. For the linked tickets this would reduce the memory usage from these strings from 6GB to 600MB. There is no significant performance difference I can see from running this in lucenebench.
In the cases where the new method and readMapOfStrings
is not called, this would just add an empty HashMap
instance to each DataInput
instance, with the same lifetime as the DataInput.
When readCanonicalString
is called, this added a single hashmap lookup to the call compared to getString
. It also requires memory for the backing HashMap
, with the same lifecycle as the DataInput, scaling with the number of distinct strings returned by that method.
So this causes slightly more memory when deserializing, at the benefit of using drastically less memory once it is all deserialized and the DataInput & hashmap have been GCd.
Please let me know if there's other branches this should be based off/targetting
Another option is to have the FieldInfosFormat
classes handle string pooling themselves, leaving DataInput
unchanged.
I still have a problem with leaking memory on purpose. The issue here is elastic's bad design, not lucene. Otherwise you wouldnt have thousands of indexes X thousands of fields: that situation is your own creation.
I'm -1 to adding this garbage to datainput. I don't care what it does for your abuse-case.
There is somewhat a matter of taste at stake, so I can appreciate Robert's point of view that DataInput is nice the way it is without adding this to it for an edge (abuse?) case. I've seen Lucene based indexes used in a variety of ways so I'm sympathetic for finding a path forward. If this optimization can be isolated to a small place in FieldInfosFormat and also fix the problem (?), will that address your concern Robert? Maybe we need to see what it'd look like first.
I think if you get yourself in this situation via oversharding/dynamic fields/etc, then you can add -XX:+UseStringDeduplication
to your commandline, problem solved.
-XX:+UseStringDeduplication to your commandline, problem solved.
This is actually a very elegant workaround... I wonder what the benchmarks would show on those affected systems.
Unfortunately that doesn't seem to have much of an impact, from what I can see here.
@rmuir Would you be against having a string cache specifically in the relevant methods in FieldInfosFormat?
yes because it would translate as a leak for many other use-cases/applications.
To be clear, are you referring to the extra memory used by the deduplication hashmap for the duration of the deserialisation, that will then be eligible for GC after the method returns?
I've changed it to use a separate object for the string map, scoped to the deserialisation method.
I couldn't come up with a nice way to handle readMapOfStrings
to cache the key strings so far, it's much harder to do that once the map is actually created.
It looks like there are strong objections to sharing string instances here, and there is a JVM command-line flag that may achieve similar gains for many indices X segments X fields sort of use case?