lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Add a method allowing canonical strings to be returned from DataInput

Open thecoop opened this issue 2 years ago • 7 comments

Use a shared buffer for decoding short strings

thecoop avatar Oct 13 '22 10:10 thecoop

I don't know what this string interning is here, but I am strongly opposed to it

rmuir avatar Oct 13 '22 11:10 rmuir

This PR does not use String.intern which was the previous concern. So what's wrong here?

dsmiley avatar Oct 13 '22 22:10 dsmiley

It does essentially the same thing. Leaking memory on purpose into static finals.

rmuir avatar Oct 14 '22 02:10 rmuir

The map isn't static.

Even if there was a static map, if it was expressly used for known static strings, then it wouldn't be a leak but just re-use of constants.

dsmiley avatar Oct 14 '22 03:10 dsmiley

it absolutely is. Look at the fucking diff dude. You won't sneak this one past me.

-1 to this interning stuff. I remember what things were like when we made this mistake before and had to fix it.

rmuir avatar Oct 14 '22 10:10 rmuir

Could you comment on the pertinent line of code please? canonicalStrings is not static. Maybe you are thinking of the ThreadLocal? That's some temporary byte buffer.

dsmiley avatar Oct 14 '22 12:10 dsmiley

both are an issue.

The threadlocal will leak badly, especially for cases that "churn" lots of threads (e.g. jetty defaults, e.g. java webapps). It will cause more harm than good.

the private map just leaks for the caller by unnecessarily saving strings. it's a lot of overhead to datainput to add this map. keep in mind datainput isn't closeable and has other implementations such as ByteArrayDataInput.

Sorry, I don't see advantages to this stuff.

rmuir avatar Oct 14 '22 19:10 rmuir

The ThreadLocal aspect was a needless addition here; it could be reverted.

the private map just leaks for the caller by unnecessarily saving strings. it's a lot of overhead to datainput to add this map. keep in mind datainput isn't closeable and has other implementations such as ByteArrayDataInput.

Let's imagine a small improvement to the patch to lazily create the map on first use. Some usages of ByteArrayDataInput and perhaps others would never use it (e.g. SynonymGraphFilter). Are there cases you see that would lead to undesirable use of these canonical strings? That it's not closable doesn't seem to be an issue to me since DataInput should not be outliving the things it was used to read; quite the reverse, no?

dsmiley avatar Oct 15 '22 04:10 dsmiley

I am currently reviewing the impact of these changes on performance, GC, and memory usage to lucene & elasticsearch. If this does show an improvement, I'll set this PR to ready to review. If not, I'll close it.

thecoop avatar Oct 16 '22 09:10 thecoop

Sorry dsmiley, clearly you want this change, but I don't have to justify my hate for memory leaks.

rmuir avatar Oct 19 '22 10:10 rmuir

For reference, this is aimed at #11711 and #11712

thecoop avatar Nov 02 '22 11:11 thecoop

This is ready to review. I've removed the threadlocal byte buffer.

With testing in elasticsearch, this reduced the duplicated strings in these fields by 90%. For the linked tickets this would reduce the memory usage from these strings from 6GB to 600MB. There is no significant performance difference I can see from running this in lucenebench.

In the cases where the new method and readMapOfStrings is not called, this would just add an empty HashMap instance to each DataInput instance, with the same lifetime as the DataInput. When readCanonicalString is called, this added a single hashmap lookup to the call compared to getString. It also requires memory for the backing HashMap, with the same lifecycle as the DataInput, scaling with the number of distinct strings returned by that method.

So this causes slightly more memory when deserializing, at the benefit of using drastically less memory once it is all deserialized and the DataInput & hashmap have been GCd.

Please let me know if there's other branches this should be based off/targetting

thecoop avatar Nov 03 '22 10:11 thecoop

Another option is to have the FieldInfosFormat classes handle string pooling themselves, leaving DataInput unchanged.

thecoop avatar Nov 03 '22 11:11 thecoop

I still have a problem with leaking memory on purpose. The issue here is elastic's bad design, not lucene. Otherwise you wouldnt have thousands of indexes X thousands of fields: that situation is your own creation.

I'm -1 to adding this garbage to datainput. I don't care what it does for your abuse-case.

rmuir avatar Nov 03 '22 11:11 rmuir

There is somewhat a matter of taste at stake, so I can appreciate Robert's point of view that DataInput is nice the way it is without adding this to it for an edge (abuse?) case. I've seen Lucene based indexes used in a variety of ways so I'm sympathetic for finding a path forward. If this optimization can be isolated to a small place in FieldInfosFormat and also fix the problem (?), will that address your concern Robert? Maybe we need to see what it'd look like first.

dsmiley avatar Nov 03 '22 21:11 dsmiley

I think if you get yourself in this situation via oversharding/dynamic fields/etc, then you can add -XX:+UseStringDeduplication to your commandline, problem solved.

rmuir avatar Nov 03 '22 22:11 rmuir

-XX:+UseStringDeduplication to your commandline, problem solved.

This is actually a very elegant workaround... I wonder what the benchmarks would show on those affected systems.

dweiss avatar Nov 04 '22 06:11 dweiss

Unfortunately that doesn't seem to have much of an impact, from what I can see here.

@rmuir Would you be against having a string cache specifically in the relevant methods in FieldInfosFormat?

thecoop avatar Nov 08 '22 11:11 thecoop

yes because it would translate as a leak for many other use-cases/applications.

rmuir avatar Nov 08 '22 12:11 rmuir

To be clear, are you referring to the extra memory used by the deduplication hashmap for the duration of the deserialisation, that will then be eligible for GC after the method returns?

thecoop avatar Nov 08 '22 12:11 thecoop

I've changed it to use a separate object for the string map, scoped to the deserialisation method.

I couldn't come up with a nice way to handle readMapOfStrings to cache the key strings so far, it's much harder to do that once the map is actually created.

thecoop avatar Nov 16 '22 11:11 thecoop

It looks like there are strong objections to sharing string instances here, and there is a JVM command-line flag that may achieve similar gains for many indices X segments X fields sort of use case?

mikemccand avatar Nov 02 '23 12:11 mikemccand