overflowdb
overflowdb copied to clipboard
use regular jdk String.intern()
I'm not sure why we didn't use String.intern()
from the start.
This PR doesn't only simplify our codebase, but saves us the
internedStrings
ConcurrentHashMap. The heap size for a large CPG
(linux 4 kernel) goes down from 35830m to 35260m, i.e. 570m, or 1.6%.
cf https://shipilev.net/jvm/anatomy-quarks/10-string-intern/
The tl;dr is that String.intern()
is unfortunately sharing tables with JVM-internals (e.g. strings from class-file constant pools) and is probably inappropriate; and unfortunately there is no ready-made alternative in the standard library.
However, I'm open to being convinced by benchmarks. A critical question (in context of mesh) would be impact on GC times and memory consumption if we do multiple scans without restarting the JVM in the middle.
Do you really think that String.intern()
is on the hot path for mesh? I very much doubt it...
Apart from mesh being closed source, the entire stack is dockerised. So in order to create those benchmarks I'd need to jump through a few hoops, and it's only reproducable for SL interns.
How about some scans on joern instead? I don't want to make a scientific paper out of it, just a few observations re GC and memory usage for multiple scans on the same JVM instance - sounds good?
Sounds like we want to keep things the way they are. Sadly there is no date on the linked article but i would also want to see some benchmarks before continuing with this.
i would also want to see some benchmarks before continuing with this.
would you be happy with the one described in my comment above?
How about some scans on joern instead?
Sounds good to me. Just run a couple different scans after another. My main point about mesh was that we must include multiple scans after each other without jvm restart, because string.intern() uses special stuff and we pay not just for the call to string.intern but also distributed over future GC pauses.
However, it sounds like the general consensus is "never ever use String.intern()
, always use custom interner".
I would not be opposed to writing a more appropriate hash-table for string interning, though. How did you identify the interning as a problem? ("more appropriate": We can relax some requirements inherent to ConcurrentHashMap
, so I would not be surptrised if we can do better for our specific use case)
How did you identify the interning as a problem?
When I opened up a large database heapdump (CPG with linux 4 kernel), the StringInterner stood out as one of the big blobs.