BlackLab
TermsReader scalability, performance and documentation
If all unique terms combined total more than 2 GB of character data, TermsReader will break. See TermsReader:
// FIXME this code breaks when char term data total more than 2 GB
// (because offset will overflow)
int offset = termCharData.length * Integer.MAX_VALUE; // set to beginning of current array
Our large corpus hasn't reached this size yet, but it will if it grows about 3-4x larger, so we should fix this before then.
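For illustration, a minimal sketch of the kind of fix needed (not the actual TermsReader code; BLOCK_SIZE, blockIndex and offsetWithinBlock are made-up names): compute the offset as a long, so the multiplication can no longer overflow an int once the total passes 2 GB.

static final int BLOCK_SIZE = Integer.MAX_VALUE - 8; // safe maximum array size on current JVMs

static long charOffset(int blockIndex, int offsetWithinBlock) {
    // cast before multiplying; blockIndex * BLOCK_SIZE computed as an int would overflow
    return (long) blockIndex * BLOCK_SIZE + offsetWithinBlock;
}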
It might be nice to add a few more comments to this code explaining why things are implemented this way (from a more bird's-eye view).
Integer.MAX_VALUE is also not a safe maximum for allocating an array. Use BlackLab.JAVA_MAX_ARRAY_SIZE instead (which is set to Integer.MAX_VALUE - 8 and should be safe on current JVMs).
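A rough sketch of what block allocation with that safe maximum could look like (hypothetical helper, not BlackLab's actual API; the local constant mirrors BlackLab.JAVA_MAX_ARRAY_SIZE):

static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

static byte[][] allocateCharBlocks(long totalBytes) {
    // split the character data over as many arrays as needed, none larger than the safe maximum
    int numBlocks = (int) ((totalBytes + MAX_ARRAY_SIZE - 1) / MAX_ARRAY_SIZE);
    byte[][] blocks = new byte[numBlocks][];
    long remaining = totalBytes;
    for (int i = 0; i < numBlocks; i++) {
        blocks[i] = new byte[(int) Math.min(remaining, MAX_ARRAY_SIZE)];
        remaining -= blocks[i].length;
    }
    return blocks;
}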
I'm also wondering if it's really necessary to read the terms from the terms file, decode them into String objects, then convert them to byte arrays again. I guess the String versions are used elsewhere, but maybe they could be eliminated to further speed up startup.
And we essentially seem to be manually mapping the terms file into memory; would regular memory mapping work too? Of course that would mean the data could be evicted from the disk cache if not used for a while, but that might be okay? Term strings don't seem to be used in a time-critical way; they're mostly just needed for display.
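To sketch what that might look like (assumptions: the class and file layout below are made up, and each term's byte offset and length would come from elsewhere; note also that a single MappedByteBuffer can only cover up to 2 GB, so a larger terms file would need several mappings):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map the terms file once and decode a term to String only when it's actually needed (e.g. for display).
class MappedTerms implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer data;

    MappedTerms(Path termsFile) throws IOException {
        channel = FileChannel.open(termsFile, StandardOpenOption.READ);
        data = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }

    String term(int byteOffset, int byteLength) {
        byte[] bytes = new byte[byteLength];
        ByteBuffer view = data.duplicate(); // independent position, so reads don't interfere with each other
        view.position(byteOffset);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}

The OS page cache would then decide which parts stay resident, which fits the observation that term strings are mostly needed for display rather than in hot code paths.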
Just making a note that the array size issue from the OP was fixed in https://github.com/INL/BlackLab/commit/5d61a17f35117b4b4351834be377058f5d97a298
As our focus is on the integrated index format, and getting rid of global terms and forward index code as much as possible, I'm closing this now.