BlackLab
TermsReader scalability, performance and documentation
If all unique terms combined total more than 2 GB of character data, TermsReader will break. See TermsReader:
// FIXME this code breaks when char term data total more than 2 GB
// (because offset will overflow)
int offset = termCharData.length * Integer.MAX_VALUE; // set to beginning of current array
Our large corpus hasn't reached this size yet, but it will if it grows about 3-4x larger, so we should fix this before then.
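For illustration, a minimal sketch of the kind of fix needed (not the actual TermsReader code; BLOCK_SIZE, blockIndex and offsetWithinBlock are made-up names): compute the offset as a long, so the multiplication can no longer overflow an int once the total passes 2 GB.

static final int BLOCK_SIZE = Integer.MAX_VALUE - 8; // safe maximum array size on current JVMs

static long charOffset(int blockIndex, int offsetWithinBlock) {
    // cast before multiplying; blockIndex * BLOCK_SIZE computed as an int would overflow
    return (long) blockIndex * BLOCK_SIZE + offsetWithinBlock;
}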
It might be nice to add a few more comments to this code explaining why things are implemented this way (from a more bird's-eye view).
Integer.MAX_VALUE is also not a safe maximum for allocating an array. Use BlackLab.JAVA_MAX_ARRAY_SIZE instead (which is set to Integer.MAX_VALUE - 8 and should be safe on current JVMs).
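A rough sketch of what block allocation with that safe maximum could look like (hypothetical helper, not BlackLab's actual API; the local constant mirrors BlackLab.JAVA_MAX_ARRAY_SIZE):

static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

static byte[][] allocateCharBlocks(long totalBytes) {
    // split the character data over as many arrays as needed, none larger than the safe maximum
    int numBlocks = (int) ((totalBytes + MAX_ARRAY_SIZE - 1) / MAX_ARRAY_SIZE);
    byte[][] blocks = new byte[numBlocks][];
    long remaining = totalBytes;
    for (int i = 0; i < numBlocks; i++) {
        blocks[i] = new byte[(int) Math.min(remaining, MAX_ARRAY_SIZE)];
        remaining -= blocks[i].length;
    }
    return blocks;
}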
I'm also wondering if it's really necessary to read the terms from the terms file, decode them into String objects, then convert them to byte arrays again. I guess the String versions are used elsewhere, but maybe they could be eliminated to further speed up startup.
And we essentially seem to be manually mapping the terms file into memory; would regular memory mapping work too? Of course that would mean the data could be evicted from the disk cache if not used for a while, but that might be okay? Term strings don't seem to be used in a time-critical way; they're mostly just needed for display.
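To sketch what that might look like (assumptions: the class and file layout below are made up, and each term's byte offset and length would come from elsewhere; note also that a single MappedByteBuffer can only cover up to 2 GB, so a larger terms file would need several mappings):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map the terms file once and decode a term to String only when it's actually needed (e.g. for display).
class MappedTerms implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer data;

    MappedTerms(Path termsFile) throws IOException {
        channel = FileChannel.open(termsFile, StandardOpenOption.READ);
        data = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }

    String term(int byteOffset, int byteLength) {
        byte[] bytes = new byte[byteLength];
        ByteBuffer view = data.duplicate(); // independent position, so reads don't interfere with each other
        view.position(byteOffset);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}

The OS page cache would then decide which parts stay resident, which fits the observation that term strings are mostly needed for display rather than in hot code paths.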
Just making a note that the array size issue from the OP was fixed in https://github.com/INL/BlackLab/commit/5d61a17f35117b4b4351834be377058f5d97a298
As our focus is on the integrated index format, and getting rid of global terms and forward index code as much as possible, I'm closing this now.