lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Move synonym map off-heap for SynonymGraphFilter

Open msfroh opened this issue 1 year ago • 14 comments

Description

This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.

The initial implementation is rough, but the unit tests pass with this change randomly enabled.

Obvious things that need work are:

  1. I tried to do something like a codec, but not really a codec for the synonym map files. For a solution that could evolve over time, we should probably at least write something to the metadata file saying what format was used.
  2. Right now it makes no effort to detect changes to the synonym files. I would suggest that SynonymGraphFilterFactory rebuild the directory if a checksum of the input files doesn't match a value recorded in the metadata file.
  3. I don't think I like the random seeks in OffHeapBytesRefHashLike, but I don't see an alternative (besides moving it on-heap). Given that the original issue was only about moving the FST off-heap, maybe we can keep the word dictionary on-heap.

Fixes #13005

msfroh avatar Jan 30 '24 08:01 msfroh

I think measuring latency impact would also be important for this change as off-heap will be slower (I understand that we are only adding off-heap as an option here, not forcing, but still).

dungba88 avatar Jan 31 '24 04:01 dungba88

I did some rough benchmarks using the large synonym file attached to https://issues.apache.org/jira/browse/LUCENE-3233

The benchmark code and input is at https://github.com/msfroh/lucene/commit/174f98a91e8709ee66dc2ed84b5c0b54e1a10635

Attempt On-heap load Off-heap load Off-heap reload On-heap process Off-heap process Off-heap reload process
1 1146.022381 1117.004359 4.099065 569.120851 656.430684 613.475144
2 1079.578922 1060.926854 1.761465 456.203168 655.596275 622.534246
3 1035.911388 1076.611629 1.750233 579.41094 655.955431 614.788388
4 1037.825728 1085.513933 2.074129 696.390519 688.664985 613.266972
5 1017.489384 1008.209808 1.717748 485.510526 620.800148 620.708538
6 1014.641653 1024.412669 1.740371 483.617261 619.696259 619.910897
7 1027.691397 1045.129567 1.727786 670.49456 622.48759 616.303549
8 984.005971 1009.265777 1.736832 513.543926 615.448442 613.06279
9 1027.841112 1027.057453 1.732985 486.502644 622.535269 620.285635
10 981.689573 1074.613506 1.71059 707.810107 613.417977 624.34832
11 1026.165712 1065.3181 1.689407 479.610417 621.454353 616.183786
12 994.949905 1046.898091 1.730394 498.938696 612.279425 619.965444
13 1035.144288 1043.119169 1.739726 472.821155 619.267425 613.029508
14 996.056368 1017.663948 1.699742 692.135015 619.725163 620.454352
15 1046.605644 1018.287866 1.713526 470.391592 619.723699 612.068366
16 1007.579733 1042.062818 1.70251 508.481346 619.481298 619.178419
17 1038.166702 1054.039165 1.683814 485.439337 620.901934 616.017789
18 1000.900448 1058.492139 1.7267 515.185816 622.204031 627.560895
19 1236.416447 1080.877889 1.643654 434.73928 624.825435 625.622426
20 997.663619 1038.478411 1.657257 497.232157 623.337627 620.943519
Mean 1036.617319 1049.699158 1.8518967 535.1789657 628.7116725 618.4854492
Stddev 59.71799264 28.44516049 0.535792004 86.95026923 19.55324941 4.52695571

So, it looks like the time to load synonyms is mostly unchanged (1050ms versus 1037ms), and loading "pre-compiled" synonyms is super-duper fast.

We do seem to take a 17.5% hit on processing time. (629ms versus 535ms.) I might try profiling to see where that time is being spent. If it's doing FST operations, I'll assume it's a cost of doing business. If it's spent loading the also off-heap output words, I might consider moving those (optionally?) back on heap.

msfroh avatar Feb 01 '24 03:02 msfroh

I decided to try experimenting with moving the output words back onto the heap, since I didn't like the fact that every word lookup was triggering a seek.

Running now, I got way less variance on the on-heap runs. I also added some GCs between iterations, since I wanted to measure the heap usage of each. That likely removed some GC pauses from the on-heap version.

I then switched back to the off-heap words to confirm the results that I saw last time (and compare against the implementation with on-heap words).

The conclusion seems to be roughly:

  • Existing on-heap FST averages about 444ms to process a lot of synonyms.
  • Off-heap FST with on-heap words averages 515 or 516ms. (About 16% slower than existing on-heap.)
  • Off-heap FST with off-heap words averages 620ms. (About 40% slower than existing on-heap.)

The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB.

With these trade-offs, I think off-heap FST with on-heap words may be a good choice for folks with large sets of synonyms. I don't think I would recommend off-heap FST with off-heap words.

Attempt OnHeap FST load time OffHeap FST (OnHeap words) load time OffHeap FST (OnHeap words) reload time OnHeap FST processing time OffHeap FST (OnHeap words) processing time OffHeap FST (OnHeap words) reloaded processing time OffHeap FST (OffHeap words) processing time OffHeap FST (OffHeap words) reloaded processing time
1 1191.339685 1072.285824 9.669646 436.391631 520.550704 516.11297 623.451546 620.531215
2 1030.432454 1033.619768 8.874105 448.848403 516.784387 517.230739 621.522464 622.793343
3 984.83645 1037.807342 8.912252 443.789813 512.066535 517.716981 622.455444 620.468985
4 1049.63589 1048.60113 8.894401 449.237547 518.946226 516.868933 617.837364 616.810236
5 990.22176 1049.618665 8.861166 448.923912 512.559801 511.114898 616.555422 617.122551
6 978.41877 1063.824595 8.930418 440.251675 517.632376 518.175232 621.969759 622.828416
7 985.434177 1049.113913 8.872906 443.209607 511.210536 518.802292 624.151468 622.097039
8 985.376238 1046.102696 8.823786 440.815454 517.491411 517.905752 623.390319 625.387487
9 983.341325 1065.892279 8.871586 449.145252 516.029267 516.916524 622.811992 622.798858
10 985.438642 1046.71167 8.8518 445.970679 512.045037 518.934149 622.592098 614.661805
11 990.592624 1050.377106 8.832753 443.844237 515.758106 510.808005 611.62254 622.956946
12 986.747374 1066.052969 8.884928 444.398327 517.259451 524.770132 622.085785 619.311172
13 984.328191 1052.189621 8.88281 439.612497 517.861131 515.796013 617.862222 615.101452
14 984.405339 1049.06783 8.835775 438.871305 517.885493 515.853446 615.254987 623.464483
15 997.323593 1064.473985 8.90682 443.640208 515.329143 518.807239 623.020916 623.013801
16 997.253932 1066.558928 8.900308 442.534843 511.930766 516.365803 624.316916 615.037306
17 999.464751 1046.464149 8.895899 443.48306 514.841946 517.082166 617.615908 618.661376
18 1001.896073 1045.304622 8.877555 444.875225 515.029862 510.365428 618.540866 624.355309
19 986.055833 1045.208347 8.863339 441.647553 511.489699 517.213428 623.61503 621.198543
20 984.112667 1047.317164 8.940865 451.304206 514.762544 510.45981 621.057397 621.483146
21 988.310511 1046.154648 8.865301 447.25874 514.859414 517.24163 623.916511 614.185296
22 982.874582 1062.113889 8.867098 439.785463 510.387721 516.885653 623.494968 622.527091
23 980.96967 1048.050631 8.867966 439.05464 511.423329 516.984465 621.567988 621.204435
24 983.189843 1046.083632 8.81578 440.574651 518.390122 520.392926 622.34785 614.923018
25 987.033178 1074.553767 8.812579 446.687106 513.914686 521.952744 615.870183 621.089011
26 985.771758 1076.245942 8.845264 444.718264 516.274395 513.5547 615.927497 615.53522
27 981.748774 1046.85677 8.818164 443.252924 513.632714 515.919924 626.659516 622.307368
28 983.979894 1062.317764 8.869256 443.267803 513.965345 509.688356 615.790469 615.712761
29 980.908776 1045.006602 8.855109 444.452376 517.488159 509.770143 621.96871 621.582871
30 981.508232 1046.722776 8.790313 443.952753 513.840793 512.847346 621.747601 621.901271
31 999.165558 1063.517734 8.792905 440.356205 517.677777 517.920992 620.90204 613.422668
32 1000.854281 1060.766663 9.027399 444.385706 510.006231 514.006688 623.684492 620.742008
33 1001.620724 1046.329083 8.72687 443.912072 509.793229 513.313214 620.695915 621.266234
34 1008.677463 1044.437966 8.799494 447.077333 516.263674 514.751767 622.775084 620.885167
35 987.309353 1048.062722 8.763122 440.748052 518.972785 518.608101 621.032898 620.85482
36 980.960836 1052.037316 8.834358 445.210623 518.850346 511.742763 620.719135 621.679027
37 983.807955 1049.894433 8.798039 440.302584 511.351473 510.557417 615.059624 619.802549
38 982.144377 1071.744423 8.81421 444.036711 518.551589 515.779265 614.579103 615.092139
39 984.460101 1051.399337 8.781058 439.998112 518.709639 511.192122 614.797646 621.154429
40 981.16924 1047.739329 8.827411 446.515425 512.815557 519.138446 621.769829 613.83963
Mean 995.5780219 1053.415701 8.87387035 443.6585744 515.115835 515.7387151 620.4259376 619.7447621
StdDev 34.59108879 10.41692769 0.139732265 3.344053675 2.951489941 3.528886642 3.48797555 3.346472427

msfroh avatar Feb 09 '24 08:02 msfroh

@dungba88 -- I'm trying to resolve conflicts with your changes, but I'm a little stuck. I don't understand how we're supposed to use the FST APIs to write the FST to disk now.

After merging our changes, SynonymMap contains:

      FST<BytesRef> fst = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader());
      if (directory != null) {
        fstOutput.close(); // TODO -- Should fstCompiler.compile take care of this?
        try (SynonymMapDirectory.WordsOutput wordsOutput = directory.wordsOutput()) {
          BytesRef scratchRef = new BytesRef();
          for (int i = 0; i < words.size(); i++) {
            words.get(i, scratchRef);
            wordsOutput.addWord(scratchRef);
          }
        }
        directory.writeMetadata(words.size(), maxHorizontalContext, fst);
        return directory.readMap();
      }

That call to FST.fromFSTReader(...) fails with:

The DataOutput must implement FSTReader, but got FSIndexOutput(path="/home/froh/ws/lucene/lucene/analysis/common/build/tmp/tests-tmp/lucene.analysis.synonym.TestSynonymGraphFilter_1171182AD5892267-001/tempDir-001/synonyms.fst")

Is there something else that I'm supposed to be calling on the write path? Note that in the "off-heap" case above (when directory != null), we just need to write the FST. The directory.readMap() call loads it fresh from disk, discarding the FST that we constructed on heap.

msfroh avatar Feb 23 '24 18:02 msfroh

@msfroh

As you only need to write the FST metadata, there is no need to create the FST. You can just call

directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);

where as fstMetadata is returned by fstCompiler.compile()

dungba88 avatar Feb 24 '24 08:02 dungba88

@msfroh

As you only need to write the FST metadata, there is no need to create the FST. You can just call

directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);

where as fstMetadata is returned by fstCompiler.compile()

Hmm... I'm not sure that will work, because the logic to save the metadata is associated with the FST instance (i.e. in the FST::saveMetadata method).

I tried extracting that method out into a static, but the line outputs.writeFinalOutput(...) breaks things.

msfroh avatar Feb 25 '24 22:02 msfroh

You are right, the saveMetadata is still in FST.

Now to create the FST written off heap, you need to create the corresponding DataInput and use the FST constructor.

However the saveMetadata method can be moved to FSTMetadata as well since all of the information are stored there (including outputs)

dungba88 avatar Feb 26 '24 00:02 dungba88

I could put a PR for the saveMetadata change if you prefer.

dungba88 avatar Feb 26 '24 02:02 dungba88

I could put a PR for the saveMetadata change if you prefer.

I'll update to take care of that. Thanks for the pointers!

msfroh avatar Feb 26 '24 04:02 msfroh

I realized I also need the saveMetadata change for https://github.com/apache/lucene/pull/12985. Do you think we should make it a standalone PR and merge first? Otherwise I've cherry-picked from this PR :)

dungba88 avatar Feb 27 '24 14:02 dungba88

I realized I also need the saveMetadata change for https://github.com/apache/lucene/pull/12985. Do you think we should make it a standalone PR and merge first? Otherwise I've cherry-picked from this PR :)

Sure -- a standalone PR should work. We can both rebase onto that.

msfroh avatar Feb 27 '24 17:02 msfroh

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Mar 13 '24 00:03 github-actions[bot]

@dungba88 - I forgot about this change for a while. Did you create a separate PR for the saveMetadata change? Should I?

msfroh avatar May 25 '24 03:05 msfroh

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Jun 09 '24 00:06 github-actions[bot]

@msfroh I also forgot about this. Let me create a PR

dungba88 avatar Jul 08 '24 04:07 dungba88

I published a PR here: https://github.com/apache/lucene/pull/13549. Please take a look when you have time!

dungba88 avatar Jul 08 '24 04:07 dungba88

Note: The above PR has been merged

dungba88 avatar Jul 22 '24 23:07 dungba88

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Aug 17 '24 00:08 github-actions[bot]