Move synonym map off-heap for SynonymGraphFilter
Description
This stores the synonym map's FST and word lookup off-heap in a separate, configurable directory.
The initial implementation is rough, but the unit tests pass with this change randomly enabled.
Obvious things that need work are:
- I tried to do something codec-like (though not an actual codec) for the synonym map files. For a solution that could evolve over time, we should probably at least write something to the metadata file saying what format was used.
- Right now it makes no effort to detect changes to the synonym files. I would suggest that `SynonymGraphFilterFactory` rebuild the directory if a checksum of the input files doesn't match a value recorded in the metadata file (a rough sketch of such a check is at the end of this description).
- I don't think I like the random seeks in `OffHeapBytesRefHashLike`, but I don't see an alternative (besides moving it on-heap). Given that the original issue was only about moving the FST off-heap, maybe we can keep the word dictionary on-heap.
Fixes #13005
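For the checksum idea in the list above, here is a minimal sketch of what the staleness check could compute; the `SynonymChecksum` helper and the idea of storing the value alongside the other metadata are assumptions, not anything in this PR yet:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.CRC32;

class SynonymChecksum {
  /** CRC32 over the raw bytes of every input synonym file, in order. */
  static long compute(List<Path> synonymFiles) throws IOException {
    CRC32 crc = new CRC32();
    for (Path file : synonymFiles) {
      crc.update(Files.readAllBytes(file)); // hash the file contents as-is
    }
    return crc.getValue();
  }
}
```

`SynonymGraphFilterFactory` would record this value when building the directory and rebuild whenever the recomputed checksum no longer matches.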
I think measuring the latency impact would also be important for this change, since off-heap access will be slower (I understand that we are only adding off-heap as an option here, not forcing it, but still).
I did some rough benchmarks using the large synonym file attached to https://issues.apache.org/jira/browse/LUCENE-3233
The benchmark code and input are at https://github.com/msfroh/lucene/commit/174f98a91e8709ee66dc2ed84b5c0b54e1a10635
Attempt | On-heap load (ms) | Off-heap load (ms) | Off-heap reload (ms) | On-heap process (ms) | Off-heap process (ms) | Off-heap reload process (ms) |
---|---|---|---|---|---|---|
1 | 1146.022381 | 1117.004359 | 4.099065 | 569.120851 | 656.430684 | 613.475144 |
2 | 1079.578922 | 1060.926854 | 1.761465 | 456.203168 | 655.596275 | 622.534246 |
3 | 1035.911388 | 1076.611629 | 1.750233 | 579.41094 | 655.955431 | 614.788388 |
4 | 1037.825728 | 1085.513933 | 2.074129 | 696.390519 | 688.664985 | 613.266972 |
5 | 1017.489384 | 1008.209808 | 1.717748 | 485.510526 | 620.800148 | 620.708538 |
6 | 1014.641653 | 1024.412669 | 1.740371 | 483.617261 | 619.696259 | 619.910897 |
7 | 1027.691397 | 1045.129567 | 1.727786 | 670.49456 | 622.48759 | 616.303549 |
8 | 984.005971 | 1009.265777 | 1.736832 | 513.543926 | 615.448442 | 613.06279 |
9 | 1027.841112 | 1027.057453 | 1.732985 | 486.502644 | 622.535269 | 620.285635 |
10 | 981.689573 | 1074.613506 | 1.71059 | 707.810107 | 613.417977 | 624.34832 |
11 | 1026.165712 | 1065.3181 | 1.689407 | 479.610417 | 621.454353 | 616.183786 |
12 | 994.949905 | 1046.898091 | 1.730394 | 498.938696 | 612.279425 | 619.965444 |
13 | 1035.144288 | 1043.119169 | 1.739726 | 472.821155 | 619.267425 | 613.029508 |
14 | 996.056368 | 1017.663948 | 1.699742 | 692.135015 | 619.725163 | 620.454352 |
15 | 1046.605644 | 1018.287866 | 1.713526 | 470.391592 | 619.723699 | 612.068366 |
16 | 1007.579733 | 1042.062818 | 1.70251 | 508.481346 | 619.481298 | 619.178419 |
17 | 1038.166702 | 1054.039165 | 1.683814 | 485.439337 | 620.901934 | 616.017789 |
18 | 1000.900448 | 1058.492139 | 1.7267 | 515.185816 | 622.204031 | 627.560895 |
19 | 1236.416447 | 1080.877889 | 1.643654 | 434.73928 | 624.825435 | 625.622426 |
20 | 997.663619 | 1038.478411 | 1.657257 | 497.232157 | 623.337627 | 620.943519 |
Mean | 1036.617319 | 1049.699158 | 1.8518967 | 535.1789657 | 628.7116725 | 618.4854492 |
Stddev | 59.71799264 | 28.44516049 | 0.535792004 | 86.95026923 | 19.55324941 | 4.52695571 |
So, it looks like the time to load synonyms is mostly unchanged (1050ms versus 1037ms), and loading "pre-compiled" synonyms is super-duper fast.
We do seem to take a 17.5% hit on processing time (629ms versus 535ms). I might try profiling to see where that time is being spent. If it's doing FST operations, I'll assume it's a cost of doing business. If it's spent loading the output words (which are also off-heap), I might consider moving those (optionally?) back on heap.
I decided to try experimenting with moving the output words back onto the heap, since I didn't like the fact that every word lookup was triggering a seek.
Running it now, I got way less variance on the on-heap runs. I also added some GCs between iterations, since I wanted to measure the heap usage of each configuration. That likely removed some GC pauses from the on-heap version.
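For context, the heap numbers reported below come from sampling after those forced GCs; a rough sketch of that kind of measurement (the real benchmark code is in the commit linked earlier, and sampling used heap from `Runtime` is an assumption here, not a quote of it):

```java
// Rough sketch: force a couple of GCs so transient garbage doesn't inflate the number,
// then sample used heap. This approximates retained size; it is not an exact measurement.
static long usedHeapAfterGc() throws InterruptedException {
  for (int i = 0; i < 2; i++) {
    System.gc();
    Thread.sleep(100); // give the collector a moment to settle
  }
  Runtime rt = Runtime.getRuntime();
  return rt.totalMemory() - rt.freeMemory();
}
```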
I then switched back to the off-heap words to confirm the results that I saw last time (and compare against the implementation with on-heap words).
The conclusion seems to be roughly:
- Existing on-heap FST averages about 444ms to process a lot of synonyms.
- Off-heap FST with on-heap words averages 515 or 516ms. (About 16% slower than existing on-heap.)
- Off-heap FST with off-heap words averages 620ms. (About 40% slower than existing on-heap.)
The on-heap FST seems to occupy about 36MB of heap. The off-heap FST with on-heap words occupies about 560kB. The off-heap FST with off-heap words occupies about 150kB.
With these trade-offs, I think off-heap FST with on-heap words may be a good choice for folks with large sets of synonyms. I don't think I would recommend off-heap FST with off-heap words.
Attempt | OnHeap FST load time (ms) | OffHeap FST (OnHeap words) load time (ms) | OffHeap FST (OnHeap words) reload time (ms) | OnHeap FST processing time (ms) | OffHeap FST (OnHeap words) processing time (ms) | OffHeap FST (OnHeap words) reloaded processing time (ms) | OffHeap FST (OffHeap words) processing time (ms) | OffHeap FST (OffHeap words) reloaded processing time (ms) |
---|---|---|---|---|---|---|---|---|
1 | 1191.339685 | 1072.285824 | 9.669646 | 436.391631 | 520.550704 | 516.11297 | 623.451546 | 620.531215 |
2 | 1030.432454 | 1033.619768 | 8.874105 | 448.848403 | 516.784387 | 517.230739 | 621.522464 | 622.793343 |
3 | 984.83645 | 1037.807342 | 8.912252 | 443.789813 | 512.066535 | 517.716981 | 622.455444 | 620.468985 |
4 | 1049.63589 | 1048.60113 | 8.894401 | 449.237547 | 518.946226 | 516.868933 | 617.837364 | 616.810236 |
5 | 990.22176 | 1049.618665 | 8.861166 | 448.923912 | 512.559801 | 511.114898 | 616.555422 | 617.122551 |
6 | 978.41877 | 1063.824595 | 8.930418 | 440.251675 | 517.632376 | 518.175232 | 621.969759 | 622.828416 |
7 | 985.434177 | 1049.113913 | 8.872906 | 443.209607 | 511.210536 | 518.802292 | 624.151468 | 622.097039 |
8 | 985.376238 | 1046.102696 | 8.823786 | 440.815454 | 517.491411 | 517.905752 | 623.390319 | 625.387487 |
9 | 983.341325 | 1065.892279 | 8.871586 | 449.145252 | 516.029267 | 516.916524 | 622.811992 | 622.798858 |
10 | 985.438642 | 1046.71167 | 8.8518 | 445.970679 | 512.045037 | 518.934149 | 622.592098 | 614.661805 |
11 | 990.592624 | 1050.377106 | 8.832753 | 443.844237 | 515.758106 | 510.808005 | 611.62254 | 622.956946 |
12 | 986.747374 | 1066.052969 | 8.884928 | 444.398327 | 517.259451 | 524.770132 | 622.085785 | 619.311172 |
13 | 984.328191 | 1052.189621 | 8.88281 | 439.612497 | 517.861131 | 515.796013 | 617.862222 | 615.101452 |
14 | 984.405339 | 1049.06783 | 8.835775 | 438.871305 | 517.885493 | 515.853446 | 615.254987 | 623.464483 |
15 | 997.323593 | 1064.473985 | 8.90682 | 443.640208 | 515.329143 | 518.807239 | 623.020916 | 623.013801 |
16 | 997.253932 | 1066.558928 | 8.900308 | 442.534843 | 511.930766 | 516.365803 | 624.316916 | 615.037306 |
17 | 999.464751 | 1046.464149 | 8.895899 | 443.48306 | 514.841946 | 517.082166 | 617.615908 | 618.661376 |
18 | 1001.896073 | 1045.304622 | 8.877555 | 444.875225 | 515.029862 | 510.365428 | 618.540866 | 624.355309 |
19 | 986.055833 | 1045.208347 | 8.863339 | 441.647553 | 511.489699 | 517.213428 | 623.61503 | 621.198543 |
20 | 984.112667 | 1047.317164 | 8.940865 | 451.304206 | 514.762544 | 510.45981 | 621.057397 | 621.483146 |
21 | 988.310511 | 1046.154648 | 8.865301 | 447.25874 | 514.859414 | 517.24163 | 623.916511 | 614.185296 |
22 | 982.874582 | 1062.113889 | 8.867098 | 439.785463 | 510.387721 | 516.885653 | 623.494968 | 622.527091 |
23 | 980.96967 | 1048.050631 | 8.867966 | 439.05464 | 511.423329 | 516.984465 | 621.567988 | 621.204435 |
24 | 983.189843 | 1046.083632 | 8.81578 | 440.574651 | 518.390122 | 520.392926 | 622.34785 | 614.923018 |
25 | 987.033178 | 1074.553767 | 8.812579 | 446.687106 | 513.914686 | 521.952744 | 615.870183 | 621.089011 |
26 | 985.771758 | 1076.245942 | 8.845264 | 444.718264 | 516.274395 | 513.5547 | 615.927497 | 615.53522 |
27 | 981.748774 | 1046.85677 | 8.818164 | 443.252924 | 513.632714 | 515.919924 | 626.659516 | 622.307368 |
28 | 983.979894 | 1062.317764 | 8.869256 | 443.267803 | 513.965345 | 509.688356 | 615.790469 | 615.712761 |
29 | 980.908776 | 1045.006602 | 8.855109 | 444.452376 | 517.488159 | 509.770143 | 621.96871 | 621.582871 |
30 | 981.508232 | 1046.722776 | 8.790313 | 443.952753 | 513.840793 | 512.847346 | 621.747601 | 621.901271 |
31 | 999.165558 | 1063.517734 | 8.792905 | 440.356205 | 517.677777 | 517.920992 | 620.90204 | 613.422668 |
32 | 1000.854281 | 1060.766663 | 9.027399 | 444.385706 | 510.006231 | 514.006688 | 623.684492 | 620.742008 |
33 | 1001.620724 | 1046.329083 | 8.72687 | 443.912072 | 509.793229 | 513.313214 | 620.695915 | 621.266234 |
34 | 1008.677463 | 1044.437966 | 8.799494 | 447.077333 | 516.263674 | 514.751767 | 622.775084 | 620.885167 |
35 | 987.309353 | 1048.062722 | 8.763122 | 440.748052 | 518.972785 | 518.608101 | 621.032898 | 620.85482 |
36 | 980.960836 | 1052.037316 | 8.834358 | 445.210623 | 518.850346 | 511.742763 | 620.719135 | 621.679027 |
37 | 983.807955 | 1049.894433 | 8.798039 | 440.302584 | 511.351473 | 510.557417 | 615.059624 | 619.802549 |
38 | 982.144377 | 1071.744423 | 8.81421 | 444.036711 | 518.551589 | 515.779265 | 614.579103 | 615.092139 |
39 | 984.460101 | 1051.399337 | 8.781058 | 439.998112 | 518.709639 | 511.192122 | 614.797646 | 621.154429 |
40 | 981.16924 | 1047.739329 | 8.827411 | 446.515425 | 512.815557 | 519.138446 | 621.769829 | 613.83963 |
Mean | 995.5780219 | 1053.415701 | 8.87387035 | 443.6585744 | 515.115835 | 515.7387151 | 620.4259376 | 619.7447621 |
StdDev | 34.59108879 | 10.41692769 | 0.139732265 | 3.344053675 | 2.951489941 | 3.528886642 | 3.48797555 | 3.346472427 |
@dungba88 -- I'm trying to resolve conflicts with your changes, but I'm a little stuck. I don't understand how we're supposed to use the FST APIs to write the FST to disk now.
After merging our changes, `SynonymMap` contains:

```java
FST<BytesRef> fst = FST.fromFSTReader(fstCompiler.compile(), fstCompiler.getFSTReader());
if (directory != null) {
  fstOutput.close(); // TODO -- Should fstCompiler.compile take care of this?
  try (SynonymMapDirectory.WordsOutput wordsOutput = directory.wordsOutput()) {
    BytesRef scratchRef = new BytesRef();
    for (int i = 0; i < words.size(); i++) {
      words.get(i, scratchRef);
      wordsOutput.addWord(scratchRef);
    }
  }
  directory.writeMetadata(words.size(), maxHorizontalContext, fst);
  return directory.readMap();
}
```
That call to `FST.fromFSTReader(...)` fails with:

```
The DataOutput must implement FSTReader, but got FSIndexOutput(path="/home/froh/ws/lucene/lucene/analysis/common/build/tmp/tests-tmp/lucene.analysis.synonym.TestSynonymGraphFilter_1171182AD5892267-001/tempDir-001/synonyms.fst")
```
Is there something else that I'm supposed to be calling on the write path? Note that in the "off-heap" case above (when `directory != null`), we just need to write the FST. The `directory.readMap()` call loads it fresh from disk, discarding the FST that we constructed on heap.
@msfroh
As you only need to write the FST metadata, there is no need to create the FST. You can just call

```java
directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);
```

where `fstMetadata` is returned by `fstCompiler.compile()`.
> @msfroh
> As you only need to write the FST metadata, there is no need to create the FST. You can just call `directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);` where `fstMetadata` is returned by `fstCompiler.compile()`.
Hmm... I'm not sure that will work, because the logic to save the metadata is associated with the `FST` instance (i.e. in the `FST::saveMetadata` method).
I tried extracting that method out into a static, but the line `outputs.writeFinalOutput(...)` breaks things.
You are right, `saveMetadata` is still in `FST`.
Now, to create the FST written off-heap, you need to create the corresponding `DataInput` and use the `FST` constructor.
However, the `saveMetadata` method can be moved to `FSTMetadata` as well, since all of the information is stored there (including the outputs).
I could put up a PR for the `saveMetadata` change if you prefer.
> I could put up a PR for the `saveMetadata` change if you prefer.
I'll update to take care of that. Thanks for the pointers!
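Putting those pointers together, the write path in the earlier snippet would roughly become something like the sketch below. It keeps this PR's own `SynonymMapDirectory`/`fstOutput` abstractions and assumes the `saveMetadata` change discussed above, so it is not verified against a released FST API:

```java
// Sketch only: skip FST.fromFSTReader on the write path. The compiler has already
// streamed the FST body to fstOutput (an IndexOutput), so we persist just the
// FST metadata plus the word dictionary, then reload everything from disk.
var fstMetadata = fstCompiler.compile();
fstOutput.close();
try (SynonymMapDirectory.WordsOutput wordsOutput = directory.wordsOutput()) {
  BytesRef scratchRef = new BytesRef();
  for (int i = 0; i < words.size(); i++) {
    words.get(i, scratchRef);
    wordsOutput.addWord(scratchRef);
  }
}
directory.writeMetadata(words.size(), maxHorizontalContext, fstMetadata);
// readMap() reopens the FST from the directory via a DataInput and the FST constructor.
return directory.readMap();
```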
I realized I also need the `saveMetadata` change for https://github.com/apache/lucene/pull/12985. Do you think we should make it a standalone PR and merge first? Otherwise I've cherry-picked from this PR :)
> I realized I also need the `saveMetadata` change for https://github.com/apache/lucene/pull/12985. Do you think we should make it a standalone PR and merge first? Otherwise I've cherry-picked from this PR :)
Sure -- a standalone PR should work. We can both rebase onto that.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
@dungba88 - I forgot about this change for a while. Did you create a separate PR for the saveMetadata change? Should I?
@msfroh I also forgot about this. Let me create a PR
I published a PR here: https://github.com/apache/lucene/pull/13549. Please take a look when you have time!
Note: The above PR has been merged