# Reduce the size of the `word_pair_proximity` database
Currently, given the following text:

```
The lazy dog
```

the keys of the `word_pair_proximity_docids` database are:

```
dog the 3
dog lazy 2
lazy dog 1
lazy the 2
the dog 2
the lazy 1
```
For each pair of words, we store both the "rightward" proximity and the "leftward" proximity, which is equal to the rightward proximity + 1.
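As a rough sketch of how these pairs could be derived (this is not milli's actual extraction code, and the proximity cap below is an assumption):

```rust
use std::collections::BTreeMap;

/// Maximum proximity recorded in the database (assumed cap for this sketch).
const MAX_PROXIMITY: usize = 7;

/// Derive both rightward and leftward proximities for every word pair,
/// keeping the minimum when a pair occurs several times.
fn word_pair_proximities(text: &str) -> BTreeMap<(String, String), usize> {
    let words: Vec<String> = text.split_whitespace().map(str::to_lowercase).collect();
    let mut pairs = BTreeMap::new();
    for i in 0..words.len() {
        for j in i + 1..words.len() {
            let rightward = j - i;
            if rightward > MAX_PROXIMITY {
                break;
            }
            // rightward proximity for (words[i], words[j]) ...
            let e = pairs.entry((words[i].clone(), words[j].clone())).or_insert(usize::MAX);
            *e = (*e).min(rightward);
            // ... and leftward proximity (rightward + 1) for (words[j], words[i])
            let e = pairs.entry((words[j].clone(), words[i].clone())).or_insert(usize::MAX);
            *e = (*e).min(rightward + 1);
        }
    }
    pairs
}

fn main() {
    for ((w1, w2), prox) in word_pair_proximities("The lazy dog") {
        println!("{w1} {w2} {prox}");
    }
}
```

With the change proposed below, only the rightward insertion would remain.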
I am wondering whether we could reduce the size of the database by only storing the rightward proximity. Given the same text, the database would contain the keys:

```
lazy dog 1
the dog 2
the lazy 1
```
This would help in two ways:
- we reduce the size of the database on disk, which could be important because it is a very large one
- it would also speed up indexing significantly
I don't think that we would lose any information. In fact, it seems like we would have more information at our disposal to tweak the search results. We can, for example, restrict a search to documents where `word1` is followed by `word2`, or we could tweak the "proximity penalty" of backward word pairs, which is currently set to `+ 1` (I mean that given the word pair `word1 .. word2` with a proximity of `N`, we also record `word2 word1 N+1`).
Another example:

```
ID 0: run forest run
ID 1: run in forest
```

currently yields:

```
forest in 2 [1]
forest run 1 [0]
forest run 3 [1]
in forest 1 [1]
in run 2 [1]
run forest 1 [0]
run forest 2 [1]
run in 1 [1]
run run 2 [0]
```
and with the suggested change, would yield:

```
run forest 1 [0]
run run 2 [0]
forest run 1 [0]
run in 1 [1]
run forest 2 [1]
in forest 1 [1]
```
To find the document ids where `run` is `2`-close to `forest`, we currently search for `run forest 2` and get:

```
run forest 2 [1]
```
With the new scheme, we need to execute three queries:

1. `run forest 2`
2. `forest run 1`
3. `run forest 1`

Then we compute `1 | (2 - 3)`:

```
1. run forest 2 [1]
2. forest run 1 [0]
3. run forest 1 [0]

1 | (2 - 3)
= [1] | ([0] - [0])
= [1]
```
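Since milli stores docids as roaring bitmaps, this computation maps directly onto bitmap operations. A minimal sketch with the `roaring` crate, using the `[…]` sets from the example above as the query results:

```rust
use roaring::RoaringBitmap;

fn main() {
    // Docid sets returned by the three queries in the example above.
    let q1: RoaringBitmap = [1u32].into_iter().collect(); // run forest 2 -> [1]
    let q2: RoaringBitmap = [0u32].into_iter().collect(); // forest run 1 -> [0]
    let q3: RoaringBitmap = [0u32].into_iter().collect(); // run forest 1 -> [0]

    // 1 | (2 - 3) = [1] | ([0] - [0]) = [1]
    let result = &q1 | &(&q2 - &q3);
    assert_eq!(result.into_iter().collect::<Vec<u32>>(), vec![1]);
}
```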
The idea is that we want:

- all the documents where `run forest` has a rightward proximity of `2`
- all the documents where `run forest` has a leftward proximity of `1`, except those where `run forest` has a rightward proximity of `1`, since the overall minimal proximity between `run` and `forest` in those documents is `1` and not `2`.
This would be slower of course, but I don't know if the performance loss would be noticeable.
Hello @loiclec,
First, I think the performance loss would rarely be noticeable. The reason is that in many cases, your equation `1 | (2 - 3)` would be simplified to `1 | 2`, because we can consider that `run` being `1`-close to `forest` has already been processed in a previous iteration, and so the involved documents have already been removed from the current iteration.
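To illustrate that simplification (a sketch only; `docids_for_proximity` is a hypothetical accessor, not milli's actual criterion code): if proximities are processed in increasing order and each iteration removes the documents already returned, the subtraction in `(2 - 3)` is already performed by the accumulator:

```rust
use roaring::RoaringBitmap;

/// Process proximities from smallest to largest, removing documents that were
/// already returned by an earlier (i.e. closer) iteration.
fn proximity_buckets(
    docids_for_proximity: impl Fn(u32) -> RoaringBitmap, // union `1 | 2` at this proximity
    max_proximity: u32,
) -> Vec<RoaringBitmap> {
    let mut already_returned = RoaringBitmap::new();
    let mut buckets = Vec::new();
    for prox in 1..=max_proximity {
        // Documents found at a smaller proximity never reappear here.
        let bucket = docids_for_proximity(prox) - &already_returned;
        already_returned |= &bucket;
        buckets.push(bucket);
    }
    buckets
}
```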
Moreover, I think this improvement could enhance the relevancy. Making the distinction between a swap and a distance of 2 would reduce the probability of returning false positives when computing a PHRASE (in a way, this function may be refactored with this change).
To conclude, I think this is a good improvement and I don't see any drawbacks in terms of relevancy. 😄
That's good to hear :-)
Have you also thought about the impact on the `word_prefix_pair_proximity_docids` database and its usage? Since this database is computed from `word_pair_proximity_docids`, it will also be smaller.

I am not too familiar with its exact role in the execution of a search query. I wonder if we'll also want to use the `1 | (2 - 3)` formula there (or the simplification you talked about), or if we want to limit ourselves to searching for non-swapped word pairs (just `1` instead of `1 | (2 - 3)`).
Hey @loiclec, I think if both databases have the same logic it's ok; the resulting documents coming from these databases are mainly computed in search/criteria/mod.rs#L450-L524
Below is an update on this issue. It doesn't require any answer, I am mostly writing it here to document the process :)
## Correctness of the word prefix pair database
@ManyTheFish and I discussed a bit more about this issue and found that there is a problem with the `word_prefix_pair_proximity_docids` database following the proposed refactor. For example, if someone types:

```
New York ci
```

and we have a document in the dataset that contains:

```
the city of New York
```

then we would expect that document to be returned among the results, based on the prefix `ci` matching `city` and being in close proximity to `new` and `york`. However, if `ci` is in the `word_prefix_pair_proximity_docids` database, then it would only appear as:

```
the ci 1
```

and not:

```
new ci 3
york ci 4
```

(`city` precedes `new` and `york`, so only the leftward keys mention the prefix on the right) which is what we want.
To solve this problem, we saw two potential solutions:

1. drop the `word_prefix_pair_proximity_docids` database and always find the words corresponding to a prefix via the FST
2. create another database that is equivalent to `word_prefix_pair_proximity_docids` except that the prefixes come first and are on the left, maybe called `prefix_word_pair_proximity_docids`.
I have started implementing (2) in #639.
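As an illustration of solution (2), and assuming the rightward-only scheme: the document `the city of New York` would contribute the following keys for the prefix `ci` to the new database (prefix on the left):

```
ci of 1
ci new 2
ci york 3
```

while `word_prefix_pair_proximity_docids` would keep `the ci 1` for the pairs where the prefix is on the right.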
## Correctness of the formula I originally proposed
Furthermore, I also realised that my formula `1 | (2 - 3)` is pretty much completely wrong. For example:

```
TEXT: the man who sometimes plays the violin
ARGS: man the 4

A: ("man", "the", 4) => true
B: ("the", "man", 3) => false
C: ("man", "the", 3) => false

Actual result: A | (B - C) => true
Correct result: false (because `the man` means `man` has a proximity of 2 with `the`)
```
The point is that subtracting `3` to remove the false positives is not enough. A better implementation is to:

1. Fetch the docids for `(word1, word2, prox)`, then subtract from those all the docids for `(word2, word1, p)` where `p < prox - 1`.
2. Fetch the docids for `(word2, word1, prox - 1)`, then subtract from those all the docids for `(word1, word2, p)` where `p < prox`.
3. Union the results of 1 and 2 (see the sketch below).
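A minimal sketch of those three steps with roaring bitmaps, where `lookup` is a hypothetical accessor standing in for a read of the `word_pair_proximity_docids` database:

```rust
use roaring::RoaringBitmap;

/// Resolve the docids where `word1` and `word2` are exactly `prox`-close,
/// when only rightward proximities are stored.
fn resolve_proximity(
    lookup: impl Fn(&str, &str, u32) -> RoaringBitmap,
    word1: &str,
    word2: &str,
    prox: u32,
) -> RoaringBitmap {
    // 1. docids with a rightward proximity of `prox` for (word1, word2) ...
    let mut forward = lookup(word1, word2, prox);
    // ... minus those with a strictly smaller leftward proximity,
    //     i.e. (word2, word1, p) where p < prox - 1
    for p in 1..prox.saturating_sub(1) {
        forward -= lookup(word2, word1, p);
    }
    // 2. docids with a leftward proximity of `prox` for (word1, word2),
    //    which is stored as (word2, word1, prox - 1) ...
    let mut backward = lookup(word2, word1, prox - 1);
    // ... minus those with a strictly smaller rightward proximity,
    //     i.e. (word1, word2, p) where p < prox
    for p in 1..prox {
        backward -= lookup(word1, word2, p);
    }
    // 3. union of the two
    forward | backward
}
```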
I am still not certain that this is 100% correct or that it is the fastest way to do it. However, it doesn't really matter because, as @ManyTheFish said:

> The reason is that in many cases, your equation `1 | (2 - 3)` would be simplified to `1 | 2`, because we can consider that `run` being `1`-close to `forest` has already been processed in a previous iteration, and so the involved documents have already been removed from the current iteration.

So we'll just use `1 | 2` as originally proposed and see if any problem arises.
## Benchmarks
I launched some benchmarks where nothing changed except that we no longer store leftward proximities, to get an idea of the potential performance improvements made possible by this change. The results are:

*(benchmark results screenshot)*
I would be very happy if we can really reduce the time to index the wiki dataset from 14.5 minutes to 9.4 minutes :)