Document that doc_freq() includes deleted documents
The method searcher.doc_freq() doesn't seem to take all delete operations into account. When I do:
- add two documents, one with field=a, one with field=b
- delete all where field=a
- commit
then searcher.doc_freq(field=a) returns 1 instead of 0
The failing test case is here: https://github.com/brainlock/tantivy/commit/222d8fe03c57f569bb2fd69358b82fc07d629bfa
(It's based on latest master, but I also reproduced it with the version we're currently using, 0.11.)
I also tried adding a commit() after the two add operations and before the delete, with the same result.
If I close the index and reopen it, I get the same result. From this, I deduced that the frequency information is already incorrect when serialized. I confirmed this by setting breakpoints during commit(): at that point, TermInfoStoreWriter is already carrying the wrong frequency.
More potentially useful observations:
- if I change the test to add only documents where field=a (1 in the test code), the doc_freq call at the end correctly returns 0
- if I change the test to create 100 documents with field=a and then use a multithreaded writer, I get strange results: with 1 thread, the doc_freq at the end returns 100 instead of 0. With 2 threads: 99, with 3 threads: 98, with 4 threads: 97...
Yes, we ignore deletes in doc_freq. We should update the documentation accordingly.
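
To illustrate why a term's document frequency can still count deleted documents, here is a minimal toy model. It assumes a segment whose posting lists are left untouched by a delete, with deleted documents tracked separately as tombstones. This is only a sketch and not tantivy's actual code: `ToySegment`, `alive_doc_freq` and `repro` are hypothetical names, and the sketch does not try to reproduce the multithreaded results described above.

```rust
use std::collections::{BTreeMap, HashSet};

/// Toy segment (hypothetical; not tantivy's real data structures).
struct ToySegment {
    /// term -> doc ids containing that term (the posting list).
    postings: BTreeMap<String, Vec<u32>>,
    /// Doc ids marked as deleted (tombstones). Postings are not rewritten.
    deleted: HashSet<u32>,
    /// Next doc id to hand out.
    next_doc_id: u32,
}

impl ToySegment {
    fn new() -> Self {
        ToySegment {
            postings: BTreeMap::new(),
            deleted: HashSet::new(),
            next_doc_id: 0,
        }
    }

    /// Adds a document containing a single term and returns its doc id.
    fn add_document(&mut self, term: &str) -> u32 {
        let doc_id = self.next_doc_id;
        self.next_doc_id += 1;
        self.postings
            .entry(term.to_string())
            .or_default()
            .push(doc_id);
        doc_id
    }

    /// Marks every document containing `term` as deleted.
    fn delete_term(&mut self, term: &str) {
        if let Some(docs) = self.postings.get(term) {
            self.deleted.extend(docs.iter().copied());
        }
    }

    /// Cheap lookup: the posting list length, which ignores tombstones.
    fn doc_freq(&self, term: &str) -> usize {
        self.postings.get(term).map_or(0, |docs| docs.len())
    }

    /// Exact count: walks the posting list and skips deleted documents.
    fn alive_doc_freq(&self, term: &str) -> usize {
        self.postings.get(term).map_or(0, |docs| {
            docs.iter().filter(|d| !self.deleted.contains(*d)).count()
        })
    }
}

/// Mirrors the steps from the report: add a and b, then delete where field=a.
/// Returns (doc_freq("a"), alive_doc_freq("a")).
fn repro() -> (usize, usize) {
    let mut segment = ToySegment::new();
    segment.add_document("a");
    segment.add_document("b");
    segment.delete_term("a");
    (segment.doc_freq("a"), segment.alive_doc_freq("a"))
}
```

In this toy model, `repro()` returns `(1, 0)`: the cheap lookup still reports 1 for the deleted term, while the exact count is 0 but has to visit every posting. That cost difference is one plausible reason for a design where doc_freq ignores deletes.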