
New exact phrase searching feature (for HTML)

Open AA-Turner opened this issue 1 year ago • 11 comments

Re-opened version of #4254. (See https://github.com/sphinx-doc/sphinx/pull/4254#issuecomment-2226791445)

New exact phrase searching feature (for HTML)

I've just rebased the old PR and updated it. However, I'm not sure that this is the best implementation now, given that we have split "display" logic from "search" logic -- so if it's best to close this PR and start anew, then I won't object.

Closes #3301


AA-Turner avatar Jul 13 '24 06:07 AA-Turner

This is a clever way to implement the feature without having to change the format of the search index (searchindex.js) file! My main concern is that it relies on the search summary functionality (an HTTP GET of the complete content of each result), meaning that the retrieval behaviour of some search queries becomes entangled with what is otherwise mainly a result-formatting config setting (html_show_search_summary -- on by default, but even so, as a user/project maintainer I would not expect that setting to alter search query capabilities).
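As I understand that flow, it looks roughly like the following (a minimal Python sketch for illustration only -- the real implementation is client-side JavaScript, and fetch_text is an invented stand-in for the per-result HTTP GET):

```python
# Hypothetical sketch of the summary-fetch approach: after the normal
# term-based search returns candidate documents, fetch each candidate's
# full text and keep only the documents containing the quoted phrase
# verbatim. All names here are invented for illustration.

def filter_by_phrase(candidates, phrase, fetch_text):
    """candidates: iterable of doc ids; fetch_text: doc id -> full text."""
    phrase = phrase.lower()
    results = []
    for doc_id in candidates:
        text = fetch_text(doc_id).lower()  # one HTTP GET per candidate
        if phrase in text:
            results.append(doc_id)
    return results
```

The cost concern is visible in the loop: one fetch per candidate document, regardless of how many actually contain the phrase.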

An alternative implementation I have in mind would involve storing the location of each term in the documents it was found in as part of searchindex.js -- and then a phrase-query would check that all of the words within the phrase appear adjacently at least once. However, that's significantly more effort.

Also agreed with @picnixz that some test coverage would be good if+when we add this.

Edit: rephrase; I shouldn't have suggested that this is an incomplete approach.

jayaddison avatar Jul 13 '24 09:07 jayaddison

An alternative implementation I have in mind would involve storing the location of each term in the documents it was found in as part of searchindex.js -- and then a phrase-query would check that all of the words within the phrase appear adjacently at least once. However, that's significantly more effort.

Note: stopwords (the, a, it...) are a challenge with this approach, because their positions aren't stored. The trick is to remove them from each phrase in the input query too.

Then a hypothetical document 15, with contents The example on page A is a useful example!, might tokenize to _ example _ page _ _ _ useful example => document term positions example: {15: [2, 9]}, page: {15: [4]}, useful: {15: [8]}...
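The indexing side of that could be sketched like this (Python for illustration; the stopword list here is a tiny invented one, not Sphinx's real list):

```python
STOPWORDS = {"the", "on", "a", "is", "an", "it"}  # tiny illustrative list

def term_positions(doc_id, text):
    """Map each non-stopword term to its 1-based word offsets in the doc."""
    positions = {}
    for offset, word in enumerate(text.lower().split(), start=1):
        word = word.strip("!.,")
        if word in STOPWORDS:
            continue  # stopword positions are not stored
        positions.setdefault(word, {}).setdefault(doc_id, []).append(offset)
    return positions

index = term_positions(15, "The example on page A is a useful example!")
# index["example"] == {15: [2, 9]}, index["page"] == {15: [4]},
# index["useful"] == {15: [8]}
```

Note that the stopwords still occupy offsets (the gaps marked _ above); only their identities are discarded.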

...and a query for example on page could tokenize to example _ page => query term positions {example: 1}, {<ANY>: 2}, {page: 3} -- and now we need to match documents where both example and page appear, then filter those results to cases where each matched term from the tokenized phrase has a corresponding next-match (allowing the ANY wildcard) at the same offset. And then an exact-match phase to eliminate incorrect wildcard matches (e.g. example code page in an unrelated document 14).
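The offset-alignment check might look something like this (a self-contained Python sketch; term and document data are invented to match the example above, and the final exact-match verification pass would still be needed afterwards to reject wildcard false positives):

```python
def phrase_match(doc_positions, query_terms):
    """doc_positions: term -> list of word offsets in one document.
    query_terms: list of (term, relative_offset) pairs for the
    non-stopword words of the quoted phrase; stopwords become gaps
    in the offsets. Returns True when every term aligns to a common
    base position (an approximate match -- wildcard gaps can admit
    false positives, so an exact-match pass should follow)."""
    first_term, first_off = query_terms[0]
    for base in doc_positions.get(first_term, []):
        start = base - first_off
        if all(start + off in doc_positions.get(term, [])
               for term, off in query_terms[1:]):
            return True
    return False

doc15 = {"example": [2, 9], "page": [4], "useful": [8]}
# "example on page" -> [("example", 1), ("page", 3)]; "on" is a gap.
phrase_match(doc15, [("example", 1), ("page", 3)])  # True: offsets 2, 4 align
phrase_match(doc15, [("useful", 1), ("page", 2)])   # False: no alignment
```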

Perhaps I should try to link to some kind of information retrieval coursebook or online resource, but I wanted to mention some of that to have it in context here while it's on my mind.

It's doable and probably quite a challenging and satisfying implementation, but there would be quirks and details.

Edit: add exact-match post-filter step Edit: don't imply in the description that we'll definitely implement this

jayaddison avatar Jul 13 '24 11:07 jayaddison

The trick is to remove them from each phrase in the input query too.

This may lead to a lot of false positives.

Perhaps I should try to link to some kind of information retrieval coursebook or online resource, but I wanted to mention some of that to have it in context here while it's on my mind.

One idea is to use n-gram predictions but I'm not sure if it will be sufficient (and efficient).

picnixz avatar Jul 13 '24 12:07 picnixz

Despite my initial flip-out about an implementation that isn't index-driven, I would note that this is the most-requested search-related feature in the bugtracker. That's bringing me more towards acceptance of it.

jayaddison avatar Jul 15 '24 14:07 jayaddison

If a user runs a query for golang "code example" and no pages contain the phrase 'code example', but pages do contain 'golang', do we have a preferred outcome? (zero results, or include the non-phrase results)

jayaddison avatar Jul 17 '24 11:07 jayaddison

For me, quoted text should mean exactly what I want to search for, even if the query contains other terms as well. Quotes mean "I want that exact string, I don't want anything else". So maybe we could add a warning saying "remove quotes"?

picnixz avatar Jul 17 '24 13:07 picnixz

Google presents the following ideas:

[two screenshots of Google search results illustrating how quoted / exact-phrase queries are handled]


AA-Turner avatar Jul 17 '24 14:07 AA-Turner

Hmm. Would searching for "that with" on a large (English-language example, but generalizable to others) documentation set potentially launch many, many HTTP GET requests with this?

jayaddison avatar Jul 17 '24 14:07 jayaddison

Hmm. Would searching for "that with" on a large (English-language example, but generalizable to others) documentation set potentially launch many, many HTTP GET requests with this?

Hm. Fortunately not, thanks to both of those being EN-language stopwords.
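Concretely, if stopword filtering runs before any retrieval, an all-stopword phrase yields an empty term list and therefore no candidate documents to fetch (a sketch with an invented stopword list, for illustration):

```python
STOPWORDS = {"that", "with", "the", "a", "an", "of"}  # tiny illustrative list

def searchable_terms(phrase):
    """Drop stopwords before retrieval; an empty result means there are
    no candidate documents to look up, hence no per-document fetches."""
    return [w for w in phrase.lower().split() if w not in STOPWORDS]

searchable_terms("that with")     # -> [] : nothing to fetch
searchable_terms("code example")  # -> ["code", "example"]
```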

jayaddison avatar Jul 17 '24 14:07 jayaddison

For me, quotes should be what I want to search, even if it contains the rest. Quotes mean "I want that exact string, I don't want anything else". So maybe we could add a warning saying "remove quotes"?

That seems simple, and as a user, if I've intentionally used quotes to get exact-match results, then I can probably figure out that all quoted phrases must match when my query contains several of them.
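Splitting a query into quoted phrases (all required) and loose terms could be as simple as (a hypothetical syntax sketch -- not the current Sphinx query parser):

```python
import re

def parse_query(query):
    """Separate quoted phrases (exact-match, all required) from
    loose terms. Illustrative only; ignores nested/unbalanced quotes."""
    phrases = re.findall(r'"([^"]+)"', query)
    loose = re.sub(r'"[^"]*"', " ", query).split()
    return phrases, loose

parse_query('golang "code example" "unit test"')
# -> (['code example', 'unit test'], ['golang'])
```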

I think my largest concern about these changes remains the efficiency/time-cost of the client reading through the entire contents of documents for matches. I could draft an ngram-based solution? (this time using inter-term ngrams, as compared to the intra-term ngrams in #12596).

jayaddison avatar Jul 17 '24 14:07 jayaddison

I think my largest concern about these changes remains the efficiency/time-cost of the client reading through the entire contents of documents for matches. I could draft an ngram-based solution? (this time using inter-term ngrams, as compared to the intra-term ngrams in #12596).

Idea: when indexing tri-grams in the manner proposed in #12596, add the following handling:

  • During indexing, keep track of the word before/preceding each term, provided that it is part of the same block of text (paragraph/sentence).
  • If the trigram being created is at the start/prefix of a word, then include the trigram of the suffix of the previous word in the term list. So, when indexing the phrase context matters (term offsets zero and one respectively), the trigram for ext would point to term-offset zero, the trigram for ers would point to term-offset one, and the trigram for mat -- seemingly oddly -- would point to both term-offsets zero and one.
  • During phrase queries, we would begin by collecting all of the starting-edge trigrams from the query phrase. If we query for "context matters", this returns [0], [0, 1] -- and we can observe that there is a valid, ordered path along the ordered terms. If we queried for "context indeed matters", then we would retrieve [0], [2], [0, 1] or something along those lines -- the second result ([2]) doesn't contain a path back to the first result ([0]), and so this phrase does not exist in the document collection.
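The indexing rule from the bullets above could be sketched like this (Python for illustration; the offset sets here are global across the collection, per the caveat below):

```python
def edge_trigram_index(words):
    """Map each word-edge trigram to the term offsets it witnesses.
    Rule (as proposed above, hypothetical): a word's final trigram
    points to that word's offset; a word's initial trigram points to
    its own offset AND, when the word has a predecessor in the same
    block of text, to the predecessor's offset -- encoding adjacency."""
    index = {}
    for i, word in enumerate(words):
        if len(word) < 3:
            continue
        index.setdefault(word[-3:], set()).add(i)  # suffix trigram
        initial = index.setdefault(word[:3], set())
        initial.add(i)                             # prefix trigram
        if i > 0:
            initial.add(i - 1)                     # adjacency link
    return index

idx = edge_trigram_index(["context", "matters"])
# idx["ext"] == {0}; idx["ers"] == {1}; idx["mat"] == {0, 1}
```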

The flaw in all of the above reasoning is that it is global across the entire document collection. It may be preferable to have per-document filtering, because otherwise query performance may be inconsistent (some very fast queries where the phrase is known not to exist at all -- but then slow queries where we still have to check every document).

jayaddison avatar Jul 19 '24 15:07 jayaddison