Retrieve term positions within documents

Open shachibista opened this issue 3 years ago • 4 comments

Hi!

Is there any way to retrieve term positions from the documents? I want to implement a term-highlighting functionality. Is there something for this already built-in?

Thanks!

shachibista avatar Oct 31 '22 08:10 shachibista

A possible solution seems to be to get the term positions using the Index Reader API, and then use the analyser to get the character offsets of the terms in the document.

from pyserini.analysis import get_lucene_analyzer
from pyserini.pyclass import autoclass

# Lucene attribute that carries each token's character offsets
JOffsetAttribute = autoclass('org.apache.lucene.analysis.tokenattributes.OffsetAttribute')

analyser = get_lucene_analyzer()
text = "City buses are running on time"
ts = analyser.tokenStream(None, text)

# A TokenStream must be reset before the first incrementToken() call
ts.reset()
while ts.incrementToken():
    offset = ts.attributes.get(JOffsetAttribute)

    # Offsets index into the original, un-analyzed text
    start = offset.startOffset()
    end = offset.endOffset()

    print((start, end, text[start:end]))
ts.end()
ts.close()
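
Once the (start, end) offsets are collected, the highlighting itself can be done in plain Python. A minimal sketch, assuming the offsets are character positions into the original string and do not overlap; `highlight_spans` and the `<mark>` markers are hypothetical names for illustration, not part of Pyserini:

```python
def highlight_spans(text, spans, open_tag="<mark>", close_tag="</mark>"):
    """Wrap each (start, end) character span of `text` in highlight markers.

    Assumes the spans do not overlap; they are sorted here before use.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])  # untouched text before the span
        out.append(open_tag + text[start:end] + close_tag)
        cursor = end
    out.append(text[cursor:])  # remainder after the last span
    return "".join(out)


text = "City buses are running on time"
# Offsets of "buses" and "running", hard-coded here so the sketch
# runs without a JVM; in practice they would come from the loop above.
print(highlight_spans(text, [(5, 10), (15, 22)]))
```

Which spans to pass in still has to be decided separately, for example by comparing each analyzed token against the analyzed query terms.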

shachibista avatar Nov 01 '22 15:11 shachibista

Hi @shachibista - thanks for your issue, hope you're finding Pyserini useful.

Looping in @ola13 - who faced exactly the same issue recently... thoughts?

lintool avatar Nov 01 '22 21:11 lintool

Hi @shachibista! I'm working on open sourcing my server code, will answer with links to that code in a day or two.

ola13 avatar Nov 02 '22 11:11 ola13

Hi - so - I'm using this code to get highlighted terms https://github.com/huggingface/roots-search-tool/blob/main/web/server.py#L101-L103

What it does is the following:

  1. use analyzer to get query terms
  2. use analyzer to get document terms
  3. save each word from a document which matches a query term
  4. highlight anything in that list
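
Roughly, the four steps above could be sketched in plain Python like this. It is only an illustration, not the code from the linked server: `toy_analyze` is a hypothetical stand-in for the Lucene analyzer (lowercasing plus a crude suffix strip), and `matching_words` and `highlight` are made-up helper names.

```python
import re


def toy_analyze(word):
    """Hypothetical stand-in for an analyzer: lowercase and strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def matching_words(query, document):
    # 1. analyze the query into a set of terms
    query_terms = {toy_analyze(w) for w in re.findall(r"\w+", query)}
    # 2.-3. analyze each document word and save the surface forms that match
    matches = []
    for word in re.findall(r"\w+", document):
        if toy_analyze(word) in query_terms and word not in matches:
            matches.append(word)
    return matches


def highlight(document, words):
    # 4. highlight every occurrence of any saved word
    if not words:
        return document
    pattern = r"\b(" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.sub(pattern, r"**\1**", document)


doc = "City buses are running on time. The bus is late."
words = matching_words("bus running", doc)
print(words)
print(highlight(doc, words))
```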

The problem with this solution is that I don't actually get any feedback on which term was matched in a given document, so I highlight every possible match instead - if that makes sense?

ola13 avatar Nov 02 '22 18:11 ola13