IR-Reproducibility
Missing relevant documents (in particular for positional runs)
I know this might open a can of worms, but just to let you know: when we saw positional runs doing worse than non-positional runs, we started examining documents manually, doing a partial re-evaluation of the results that made MAP worse.
The outcome is that with positions the results are actually better, but the documents you get are not marked as relevant because no one found them at TREC. Indeed, for us the "going worse" behaviour is strong in 2004, OK in 2005, but becomes an improvement in 2006. Galago has EXACTLY the same behaviour (albeit with different scores).
My 2€¢ is that initially nobody was using positions. As the track went on, more and more systems used positions (I remember Indri winning in 2005 or 2006 using ordered and unordered windows).
Thus, the dataset strongly penalizes systems using positions in 2004, less so in 2005, and in 2006 positions give improvements.
Don't ask me for p-values on this, but it is a fact that, unfortunately, pooled evaluation grows old very quickly :(.
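To make the mechanism concrete, here is a toy sketch in Python of how an unjudged document drags average precision down. Everything in it is made up for illustration: the document ids, the judgments and the two rankings are hypothetical, and the helper names `average_precision` and `unjudged_fraction` are mine, not part of any tool. The only thing taken from practice is the convention that a document missing from the qrels counts as non-relevant, which is what trec_eval does.

```python
def average_precision(ranking, qrels):
    """AP for one topic. Documents absent from qrels count as non-relevant."""
    num_relevant = sum(1 for rel in qrels.values() if rel > 0)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if qrels.get(doc_id, 0) > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant


def unjudged_fraction(ranking, qrels, k=10):
    """Fraction of the top-k documents that have no judgment at all."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id not in qrels) / len(top)


# Hypothetical pool: d9 was never retrieved by a pooled system, so it is unjudged.
qrels = {"d1": 1, "d2": 1, "d3": 0}
baseline = ["d1", "d2", "d3"]      # stands in for a non-positional run
positional = ["d9", "d1", "d2"]    # stands in for a positional run

print(average_precision(baseline, qrels))        # baseline looks perfect
print(average_precision(positional, qrels))      # positional looks worse
print(unjudged_fraction(positional, qrels, k=3)) # a third of its top 3 is unjudged

# Same rankings after a manual re-judgment that marks d9 as relevant.
rejudged = dict(qrels, d9=1)
print(average_precision(baseline, rejudged))
print(average_precision(positional, rejudged))   # the ordering of the two runs flips
```

Checking the fraction of unjudged documents in the top ranks before comparing MAP across runs is a cheap way to spot this situation, although it says nothing about whether the unjudged documents are actually relevant.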