
Indexer picks up Apache WIKI's DIFF pages

Open arafalov opened this issue 7 years ago • 1 comment

It seems that the indexer follows and indexes the History/Diff pages on the Wiki. That's bad for the index, which fills up with near-duplicate results, and probably for the Wiki's own performance as well.

Sample query: http://find.searchhub.org/search?query=(fq:(%270%27:(key:%27%7B!!tag%3Dprj%7Dproject_label%27,tag:prj,transformer:localParams,values:(%270%27:%27Apache%20Tika%27)),%271%27:(key:%27%7B!!tag%3Dds%7Ddatasource_label%27,tag:ds,transformer:localParams,values:(%270%27:%27Tika%20Wiki%27))),q:%27parse%20context%27,rows:10,start:0,uuid:%2796b3372a-0cf8-4be1-9ec3-46b3c32d7e55%27,wt:json)

What I see in the results is multiple History and Diff pages for the "RecursiveMetadata - Tika Wiki" page.

arafalov, Mar 13 '17 14:03

@lasyamarla The fix here should remove all "action=" pages from the index. That gets rid of the diffs as well as a few other kinds of action pages that shouldn't be indexed. Branch: https://github.com/lucidworks/searchhub/tree/GH-128-index-wiki-diffs

yannyu, Jul 27 '17 01:07
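
For anyone reading along, here is a minimal sketch of what dropping "action=" pages could look like. This is not the code from the branch above, just an illustration of the idea. It assumes the wiki exposes diffs and history as an `action` query parameter, and the function names and `wiki.example.org` URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def is_action_page(url: str) -> bool:
    """Return True if the URL carries an `action` query parameter.

    Assumption: diff/history/info views are reached via `?action=...`.
    """
    query = urlsplit(url).query
    # keep_blank_values=True so that a bare "action=" is also caught.
    return "action" in parse_qs(query, keep_blank_values=True)


def filter_indexable(urls):
    """Keep only URLs that are not action pages (hypothetical helper)."""
    return [u for u in urls if not is_action_page(u)]


if __name__ == "__main__":
    # Illustrative URLs only, not taken from the real crawl.
    crawled = [
        "https://wiki.example.org/tika/RecursiveMetadata",
        "https://wiki.example.org/tika/RecursiveMetadata?action=diff&rev1=1&rev2=2",
        "https://wiki.example.org/tika/RecursiveMetadata?action=info",
    ]
    for url in filter_indexable(crawled):
        print(url)
```

Parsing the query string, rather than matching the substring "action=", avoids dropping pages whose path or other parameters merely contain that text. The same check could be applied at crawl time (so the links are never fetched) or as a cleanup pass over documents that are already indexed.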