ProcessDocumentHeader() in WikipediaDumpProcessor should use analyzer.

Open MikeHopcroft opened this issue 9 years ago • 0 comments

Currently, ProcessDocumentHeader() does not pass the document title through the Lucene analyzer. This causes problems with titles that contain colons. For example, in the file AA\wiki_83, document 11327 (https://en.wikipedia.org/?curid=11327) has the title "Wikipedia:Free On-line Dictionary of Computing/symbols - B". Because this title is not passed through the Lucene tokenizer, the colon survives and we end up with the term "Wikipedia:Free" in the Document Frequency Table. When we then use the Document Frequency Table as a source of test queries, we try to parse the query "Wikipedia:Free" and fail because the parser treats "Wikipedia" as a stream name prefix.
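To make the failure mode concrete, here is a rough sketch in Python rather than the project's own code. It does not call Lucene: `stand_in_analyzer_terms` is a hypothetical stand-in that simply lowercases and keeps alphanumeric runs, and `has_stream_prefix_syntax` is an assumed simplification of the query parser's rule that text before a colon names a stream. Neither is claimed to match the real analyzer or parser exactly.

```python
import re


def whitespace_terms(title):
    """Split on whitespace only, mimicking a title that bypasses analysis."""
    return title.split()


def stand_in_analyzer_terms(title):
    """Hypothetical stand-in for an analyzer (NOT Lucene's actual rules):
    lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", title.lower())


def has_stream_prefix_syntax(term):
    """Assumed parser rule: a colon makes the text before it look like
    a stream name prefix."""
    return ":" in term


title = "Wikipedia:Free On-line Dictionary of Computing/symbols - B"

raw = whitespace_terms(title)
analyzed = stand_in_analyzer_terms(title)

# Terms that a parser following the assumed rule would misread.
print([t for t in raw if has_stream_prefix_syntax(t)])
print([t for t in analyzed if has_stream_prefix_syntax(t)])
```

With the whitespace-only split, "Wikipedia:Free" survives as a single term and trips the assumed stream-prefix rule. With the stand-in analyzer, no term contains a colon, so a query generated from those terms would not be mistaken for a stream-qualified query.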

Oct 26 '16 04:10 MikeHopcroft