ProcessDocumentHeader() in WikipediaDumpProcessor should use analyzer.
Currently ProcessDocumentHeader() does not use the Lucene analyzer for the document title. This leads to problems with terms that contain colons. As an example, in the file AA\wiki_83, document 11327 https://en.wikipedia.org/?curid=11327 has the title "Wikipedia:Free On-line Dictionary of Computing/symbols - B". Since this title is not passed through the Lucene tokenizer, the colon makes it through and we end up with the term "Wikipedia:Free" in the Document Frequency Table. When we use the Document Frequency Table as a source of test queries, we try to parse the query "Wikipedia:Free" and fail because the parser thinks that "Wikipedia" is a stream name prefix.
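The effect can be illustrated with a minimal, self-contained sketch. It does not use Lucene and is not the Workbench code: `TitleTokenizationSketch`, `whitespaceTerms`, and `analyzedTerms` are hypothetical names, the whitespace split is only an assumption about how an unanalyzed title could be turned into terms, and the regex-based splitter is a crude stand-in for a real analyzer (which applies its own tokenization rules).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the actual WikipediaDumpProcessor code.
public class TitleTokenizationSketch {

    // Assumed "unanalyzed" behavior: split on whitespace only,
    // so punctuation such as ':' and '/' stays inside the terms.
    static List<String> whitespaceTerms(String title) {
        List<String> terms = new ArrayList<>();
        for (String piece : title.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // Crude stand-in for an analyzer: split on any run of characters that
    // are not letters or digits, then lowercase each term.
    static List<String> analyzedTerms(String title) {
        List<String> terms = new ArrayList<>();
        for (String piece : title.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String title = "Wikipedia:Free On-line Dictionary of Computing/symbols - B";
        System.out.println(whitespaceTerms(title));
        System.out.println(analyzedTerms(title));
    }
}
```

With the title from document 11327, the whitespace-only split keeps "Wikipedia:Free" as a single term, while the analyzer-style split yields "wikipedia" and "free" separately, so no term containing a colon would reach the Document Frequency Table.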