wikiextractor
Last category in article not checked correctly
Consider this fake article:
<page>
  <title>bug demo</title>
  <ns>0</ns>
  <id>245544</id>
  <revision>
    <id>71636704</id>
    <parentid>69945883</parentid>
    <timestamp>2019-02-12T11:24:39Z</timestamp>
    <contributor>
      <username>Uuu1996</username>
      <id>1195068</id>
    </contributor>
    <comment>blah</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">
[[Category:one]]
[[Category:two]]</text>
    <sha1>acht3hj6108tk2v3ngv7b05aerdsefc</sha1>
  </revision>
</page>
Because of the way the loop that builds the list of categories works (see catSet in pages_from), the last category is not added to catSet here, so filtering on categories doesn't work. The last category shares its line with the closing </text> tag, and lines that contain a tag are not checked for categories. If a newline is added after [[Category:two]], the category is correctly added to catSet and will be detected.
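To make the failure easier to see, here is a minimal sketch that mimics the reported behaviour outside of wikiextractor. The names `cat_re` and `collect_categories_buggy` and the simplified tag handling are assumptions made for illustration; they are not the project's actual code.

```python
import re

# Simplified stand-in for the project's category pattern (an assumption).
cat_re = re.compile(r'\[\[Category:([^\|\]]+)')


def collect_categories_buggy(lines):
    """Only look for categories on lines that contain no '<' character."""
    cats = set()
    in_text = False
    for line in lines:
        if '<' not in line:
            if in_text:
                m = cat_re.search(line)
                if m:
                    cats.add(m.group(1))
            continue
        # Lines containing a tag end up here and are never checked for categories.
        if '<text' in line:
            in_text = True
        if '</text>' in line:
            in_text = False
    return cats


# The last category shares its line with the closing tag, as in the fake article.
article = [
    '<text xml:space="preserve">\n',
    '[[Category:one]]\n',
    '[[Category:two]]</text>\n',
]

# Same article, but with a newline added after the last category.
article_with_newline = [
    '<text xml:space="preserve">\n',
    '[[Category:one]]\n',
    '[[Category:two]]\n',
    '</text>\n',
]

print(collect_categories_buggy(article))
print(collect_categories_buggy(article_with_newline))
```

In this sketch the first call misses `two`, while the second call finds both categories, which matches the behaviour described above.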
I also experienced this issue; thanks for pointing out the location of the responsible code. I fixed it by moving the "extract categories" code block up, so that it runs before the check for '<':
for line in input:
    if not isinstance(line, text_type): line = line.decode('utf-8')
    # extract categories
    if line.lstrip().startswith('[[Cat'):
        mCat = catRE.search(line)
        if mCat:
            catSet.add(mCat.group(1))
    if '<' not in line:  # faster than doing re.search()
        if inText:
            page.append(line)
        continue
    m = tagRE.search(line)
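As a rough check of this idea, the following standalone sketch runs the category test before the '<' shortcut. `cat_re` and `collect_categories_fixed` are hypothetical names, and the pattern is a simplified stand-in for the project's own regex, so treat this as an illustration rather than a patch.

```python
import re

# Simplified stand-in for the project's category pattern (an assumption).
cat_re = re.compile(r'\[\[Category:([^\|\]]+)')


def collect_categories_fixed(lines):
    """Check every line for a category before the '<' shortcut is taken."""
    cats = set()
    for line in lines:
        # extract categories first, even if the line also contains a tag
        if line.lstrip().startswith('[[Cat'):
            m = cat_re.search(line)
            if m:
                cats.add(m.group(1))
        if '<' not in line:
            continue
        # tag handling would follow here in the real loop
    return cats


# The last category shares its line with the closing tag.
sample = [
    '<text xml:space="preserve">\n',
    '[[Category:one]]\n',
    '[[Category:two]]</text>\n',
]

print(collect_categories_fixed(sample))
```

One side effect of this ordering is that the category check no longer depends on inText, so a matching line anywhere in the input would be picked up.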
The current codebase is not stable enough (see #216) for me to submit a PR that fixes this.