wikiextractor

Last category in article not checked correctly

Open · polm opened this issue 6 years ago · 1 comment

Consider this fake article:

<page>
    <title>bug demo</title>
    <ns>0</ns>
    <id>245544</id>
    <revision>
      <id>71636704</id>
      <parentid>69945883</parentid>
      <timestamp>2019-02-12T11:24:39Z</timestamp>
      <contributor>
        <username>Uuu1996</username>
        <id>1195068</id>
      </contributor>
      <comment>blah</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">

[[Category:one]]
[[Category:two]]</text>
      <sha1>acht3hj6108tk2v3ngv7b05aerdsefc</sha1>
    </revision>
</page>

Because of the way the loop that builds the set of categories works (see catSet in pages_from), the last category is not added to catSet here, so filtering on categories doesn't work. Note that [[Category:two]] shares its line with the closing </text> tag. If a newline is added after [[Category:two]], the category is correctly added to catSet and will be detected.
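
The behavior can be reproduced outside wikiextractor with a small sketch. This is not the actual pages_from code: the function name, the CAT_RE pattern, and the simplified tag handling are assumptions made for illustration. It only mimics the ordering described above, where categories are looked for solely on lines that contain no '<'.

```python
import re

# Hypothetical stand-in for wikiextractor's catRE; the real pattern may differ.
CAT_RE = re.compile(r'\[\[Category:([^\]|]+)')


def collect_categories_buggy(lines):
    """Simplified loop: categories are only searched on lines without '<'."""
    cats = set()
    in_text = False
    for line in lines:
        if '<' not in line:
            if in_text:
                m = CAT_RE.search(line)
                if m:
                    cats.add(m.group(1))
            continue
        # Lines containing '<' go straight to (very simplified) tag handling,
        # so a category sharing the line with a tag is never inspected.
        if '<text' in line:
            in_text = True
        if '</text>' in line:
            in_text = False
    return cats


# The <text> element from the fake article above.
page = [
    '<text xml:space="preserve">\n',
    '\n',
    '[[Category:one]]\n',
    '[[Category:two]]</text>\n',
]

# Same content, but with a newline after the last category.
page_with_newline = page[:3] + ['[[Category:two]]\n', '</text>\n']

print(collect_categories_buggy(page))  # {'one'}
print(sorted(collect_categories_buggy(page_with_newline)))  # ['one', 'two']
```

In this sketch the only difference between the two inputs is whether the last category sits on the same line as </text>.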

polm, Oct 26 '19 10:10

I also ran into this issue; thanks for pointing out the location of the responsible code. I fixed it by moving the "extract categories" code block up, ahead of the '<' check:

    for line in input:
        if not isinstance(line, text_type): line = line.decode('utf-8')
        # extract categories
        if line.lstrip().startswith('[[Cat'):
            mCat = catRE.search(line)
            if mCat:
                catSet.add(mCat.group(1))
        if '<' not in line:  # faster than doing re.search()
            if inText:
                page.append(line)
            continue
        m = tagRE.search(line)
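
To check the reordering in isolation, here is a minimal self-contained sketch. It is not the wikiextractor code itself: collect_categories_fixed and CAT_RE are hypothetical names, the prefix check uses '[[Category:' rather than '[[Cat', and tag handling is left out. It only shows that looking for a category before the '<' shortcut picks up a category that shares its line with </text>.

```python
import re

# Hypothetical stand-in for wikiextractor's catRE; the real pattern may differ.
CAT_RE = re.compile(r'\[\[Category:([^\]|]+)')


def collect_categories_fixed(lines):
    """Simplified loop: the category check runs before the '<' shortcut."""
    cats = set()
    for line in lines:
        # Extract categories first, so lines that also carry a tag are inspected.
        if line.lstrip().startswith('[[Category:'):
            m = CAT_RE.search(line)
            if m:
                cats.add(m.group(1))
        if '<' not in line:
            continue
        # Tag handling (tagRE.search in the snippet above) would follow here.
    return cats


page = [
    '<text xml:space="preserve">\n',
    '\n',
    '[[Category:one]]\n',
    '[[Category:two]]</text>\n',
]

print(sorted(collect_categories_fixed(page)))  # ['one', 'two']
```

As in the snippet above, this version no longer requires being inside a <text> element before looking for categories; whether that matters for real dumps is not something this sketch verifies.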

The current codebase is not stable enough (#216) for a PR to fix this.

sandertan, Sep 15 '20 07:09