
Help! Do the dump files contain the wikitables from Wikipedia?

Open HamLaertes opened this issue 4 years ago • 1 comment

Hello everyone. I downloaded the first file, enwiki-20210220-pages-articles1.xml-p1p41242.bz2, from the Wikipedia dump server. I successfully got the extracted text after running the script. However, the output seems to ignore the table information in the wiki pages, i.e. the wikitables. Am I missing something, or do the dump files not contain the table information at all? Thanks!
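
A quick way to check whether the raw dump itself carries table markup is to scan it for the wikitable opening delimiter {|. A minimal sketch, assuming the filename above and Python's standard bz2 module:

    import bz2

    # Scan the raw dump for the wikitable opening delimiter "{|" to
    # confirm that table markup is present in the XML itself.
    # Filename taken from the post; adjust the path as needed.
    with bz2.open("enwiki-20210220-pages-articles1.xml-p1p41242.bz2",
                  "rt", encoding="utf-8") as f:
        for line in f:
            if line.lstrip().startswith("{|"):
                print(line.rstrip())
                break

If lines starting with {| show up, the tables are present in the dump, and any loss must be happening during extraction.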

HamLaertes avatar Feb 27 '21 06:02 HamLaertes

I think I've found the answer myself. The dump files do contain the wikitable information, just in a different form. Adding the --html argument may help extract the wikitables more directly. However, the code seems to have a bug when converting the XML to HTML: it reports a KeyError, as follows:

File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/storage/fbzhu/yc/wikiextractor/wikiextractor/WikiExtractor.py", line 467, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 857, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 847, in clean_text
    text = compact(text, mark_headers=mark_headers)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 256, in compact
    page.append(listItem[n] % line)
KeyError: '&'

I am using the XML files dumped on 20 Feb 2021 and wikiextractor version 3.0.5.
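
From the traceback, the failure is the listItem[n] lookup in compact(): n appears to be a character scanned from the start of a list line, and '&' (presumably from an HTML entity at the start of the line) is not one of the expected list markers. The following is a hypothetical minimal reconstruction of the failing pattern, with a defensive fallback; the dictionary contents and names are assumptions, not the actual wikiextractor source:

    # Hypothetical reconstruction of the failing pattern in compact()
    # (wikiextractor/extract.py) -- names and templates are assumptions,
    # not the actual source code.
    listItem = {
        "*": "<li>%s</li>",
        "#": "<li>%s</li>",
        ";": "<dt>%s</dt>",
        ":": "<dd>%s</dd>",
    }

    def wrap_list_line(marker: str, body: str) -> str:
        # Original pattern: listItem[marker] % body raises KeyError when
        # the marker character is not one of the four list prefixes
        # (here, '&' from an HTML entity at the start of the line).
        template = listItem.get(marker)
        if template is None:
            return body  # fall back to the raw line instead of crashing
        return template % body

    print(wrap_list_line("*", "an item"))    # <li>an item</li>
    print(wrap_list_line("&", "&ndash; x"))  # passes through unchanged

With a .get() fallback like this, unexpected marker characters would pass through as plain text instead of killing the extraction worker.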

HamLaertes avatar Feb 27 '21 07:02 HamLaertes