wikiextractor
Templates don't get expanded
Any idea why none of the templates get expanded? I ran WikiExtractor.py an initial time and saved all templates to a file (named "templates", it's 2358539 lines long) to try to debug. I'm trying to extract all Wiktionary articles, but the resulting text looks like this (blank text in place of templates):
"
This was the command I ran: python WikiExtractor.py -o extracted --debug --templates templates enwiktionary-sample-pages-articles.xml
This was the output:
INFO: Loading template definitions from: templates
INFO: Loaded 74373 templates in 24.6s
INFO: Starting page extraction from enwiktionary-sample-pages-articles.xml.
INFO: Using 7 extract processes.
INFO: 16 dictionary
INFO: 19 free
INFO: 20 thesaurus
DEBUG: EXPAND also|Dictionary
INFO: 27 encyclopedia
DEBUG: Quit extractor
INFO: 29 portmanteau
DEBUG: Quit extractor
DEBUG: <EXPAND Template:Also
DEBUG: EXPAND wikipedia|dab=Dictionary (disambiguation)|Dictionary
DEBUG: <EXPAND Template:Wikipedia
DEBUG: EXPAND PIE root|en|deyḱ
DEBUG: TEMPLATE Template:PIE root: {{catlangname|{{{1|}}}|terms derived from the PIE root *{{{2|}}}-{{#if:{{{id|{{{id1|}}}}}}| ({{{id|{{{id1|}}}}}})}}}}{{#if:{{{3|}}}|{{catlangname|{{{1|}}}|terms derived from the PIE root *{{{3}}}-{{#if:{{{id2|}}}| ({{{id2|}}})}}}}}}{{#if:{{{4|}}}|{{catlangname|{{{1|}}}|terms derived from the PIE root *{{{4}}}-{{#if:{{{id3|}}}| ({{{id3|}}})}}}}}}
DEBUG: EXPAND catlangname|en|terms derived from the PIE root *deyḱ-{{#if:| ()}}
DEBUG: <EXPAND Template:Catlangname
DEBUG: EXPAND #if:|{{catlangname|en|terms derived from the PIE root *-{{#if:| ()}}}}
DEBUG: EXPAND also|-free
DEBUG: <EXPAND #if
DEBUG: EXPAND #if:|{{catlangname|en|terms derived from the PIE root *-{{#if:| ()}}}}
DEBUG: EXPAND also|Thesaurus|thésaurus
DEBUG: <EXPAND #if
DEBUG: <EXPAND Template:PIE root
DEBUG: EXPAND bor|en|ML.|dictionarium|withtext=1
DEBUG: <EXPAND Template:Bor
DEBUG: EXPAND der|en|la|dictionarius
DEBUG: EXPAND was wotd|2007|March|8
DEBUG: <EXPAND Template:Der
DEBUG: EXPAND wikipedia
I have been working on extracting templates for months and this looks like an amazing tool if I can get it to work. Thanks for all the work you all are doing on it!
@dnishiyama Do you still have this issue? I also encountered a similar problem, and it seems that there is an issue with the current script when it's applied to Wiktionary dumps. Specifically, when it expands templates, it tries to "normalize" template titles by converting the first letter of the template to upper case, although template titles are stored without normalization.
After removing those applications of ucfirst, things seem to be working correctly for me.
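To illustrate the mismatch described in this comment, here is a minimal sketch. It is not WikiExtractor's actual code: the `templates` dict, the `ucfirst` helper as written here, and both lookup functions are hypothetical stand-ins, and it assumes definitions are stored under the title exactly as it appears in the dump.

```python
def ucfirst(title):
    """Upper-case the first character and leave the rest untouched."""
    return title[:1].upper() + title[1:]


# Hypothetical template store: keys are the titles exactly as written
# in the dump, with no normalization applied.
templates = {
    "also": "See also: {{{1}}}",
    "PIE root": "...",
}


def lookup_normalized(name):
    """Look up a template after normalizing its title with ucfirst."""
    return templates.get(ucfirst(name))


def lookup_tolerant(name):
    """Try the title as written first, then the ucfirst form."""
    for key in (name, ucfirst(name)):
        if key in templates:
            return templates[key]
    return None


print(lookup_normalized("also"))  # the lower-case key is never found
print(lookup_tolerant("also"))
```

With a store like this, the normalized lookup asks for "Also" while the definition sits under "also", so the expansion comes back empty. Trying the title as written before falling back to the normalized form is one way to cover both storage conventions.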
Thanks for the reply. I do still have the issue and have since moved on to a different technique to gather this data from Wiktionary (scrapy + bs4). If I get a chance I'll check out your recommendation. This would be a much better option if it does work.
I am also encountering this problem on the July 20th, 2018 English Wikipedia dump. Here was my command:
python WikiExtractor.py --o 'articles/' --templates 'templates.temp' --filter_disambig_pages --json 'enwiki.xml'
Here is an example of an incorrectly extracted sentence from Wikipedia Page ID 12.
WikiExtractor Output: The word "anarchism" is composed from the word "anarchy" and the suffix -ism, themselves derived respectively from the Greek , i.e. "anarchy" (from , "anarchos", meaning "one without rulers"; from the privative prefix ἀν- ("an-", i.e. "without") and , "archos", i.e. "leader", "ruler"; (cf. "archon" or , "arkhē", i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix or ("-ismos", "-isma", from the verbal infinitive suffix , "-izein").
Real Wikipedia Value: The word "anarchism" is composed from the word "anarchy" and the suffix -ism, themselves derived respectively from the Greek ἀναρχία, i.e. anarchy (from ἄναρχος, anarchos, meaning "one without rulers"; from the privative prefix ἀν- (an-, i.e. "without") and ἀρχός, archos, i.e. "leader", "ruler"; (cf. archon or ἀρχή, arkhē, i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix -ισμός or -ισμα (-ismos, -isma, from the verbal infinitive suffix -ίζειν, -izein).
I've also found other types of template expansions missing such as distance measurements.
It seems that the template expansions don't work well now. I found a lot of wrongly parsed text in the output.
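One rough way to gauge how much of the output is affected is to scan the extracted text for the gaps a dropped template tends to leave behind, such as a space directly before a comma or an empty pair of parentheses. The sketch below is only a heuristic: `GAP_PATTERNS` and `count_gaps` are hypothetical names, not part of WikiExtractor, and the patterns are assumptions based on the sample sentence earlier in this thread.

```python
import re

# Heuristic patterns (assumptions, not exhaustive): traces that a
# silently dropped template expansion often leaves in plain text.
GAP_PATTERNS = [
    re.compile(r"\s,"),      # whitespace right before a comma: "the Greek , i.e."
    re.compile(r"\(\s*\)"),  # empty parentheses: "()"
]


def count_gaps(text):
    """Count suspicious gaps in a piece of extracted text."""
    return sum(len(pattern.findall(text)) for pattern in GAP_PATTERNS)


broken = 'derived from the Greek , i.e. "anarchy" (from , "anarchos")'
intact = "derived from the Greek ἀναρχία, i.e. anarchy"
print(count_gaps(broken), count_gaps(intact))
```

A count well above zero on pages that render cleanly on the live site suggests that expansions are being dropped rather than the source text being unusual.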
Hi,
I found an old version at http://medialab.di.unipi.it/wiki/Wikipedia_Extractor. It works well.
You need to use python2 to run it.
Thanks!
Hi, thank you for your suggestion! I tried disabling the ucfirst function, so that the string is left unchanged, but it still doesn't work.
Would you mind sharing the updated code on GitHub? I would really appreciate it.
Thank you!