wikiextractor Templates don't get expanded

Any idea why none of the templates get expanded? I ran WikiExtractor.py once first and saved all templates to a file (named "templates"; it's 2358539 lines long) to try to debug. I'm trying to extract all Wiktionary articles, but the resulting text looks like this (blank text in place of templates):

" dictionary , from , from , from , perfect past participle of + . For more, see

This was the command I ran:

python WikiExtractor.py -o extracted --debug --templates templates enwiktionary-sample-pages-articles.xml

This was the output:

INFO: Loading template definitions from: templates
INFO: Loaded 74373 templates in 24.6s
INFO: Starting page extraction from enwiktionary-sample-pages-articles.xml.
INFO: Using 7 extract processes.
INFO: 16 dictionary
INFO: 19 free
INFO: 20 thesaurus
DEBUG: EXPAND also|Dictionary
INFO: 27 encyclopedia
DEBUG: Quit extractor
INFO: 29 portmanteau
DEBUG: Quit extractor
DEBUG: <EXPAND Template:Also
DEBUG: EXPAND wikipedia|dab=Dictionary (disambiguation)|Dictionary
DEBUG: <EXPAND Template:Wikipedia
DEBUG: EXPAND PIE root|en|deyḱ
DEBUG: TEMPLATE Template:PIE root: {{catlangname|{{{1|}}}|terms derived from the PIE root *{{{2|}}}-{{#if:{{{id|{{{id1|}}}}}}| ({{{id|{{{id1|}}}}}})}}}}{{#if:{{{3|}}}|{{catlangname|{{{1|}}}|terms derived from the PIE root *{{{3}}}-{{#if:{{{id2|}}}| ({{{id2|}}})}}}}}}{{#if:{{{4|}}}|{{catlangname|{{{1|}}}|terms derived from the PIE root *{{{4}}}-{{#if:{{{id3|}}}| ({{{id3|}}})}}}}}}
DEBUG: EXPAND catlangname|en|terms derived from the PIE root *deyḱ-{{#if:| ()}}
DEBUG: <EXPAND Template:Catlangname
DEBUG: EXPAND #if:|{{catlangname|en|terms derived from the PIE root *-{{#if:| ()}}}}
DEBUG: EXPAND also|-free
DEBUG: <EXPAND #if
DEBUG: EXPAND #if:|{{catlangname|en|terms derived from the PIE root *-{{#if:| ()}}}}
DEBUG: EXPAND also|Thesaurus|thésaurus
DEBUG: <EXPAND #if
DEBUG: <EXPAND Template:PIE root
DEBUG: EXPAND bor|en|ML.|dictionarium|withtext=1
DEBUG: <EXPAND Template:Bor
DEBUG: EXPAND der|en|la|dictionarius
DEBUG: EXPAND was wotd|2007|March|8
DEBUG: <EXPAND Template:Der
DEBUG: EXPAND wikipedia

I have been working on extracting templates for months and this looks like an amazing tool if I can get it to work. Thanks for all the work you all are doing on it!

Feb 18 '18 21:02 dnishiyama

@dnishiyama Do you still have this issue? I also encountered a similar problem, and it seems that there is an issue with the current script when it's applied to Wiktionary dumps. Specifically, when it expands templates, it tries to "normalize" template titles by converting the first letter of the template to upper case, although template titles are stored without normalization.

After removing those applications of ucfirst, things seem to be working correctly for me.
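To make the mismatch concrete, here is a minimal sketch of how it could arise. This is not the actual WikiExtractor code: the `templates` dictionary, its contents, and the two lookup helpers are hypothetical, and the sketch assumes that the Wiktionary dump stores titles such as "Template:also" with a lowercase first letter. It would be consistent with the debug log above, where "PIE root" (already capitalized) yields a TEMPLATE line while "also" is looked up as "Template:Also" and yields none.

```python
def ucfirst(s):
    """Uppercase only the first character and leave the rest unchanged."""
    return s[:1].upper() + s[1:]


# Hypothetical template store, keyed by the title exactly as assumed
# to appear in the dump (first letter NOT normalized).
templates = {
    "Template:also": "<definition of also>",
    "Template:PIE root": "<definition of PIE root>",
}


def lookup_normalized(name):
    """Look up a template after upper-casing its first letter."""
    return templates.get("Template:" + ucfirst(name))


def lookup_raw(name):
    """Look up a template using the name exactly as written in the page."""
    return templates.get("Template:" + name)


# "also" becomes "Template:Also", which is not a key in the store.
print(lookup_normalized("also"))
# Without normalization the stored key matches.
print(lookup_raw("also"))
# A name that already starts with an upper-case letter works either way.
print(lookup_normalized("PIE root"))
```

If this is what is happening, a failed lookup would expand to an empty string, which matches the blanks in the extracted text.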

Jul 17 '18 01:07 mhagiwara

Thanks for the reply. I do still have the issue and have since moved on to a different technique to gather this data from Wiktionary (scrapy + bs4). If I get a chance, I'll check out your recommendation. This would be a much better option if it does work.

Jul 17 '18 04:07 dnishiyama

I am also encountering this problem on the July 20th, 2018 English Wikipedia dump. Here was my command:

python WikiExtractor.py --o 'articles/' --templates 'templates.temp' --filter_disambig_pages --json 'enwiki.xml'

Here is an example of an incorrectly extracted sentence from Wikipedia Page ID 12.

WikiExtractor Output: The word "anarchism" is composed from the word "anarchy" and the suffix -ism, themselves derived respectively from the Greek , i.e. "anarchy" (from , "anarchos", meaning "one without rulers"; from the privative prefix ἀν- ("an-", i.e. "without") and , "archos", i.e. "leader", "ruler"; (cf. "archon" or , "arkhē", i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix or ("-ismos", "-isma", from the verbal infinitive suffix , "-izein").

Real Wikipedia Value: The word "anarchism" is composed from the word "anarchy" and the suffix -ism, themselves derived respectively from the Greek ἀναρχία, i.e. anarchy (from ἄναρχος, anarchos, meaning "one without rulers"; from the privative prefix ἀν- (an-, i.e. "without") and ἀρχός, archos, i.e. "leader", "ruler"; (cf. archon or ἀρχή, arkhē, i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix -ισμός or -ισμα (-ismos, -isma, from the verbal infinitive suffix -ίζειν, -izein).

I've also found other types of template expansions missing, such as distance measurements.
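For anyone who wants to gauge how widespread this is in their own output, here is a rough heuristic sketch. It is not part of WikiExtractor, and the pattern and the function name `count_gaps` are my own assumptions: it counts places where whitespace sits directly before punctuation (as in "the Greek , i.e."), which is what a template that expanded to nothing tends to leave behind. It will miss some cases and can flag legitimate text.

```python
import re

# Whitespace directly before , ; . or ), or an opening parenthesis
# directly followed by , ; or . (both suggest a dropped expansion).
GAP = re.compile(r"\s+[,;.)]|\(\s*[,;.]")


def count_gaps(text):
    """Return the number of suspicious gaps found in text."""
    return len(GAP.findall(text))


broken = 'derived respectively from the Greek , i.e. "anarchy" (from , "anarchos"'
intact = "derived respectively from the Greek ἀναρχία, i.e. anarchy (from ἄναρχος, anarchos"

print(count_gaps(broken))
print(count_gaps(intact))
```

Comparing the counts between two extractor versions on the same pages should give a quick, if crude, signal of whether expansion has improved.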

Aug 03 '18 22:08 KylePiira

It seems that the template expansions don't work well now. I found a lot of wrongly parsed text in the output.

Dec 14 '18 06:12 wanicca

Hi,

I found an old version at http://medialab.di.unipi.it/wiki/Wikipedia_Extractor. It works well.

You need to use python2 to run it.

Thanks!

Nov 18 '19 20:11 chaojiang06

@mhagiwara Hi, thank you for your suggestion! I tried disabling the ucfirst function so that the string is left unchanged, but it still doesn't work.

Would you mind sharing the updated code on GitHub? I would really appreciate it.

Thank you!

Nov 18 '19 20:11 chaojiang06
