Assorted dumpgenerator.py failures with some Miraheze (MediaWiki 1.39.3) wikis
Titles saved at... bigforestmirahezeorg_w-20230617-titles.txt
18795 page titles loaded https://bigforest.miraheze.org/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
42 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2135, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "dumpgenerator.py", line 742, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session, start=start):
  File "dumpgenerator.py", line 843, in getXMLRevisions
    for page in arvrequest['query']['allrevisions']:
UnboundLocalError: local variable 'arvrequest' referenced before assignment
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
Not sure what's special about this wiki https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84
Maybe it was just an occasional error.
I tried another wiki (distrowiki.miraheze.org) and nothing went wrong. Maybe it was an occasional error, an issue related to the Python version, or something else.
I think I got an HTTP 429 error, but we catch it and just proceed as if nothing happened:
while True:
    try:
        arvrequest = site.api(http_method=config['http_method'], **arvparams)
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 405 and config['http_method'] == "POST":
            print("POST request to the API failed, retrying with GET")
            config['http_method'] = "GET"
            continue
We should ideally implement a retry mechanism as we have in getXMLPage(), to avoid endless loops.
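A minimal sketch of such a retry mechanism, assuming a plain `requests` session; the function name, `max_retries`, and the backoff step here are illustrative, not what dumpgenerator.py actually uses:

```python
import time
import requests

def api_request_with_retry(session, url, params, max_retries=5, backoff=20):
    """Retry an API request with linear backoff instead of looping forever.

    Gives up after max_retries attempts by re-raising the last error,
    so the dump can be resumed later instead of spinning endlessly.
    """
    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(url, params=params, timeout=30)
            r.raise_for_status()
            return r.json()
        except requests.exceptions.RequestException as e:
            if attempt == max_retries:
                raise  # bubble the error up instead of retrying forever
            wait = backoff * attempt
            print("Attempt %d failed (%s); waiting %d seconds" % (attempt, e, wait))
            time.sleep(wait)
```

With a bound like this, transient 429s/502s get a few spaced-out retries, and a persistently failing request surfaces as an exception instead of an infinite loop.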
Image filenames and URLs saved at... denbagovmirahezeorg_w-20230618-images.txt
Retrieving images from "start"
Creating "./denbagovmirahezeorg_w-20230618-wikidump-2/images" directory
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2147, in createNewDump
    session=other['session'])
  File "dumpgenerator.py", line 1524, in generateImageDump
    r = session.get(config['api'] + u"?action=query&export&exportnowrap&titles=%s" % urllib.quote(title))
  File "/usr/lib/python2.7/urllib.py", line 1306, in quote
    return ''.join(map(quoter, s))
KeyError: u'\u0420'
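The `KeyError: u'\u0420'` (Cyrillic "Р") is Python 2's `urllib.quote` choking on a raw unicode title; encoding the title to UTF-8 before quoting avoids it, and Python 3's `urllib.parse.quote` does that encoding by default. A sketch (the title below is a made-up example, not the actual failing one):

```python
from urllib.parse import quote  # Python 3; the Python 2 equivalent is urllib.quote

# A non-ASCII title of the kind that crashed the image dump
# (hypothetical example for illustration).
title = u"Файл:Пример.png"

# On Python 2, urllib.quote(title) raises KeyError for characters
# outside Latin-1; quoting the UTF-8 bytes works on both versions.
encoded = quote(title.encode("utf-8"))
print(encoded)  # percent-encoded UTF-8, e.g. %D0%A4...
```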
tail: cannot open 'denbagovmirahezeorg_w-20230618-wikidump/denbagovmirahezeorg_w-20230618-history.xml' for reading: No such file or directory
I don't understand the HTTP 502 errors
Analysing https://ubrwiki.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Retrieving image filenames
...................................... Found 1851 images
1851 image names loaded
Image filenames and URLs saved at... ubrwikimirahezeorg_w-20230618-images.txt
Retrieving images from "start"
Creating "./ubrwikimirahezeorg_w-20230618-wikidump/images" directory
Downloaded 10 images
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 1, XML for "Image:1,00_M$.png" is wrong. Waiting 20 seconds and reloading...
Downloaded 20 images
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 1, XML for "Image:1900.png" is wrong. Waiting 20 seconds and reloading...
Downloaded 30 images
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 1, XML for "Image:2_turno.png" is wrong. Waiting 20 seconds and reloading...
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 2, XML for "Image:2_turno.png" is wrong. Waiting 40 seconds and reloading...
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 3, XML for "Image:2_turno.png" is wrong. Waiting 60 seconds and reloading...
Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
In attempt 4, XML for "Image:2_turno.png" is wrong. Waiting 80 seconds and reloading...
HTTP Error 502.
Server error, max retries exceeded.
Please resume the dump later.
https://ubrwiki.miraheze.org/w/index.php?action=submit&curonly=1&limit=1&pages=Image%3A20M%24.png&title=Special%3AExport
ouch
Trying to export all revisions from namespace 2303
Trying to get wikitext from the allrevisions API and to build the XML
XML dump saved at... avidwiki_w-20230620-history.xml
Retrieving image filenames
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................HTTP Error 429.
Server error, max retries exceeded.
Please resume the dump later.
https://www.avid.wiki/w/api.php?aiprop=url%7Cuser&format=json&aifrom=WBRZ_2013.png&list=allimages&ailimit=50&action=query
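Rather than aborting on the first 429, the session could be configured to retry throttled and gateway-error responses automatically; a sketch using `urllib3`'s `Retry` (the status list, retry count, and backoff factor are assumptions, not dumpgenerator.py's current behaviour):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # Retry on throttling (429) and gateway errors (502/503/504),
    # honouring any Retry-After header the server sends.
    retry = Retry(
        total=5,
        backoff_factor=2,  # sleeps grow between attempts
        status_forcelist=[429, 502, 503, 504],
        respect_retry_after_header=True,
        allowed_methods=["GET", "POST"],
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session
```

This keeps the retry policy out of the request loop entirely: every `session.get()` through this session gets the same backoff behaviour.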
Changed directory to /mnt/at/wikiteam/avidwiki_w-20230620-wikidump
606332
606332
606332
https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84
Not reproduced in the latest MW-Scraper.
Trying to export all revisions from namespace -1 (magic number refers to "all")
Trying to get wikitext from the allrevisions API and to build the XML
틀:동음이의, 30 edits (--xmlrevisions)
틀:반대, 1 edits (--xmlrevisions)
틀:찬성, 1 edits (--xmlrevisions)
틀:의견, 4 edits (--xmlrevisions)
틀:삭제, 4 edits (--xmlrevisions)
틀:유지, 2 edits (--xmlrevisions)
틀:이동, 1 edits (--xmlrevisions)
틀:넘겨주기, 1 edits (--xmlrevisions)
틀:중립, 1 edits (--xmlrevisions)
틀:병합, 1 edits (--xmlrevisions)
틀:질문, 1 edits (--xmlrevisions)
틀:분할, 1 edits (--xmlrevisions)
......
If `e.response.status_code == 405 and config['http_method'] == "POST"` is False, the `continue` is never reached: the exception is swallowed, `arvrequest` is never assigned, and the later `for page in arvrequest['query']['allrevisions']:` raises the UnboundLocalError shown above.
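A sketch of the fix, rewriting the loop as a bounded function with a re-raise so the exception can no longer fall through silently (`api_call` stands in for `site.api`, and `max_retries` is an assumption):

```python
import requests

def fetch_allrevisions(api_call, config, max_retries=5):
    """Sketch of the getXMLRevisions() request loop with the fall-through fixed.

    api_call stands in for site.api(); only the http_method kwarg is modelled.
    """
    for attempt in range(max_retries):
        try:
            return api_call(http_method=config["http_method"])
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 405 and config["http_method"] == "POST":
                print("POST request to the API failed, retrying with GET")
                config["http_method"] = "GET"
                continue
            # Previously this path fell through and left arvrequest unbound;
            # re-raising makes any other HTTP error explicit.
            raise
    raise RuntimeError("max retries exceeded")
```

The caller now either gets a result or a real exception; there is no code path that leaves the response variable unassigned.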