Python 3 support for dumpgenerator.py
This should add Python 3 support to dumpgenerator.py without breaking Python 2 behavior.
Thanks for this patch :) It worked for me, too bad upstream is inactive...
Doron Behar, 15/10/19 16:49:
Thanks for this patch :) It worked for me, too bad upstream is inactive...
Thank you for testing! Can you also test whether it makes some of the Unicode bugs better or worse? As long as the tests are broken I avoid merging things until the next time I'm actively using dumpgenerator, but the bug reports offer plenty of test cases. :)
I think your CI checks fail because of the Python version in Travis...
And I'm not sure what Unicode bugs you are referring to.
Now I see:
I tried to resume a previous download session and loadConfig failed every time. I couldn't figure out why until I did this:
@@ -1395,12 +1397,12 @@ def domain2prefix(config={}, session=None):
 def loadConfig(config={}, configfilename=''):
     """ Load config file """
-    try:
-        with open('%s/%s' % (config['path'], configfilename), 'r') as infile:
-            config = pickle.load(infile)
-    except:
-        print ('There is no config file. we can\'t resume. Start a new dump.')
-        sys.exit()
+    # try:
+    with open('%s/%s' % (config['path'], configfilename), 'r') as infile:
+        config = pickle.load(infile)
+    # except:
+    #     print ('There is no config file. we can\'t resume. Start a new dump.')
+    #     sys.exit()
     return config
And I got this error:
Traceback (most recent call last):
  File "./dumpgenerator.py", line 2359, in <module>
    main()
  File "./dumpgenerator.py", line 2343, in main
    config = loadConfig(config=config, configfilename=configfilename)
  File "./dumpgenerator.py", line 1402, in loadConfig
    config = pickle.load(infile)
  File "/nix/store/swy0p01xr0wyh907d67hkxr1g0kngcpn-python3-3.7.4/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
It took me a while to trace it down, naturally, because a "catch all" except statement was used and so the error message was misleading: the file was there, it just couldn't be decoded. See this QA.
This QA says to use rb instead of r...
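For illustration, here is a minimal sketch of what that could look like. It is not the code from the patch: load_config is a hypothetical stand-in for loadConfig, and the choice of which exceptions to catch is my assumption. The file is opened in binary mode (pickle protocol 2 and later start with the byte 0x80, which is what the UTF-8 decoder choked on above), and only a failure to open or read the file is reported as "no config file", so any other error keeps its real traceback.

```python
import pickle
import sys


def load_config(path, configfilename):
    """Load a pickled config file (hypothetical variant of loadConfig)."""
    filename = '%s/%s' % (path, configfilename)
    try:
        # Pickle data is binary: 'rb' stops Python 3 from trying to
        # decode the bytes as UTF-8 before pickle ever sees them.
        with open(filename, 'rb') as infile:
            return pickle.load(infile)
    except (IOError, OSError) as e:
        # Only a file that is missing or unreadable ends up here; unpickling
        # and decoding errors propagate with their own traceback.
        print('Cannot read config file %s (%s). We can\'t resume. '
              'Start a new dump.' % (filename, e))
        sys.exit(1)
```

Whether a config pickled under Python 2 then loads cleanly under Python 3 is a separate question that this sketch does not address.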
Doron Behar, 15/10/19 19:01:
See this QA
Yes, clearly it's not ideal to catch all exceptions. It's just one of many hacky shortcuts taken to be able to finish running dumpgenerator on tens of thousands of wikis (https://archive.org/details/wikiteam). We need help to fix, and most importantly test, the underlying issues on thousands of wikis.
I've started testing this, but it's a can of worms. We need to test various kinds of inputs, but a lot of failures are surfaced even with a single wiki, with a single launch or XML/image resumption attempt. Also, wikitools and reverse_readlines don't like python3, while pickle doesn't like strings. Hmpf.
I'm using Python 3.7.6, by the way.
And yes, there are some files which need to be opened in binary mode for the way this was written, plus there are some errors from concatenating bytes with non-bytes. I'm not entirely sure what your intention was.
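To make the bytes vs. non-bytes problem concrete, here is a small sketch of one common way to handle it: decode at the boundary and only ever concatenate text. The to_text helper and the sample values are hypothetical, not taken from dumpgenerator.py, and UTF-8 is an assumed encoding.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# On Python 3, u'Title: ' + b'Main_Page' raises TypeError;
# normalising both sides to text first avoids that.
title = to_text(b'Main_Page')
line = to_text(u'Title: ') + title
```

The same idea applies in the other direction: when writing to a file opened in binary mode, encode the text once, right before the write, rather than mixing the two types along the way.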
On the other hand, this rather simplistic change mostly works for me: https://github.com/nemobis/wikiteam/commit/bcecfa224d089467be4c6ee0e61108269e45c0d0
see also https://github.com/mediawiki-client-tools/mediawiki-scraper
via https://wiki.archiveteam.org/index.php?title=WikiTeam