Python 3 support for dumpgenerator.py

Open TimSC opened this issue 7 years ago • 9 comments

This should add Python 3 support to dumpgenerator.py without breaking Python 2 behavior.

TimSC avatar Dec 03 '18 20:12 TimSC

Thanks for this patch :) It worked for me, too bad upstream is inactive...

doronbehar avatar Oct 15 '19 13:10 doronbehar

Doron Behar, 15/10/19 16:49:

Thanks for this patch :) It worked for me, too bad upstream is inactive...

Thank you for testing! Can you also test whether it makes some of the Unicode bugs better or worse? As long as the tests are broken I avoid merging things until the next time I'm actively using dumpgenerator, but the bug reports offer plenty of test cases. :)

nemobis avatar Oct 15 '19 14:10 nemobis

I think your CI checks fail because of the Python version in Travis...

And I'm not sure which Unicode bugs you are referring to.

doronbehar avatar Oct 15 '19 14:10 doronbehar

Now I see:

I tried to resume a previous download session and loadConfig kept failing. I couldn't figure out why until I did this:

@@ -1395,12 +1397,12 @@ def domain2prefix(config={}, session=None):
 def loadConfig(config={}, configfilename=''):
     """ Load config file """

-    try:
-        with open('%s/%s' % (config['path'], configfilename), 'r') as infile:
-            config = pickle.load(infile)
-    except:
-        print ('There is no config file. we can\'t resume. Start a new dump.')
-        sys.exit()
+    #  try:
+    with open('%s/%s' % (config['path'], configfilename), 'r') as infile:
+        config = pickle.load(infile)
+    #  except:
+        #  print ('There is no config file. we can\'t resume. Start a new dump.')
+        #  sys.exit()

     return config

And I got this error:

Traceback (most recent call last):
  File "./dumpgenerator.py", line 2359, in <module>
    main()
  File "./dumpgenerator.py", line 2343, in main
    config = loadConfig(config=config, configfilename=configfilename)
  File "./dumpgenerator.py", line 1402, in loadConfig
    config = pickle.load(infile)
  File "/nix/store/swy0p01xr0wyh907d67hkxr1g0kngcpn-python3-3.7.4/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

It took me a while to trace this down, because a "catch all" except statement was used, so the error message was misleading: the file was there. See this QA.
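To illustrate what seems to be going on, here is a small reproduction sketch. It assumes the config file was written with a binary pickle protocol (2 or later), whose output starts with the byte 0x80; the file name and the dictionary contents are made up for the example and are not taken from dumpgenerator.py.

```python
import os
import pickle
import tempfile

# Hypothetical config written with a binary pickle protocol.
path = os.path.join(tempfile.mkdtemp(), 'config.txt')
with open(path, 'wb') as outfile:
    pickle.dump({'api': 'http://example.org/w/api.php'}, outfile, protocol=2)

# Binary pickles start with the PROTO opcode, byte 0x80.
with open(path, 'rb') as infile:
    first_byte = infile.read(1)

# Opening the same file in text mode makes Python 3 decode it as UTF-8,
# which fails on 0x80. A bare "except:" would hide this behind the
# "There is no config file" message.
error_name = None
try:
    with open(path, 'r', encoding='utf-8') as infile:
        pickle.load(infile)
except Exception as err:
    error_name = type(err).__name__

print(first_byte, error_name)
```

So the file exists and is readable; it is the text-mode decoding that fails.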

doronbehar avatar Oct 15 '19 16:10 doronbehar

This QA says to use rb instead of r...
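A minimal sketch of what that could look like, assuming Python 3 only (FileNotFoundError does not exist in Python 2). The name load_config and the exit code are my own choices for illustration, not the actual patch:

```python
import os
import pickle
import sys


def load_config(config, configfilename):
    """Load a pickled config file, exiting only if the file is missing."""
    path = os.path.join(config['path'], configfilename)
    try:
        # 'rb' instead of 'r': pickle data is bytes, not UTF-8 text.
        with open(path, 'rb') as infile:
            return pickle.load(infile)
    except FileNotFoundError:
        # Only the "file is not there" case gets the friendly message;
        # any other error (e.g. a corrupt pickle) still shows a traceback.
        print("There is no config file. We can't resume. Start a new dump.")
        sys.exit(1)
```

Catching only the missing-file case keeps the original message while letting unexpected errors surface with their real traceback.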

doronbehar avatar Oct 15 '19 16:10 doronbehar

Doron Behar, 15/10/19 19:01:

See this QA

Yes, clearly it's not ideal to catch all exceptions. It's just one of many hacky shortcuts taken to be able to finish running dumpgenerator on tens of thousands of wikis (https://archive.org/details/wikiteam). We need help to fix, and most importantly test, the underlying issues on thousands of wikis.

nemobis avatar Oct 15 '19 16:10 nemobis

I've started testing this, but it's a can of worms. We need to test various kinds of inputs, but a lot of failures are surfaced even with a single wiki, with a single launch or XML/image resumption attempt. Also, wikitools and reverse_readlines don't like python3, while pickle doesn't like strings. Hmpf.

I'm using Python 3.7.6, by the way.

And yes, some files need to be opened in binary mode given the way this was written, and there are some errors where bytes are concatenated with non-bytes. I'm not entirely sure what your intention was.
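For the bytes vs. non-bytes concatenation, the kind of failure and one possible workaround look roughly like this. The helper to_text and the sample title are hypothetical, just to show the pattern; they are not from dumpgenerator.py:

```python
def to_text(value, encoding='utf-8'):
    """Return a str whether the input is bytes or already str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Python 3 refuses to concatenate str and bytes.
mixed_error = None
try:
    '<title>' + b'Main Page'
except TypeError as err:
    mixed_error = err

# Decoding at the boundary keeps everything as str afterwards.
line = '<title>' + to_text(b'Main Page') + '</title>'
print(line)
```

Whether to normalize to str or to bytes is a design choice; the sketch decodes to str on input and would encode again only when writing in binary mode.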

nemobis avatar Feb 08 '20 01:02 nemobis

On the other hand, this rather simplistic change mostly works for me: https://github.com/nemobis/wikiteam/commit/bcecfa224d089467be4c6ee0e61108269e45c0d0

nemobis avatar Feb 08 '20 01:02 nemobis

See also https://github.com/mediawiki-client-tools/mediawiki-scraper

via https://wiki.archiveteam.org/index.php?title=WikiTeam

milahu avatar Jun 30 '23 20:06 milahu