wikiteam
UnicodeWarning and UnicodeEncodeError issues
Simple incompatibility between old image list and current master, or something more?
Resuming download, using directory eswikiarquitecturacom-20140628-wikidump
[...]
You didn't provide a path for index.php, we try this one: http://es.wikiarquitectura.com/index.php
Checking api.php... http://es.wikiarquitectura.com/api.php
api.php is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
Analysing http://es.wikiarquitectura.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
./dumpgenerator.py:1232: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if filename2 not in listdir:
The script now reads the image list file as unicode, and it compares those names with the output of os.listdir(), which returns non-unicode (byte) strings. I don't think it is serious, but I can check it tomorrow.
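A minimal sketch of the comparison described above, with both sides converted to text first so that the membership test cannot fail on mixed types. The helper names to_text and missing_files are hypothetical, not functions from dumpgenerator.py, and the sketch assumes the file names on disk are UTF-8 encoded.

```python
# -*- coding: utf-8 -*-
# Python 2 calls the text type "unicode"; Python 3 calls it "str".
text_type = type(u"")


def to_text(name, encoding="utf-8"):
    """Return name as text, decoding byte strings with the given encoding."""
    if isinstance(name, text_type):
        return name
    # Assumption: byte-string names are encoded as UTF-8.
    return name.decode(encoding)


def missing_files(image_names, dir_entries, encoding="utf-8"):
    """List the image names that have no matching entry in the directory listing."""
    present = set(to_text(entry, encoding) for entry in dir_entries)
    return [name for name in image_names
            if to_text(name, encoding) not in present]
```

Here image_names stands for the names read from the image list and dir_entries for the result of os.listdir() on the images directory.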
Ok. The dump is proceeding, I'll check at the end if some image is missing. (Update: I forgot to count them, there is a big dump at https://archive.org/details/wiki-eswikiarquitecturacom though.)
There are some more, despite https://github.com/WikiTeam/wikiteam/pull/124, on wikihow.com with the latest master:
Downloaded 30 pages
"Hit" Someone on Pandanda, 0 edits
"Hog Flip" in Halo, 0 edits
File "dumpgenerator.py", line 1503, in
Can you reproduce this error still? The one you mentioned in the last comment has already been fixed. Not sure about the original one.
Can't reproduce now either. Though the original comment might have been about an image list produced with one version of dumpgenerator and then used with another, incompatible one.
federico@lakka:~/siilo/wikiteam/wikiteam$ python dumpgenerator.py --api=http://es.wikiarquitectura.com/api.php --xml --namespaces=8 --images
Checking API... http://es.wikiarquitectura.com/api.php
API is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/WikiTeam/wikiteam #
#########################################################################
#########################################################################
# Copyright (C) 2011-2014 WikiTeam #
# This program is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details. #
# #
# You should have received a copy of the GNU General Public License #
# along with this program. If not, see <http://www.gnu.org/licenses/>. #
#########################################################################
Analysing http://es.wikiarquitectura.com/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = 8
Excluding titles from namespaces = None
1 namespaces found
Retrieving titles in the namespace 8
. 5 titles retrieved in the namespace 8
5 page titles loaded
Titles saved at... eswikiarquitecturacom-20140919-titles.txt
Retrieving the XML for every page from "start"
MediaWiki:Common.css, 8 edits
MediaWiki:Mainpage, 1 edit
MediaWiki:Newarticletext, 1 edit
MediaWiki:Sidebar, 1 edit
MediaWiki:Sitenotice, 1 edit
XML dump saved at... eswikiarquitecturacom-20140919-history.xml
Retrieving image filenames
.................................................................... Found 33592 images
33592 image names loaded
Image filenames and URLs saved at... eswikiarquitecturacom-20140919-images.txt
Retrieving images from "start"
Creating "./eswikiarquitecturacom-20140919-wikidump/images" directory
Downloaded 10 images
^CTraceback (most recent call last):
File "dumpgenerator.py", line 1602, in <module>
main()
File "dumpgenerator.py", line 1594, in main
createNewDump(config=config, other=other)
File "dumpgenerator.py", line 1288, in createNewDump
generateImageDump(config=config, other=other, images=images, session=other['session'])
File "dumpgenerator.py", line 869, in generateImageDump
filename), session=session) # use Image: for backwards compatibility
File "dumpgenerator.py", line 377, in getXMLFileDesc
return getXMLPage(config=config, title=title, verbose=False, session=session)
File "dumpgenerator.py", line 472, in getXMLPage
xml = getXMLPageCore(params=params, config=config, session=session)
File "dumpgenerator.py", line 440, in getXMLPageCore
r = session.post(url=config['index'], data=params, headers=headers)
File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 498, in post
return self.request('POST', url, data=data, **kwargs)
File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
resp = self.send(prep, **send_kwargs)
File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
r = adapter.send(request, **kwargs)
File "/home/users/federico/.local/lib/python2.7/site-packages/requests/adapters.py", line 327, in send
timeout=timeout
File "/home/users/federico/.local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 493, in urlopen
body=body, headers=headers)
File "/home/users/federico/.local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 319, in _make_request
httplib_response = conn.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline()
File "/usr/lib/python2.7/socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
federico@lakka:~/siilo/wikiteam/wikiteam$ python dumpgenerator.py --api=http://es.wikiarquitectura.com/api.php --xml --namespaces=8 --images
Checking API... http://es.wikiarquitectura.com/api.php
API is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
[...]
Analysing http://es.wikiarquitectura.com/api.php
Warning!: "./eswikiarquitecturacom-20140919-wikidump" path exists
There is a dump in "./eswikiarquitecturacom-20140919-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./eswikiarquitecturacom-20140919-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? y
You have selected: YES
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
17 images were found in the directory from a previous session
Retrieving images from "00 centro kimmel.jpg"
Downloaded 10 images
Analysing http://africanspecies.net/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "불활성화 백신"
Retrieving the XML for every page from "불활성화 백신"
./dumpgenerator.py:624: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if title == start: # start downloading from start, included
XML dump saved at... africanspeciesnet-20141127-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
. Found 337 images
I'm also wondering whether resume works... it would be terrible if the bug makes us "close" incomplete dumps.
Analysing http://wiki.megatec.ru/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Мастер-Web:Установка версии 7.2"
Retrieving the XML for every page from "Мастер-Web:Установка версии 7.2"
./dumpgenerator.py:624: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if title == start: # start downloading from start, included
XML dump saved at... wikimegatecru-20141203-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
........ Found 3722 images
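In both logs the warning fires on the "title == start" comparison. On Python 2, a comparison that fails this way evaluates as unequal, so the resume title would never match. The sketch below shows one way to make that case loud instead of silent: it converts both sides to text and raises if the resume title is never found. The function titles_from is hypothetical and is not the code at line 624; UTF-8 input is an assumption.

```python
# -*- coding: utf-8 -*-
text_type = type(u"")  # "unicode" on Python 2, "str" on Python 3


def _as_text(value, encoding="utf-8"):
    """Decode byte strings; pass text through unchanged."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


def titles_from(titles, start, encoding="utf-8"):
    """Yield titles from start (included); raise if start never matches."""
    start = _as_text(start, encoding)
    found = False
    for title in titles:
        title = _as_text(title, encoding)
        if not found and title == start:
            found = True
        if found:
            yield title
    if not found:
        # Refuse to report a finished dump when nothing was resumed.
        raise ValueError("resume title not found in title list: %r" % (start,))
```

Because this is a generator, the ValueError is raised only once the title list has been fully consumed.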
Sorry if this is bad etiquette (I'm new), but I was wondering if there was any update on this? I'm getting a UnicodeEncodeError. Whenever I run
python dumpgenerator.py --api=http://ark.gamepedia.com/api.php --xml --curonly --images --delay 5 --resume --path=arkgamepediacom-20150717-wikidump/
I get the following results:
Analysing http://ark.gamepedia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
195 images were found in the directory from a previous session
Retrieving images from "Campfire.png"
Sleeping... 5 seconds...
Sleeping... 5 seconds...
Sleeping... 5 seconds...
Traceback (most recent call last):
File "dumpgenerator.py", line 2031, in <module>
main()
File "dumpgenerator.py", line 2021, in main
resumePreviousDump(config=config, other=other)
File "dumpgenerator.py", line 1745, in resumePreviousDump
session=other['session'])
File "dumpgenerator.py", line 1071, in generateImageDump
imagefile = open(filename3, 'wb')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 53: ordinal not in range(128)
I'm using the most recent dumpgenerator.py as of this writing.
Hello DrDevice. This bug still needs a fix. As a workaround, you can remove the offending image filename from the -images.txt file in the dump directory, and then resume. According to that wiki, it is "Capture d'écran 2015-06-13 11.20.59.png". If you find more errors, remove those filenames too, but I don't see any more weird characters in the list.
http://ark.gamepedia.com/index.php?title=Special%3APrefixIndex&prefix=&namespace=6
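The manual workaround above could be scripted roughly as follows. This is only a sketch: it assumes each line of the -images.txt file is tab-separated with the file name in the first column, and split_ascii_entries is a hypothetical helper, not part of dumpgenerator.py.

```python
# -*- coding: utf-8 -*-


def split_ascii_entries(lines):
    """Split image-list lines into (kept, removed) by pure-ASCII file name.

    Each line must be a text string. Assumption: the file name is the
    first tab-separated field of the line.
    """
    kept, removed = [], []
    for line in lines:
        filename = line.split(u"\t", 1)[0]
        try:
            filename.encode("ascii")
        except UnicodeEncodeError:
            removed.append(line)
        else:
            kept.append(line)
    return kept, removed
```

The lines can be read with io.open(path, encoding="utf-8") and the kept ones written back the same way, after making a backup of the original file. The removed entries will be absent from the finished dump, so it is worth saving that list somewhere.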
emijrp, thank you very much! That seems to have cleared it up! It's been trucking on for a couple hours now, no errors. Crossing my fingers! :)
This is still an issue. I tried the patches from #279, but they didn't help.
I recently ran into the same issue with a similar message but for another part of the script.
The decode statement at https://github.com/WikiTeam/wikiteam/blob/master/dumpgenerator.py#L1999 was raising an exception. The script treated that exception as if the image folder had not been found, so resuming the dump re-downloaded all the images for no good reason. This line should probably be modified to distinguish a non-existent directory from any other exception.
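One possible shape for that distinction, as a sketch only: treat "directory does not exist" as the single case that means no previous images, and let every other error propagate. The function list_image_dir is hypothetical and does not reproduce the code at the linked line.

```python
import errno
import os


def list_image_dir(path):
    """Return the directory entries, or None only if the directory is missing."""
    try:
        return os.listdir(path)
    except OSError as error:
        if error.errno == errno.ENOENT:
            # The directory really is absent: nothing was downloaded before.
            return None
        # Permission problems, "not a directory", etc. are real errors.
        raise
```

Any decoding of the returned names would happen outside this try block, so a UnicodeEncodeError or UnicodeDecodeError surfaces as itself instead of being mistaken for a missing folder.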
Anyways, the exception thrown was:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xxxx' in position YY: ordinal not in range(128)
It turns out the cause was that Python 2.7 uses 'ascii' as the default encoding in the sys module, as shown by
python -c 'import sys; print(sys.getdefaultencoding())'
This was fixed by modifying /usr/lib/python2.7/sitecustomize.py to add the following lines, which force a UTF-8 default encoding in the Python 2.7 environment:
import sys
sys.setdefaultencoding('UTF8')
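Before changing anything globally, it can help to see which encodings the interpreter is actually using. The sketch below prints the default encoding checked above together with the filesystem and locale encodings. All three calls are standard library functions, but encoding_report is a hypothetical helper, and which of these encodings is involved in a given traceback is not established in this thread.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that can affect implicit str/unicode conversion."""
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("%s: %s" % (name, value))
```

Note that site.py removes sys.setdefaultencoding from the sys module once startup has finished, which is why the two lines above have to live in sitecustomize.py rather than in the script itself.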
@ouaibe Thanks for the tip, I thought it must've been a bug in wikiteam. They should be able to set this somewhere themselves, right?
I'd like to pile on and say that I've also stumbled upon this issue or a similar one:
$ python ../wikidump/wikiteam/dumpgenerator.py "https://minecraft-de.gamepedia.com/" --xml --images
[...]
Downloaded 5600 images
Downloaded 5610 images
Downloaded 5620 images
Traceback (most recent call last):
File "../wikidump/wikiteam/dumpgenerator.py", line 2323, in <module>
main()
File "../wikidump/wikiteam/dumpgenerator.py", line 2313, in main
resumePreviousDump(config=config, other=other)
File "../wikidump/wikiteam/dumpgenerator.py", line 2030, in resumePreviousDump
session=other['session'])
File "../wikidump/wikiteam/dumpgenerator.py", line 1318, in generateImageDump
text=u'The page "%s" was missing in the wiki (probably deleted)' % (title.decode('utf-8'))
File "/home/wlhlm/vault/share/mc/wikidump/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 13: ordinal not in range(128)
Trying to resume, I'm hitting #250, meaning that dumpgenerator.py fails to detect previously downloaded images and starts from the beginning:
$ python ../wikidump/wikiteam/dumpgenerator.py "https://minecraft-de.gamepedia.com/" --xml --images --resume --path minecraft_degamepediacom-20190825-wikidump/
Checking API... https://minecraft-de.gamepedia.com/api.php
API is OK: https://minecraft-de.gamepedia.com/api.php
Checking index.php... https://minecraft-de.gamepedia.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3) #
# More info at: https://github.com/WikiTeam/wikiteam #
#########################################################################
#########################################################################
# Copyright (C) 2011-2019 WikiTeam developers #
# This program is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# This program is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details. #
# #
# You should have received a copy of the GNU General Public License #
# along with this program. If not, see <http://www.gnu.org/licenses/>. #
#########################################################################
Analysing https://minecraft-de.gamepedia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
0 images were found in the directory from a previous session
Retrieving images from "start"
Downloaded 10 images
^C
But, of course, resuming doesn't do a whole lot, since it will hit the same UnicodeEncodeError again.
The workaround described by @ouaibe worked. Editing sitecustomize.py and adding sys.setdefaultencoding('UTF8') was unproblematic because I was working in a virtualenv, but I'm not sure how well it would work when editing the global /usr/lib/python2.7/sitecustomize.py, since that can affect other Python scripts.
Python 2.7.16 dumpgenerator.py 080b723334127e7bfff97497a9aea75c97f310d5
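The traceback above ends in title.decode('utf-8') raising a UnicodeEncodeError. On Python 2 that happens when the title is already a unicode object, because calling decode on it first encodes it with the default 'ascii' codec. A local alternative to changing the interpreter-wide default would be to decode only byte strings. This is a sketch, not a patch from the repository: missing_page_text is a hypothetical helper that reuses the message from the traceback, and UTF-8 input is an assumption.

```python
# -*- coding: utf-8 -*-
text_type = type(u"")  # "unicode" on Python 2, "str" on Python 3


def missing_page_text(title, encoding="utf-8"):
    """Build the 'missing page' message without re-decoding text titles."""
    if not isinstance(title, text_type):
        # Only byte strings need decoding; text is used as-is.
        title = title.decode(encoding)
    return u'The page "%s" was missing in the wiki (probably deleted)' % (title,)
```

A guard like this leaves other Python scripts on the machine unaffected, which addresses the concern about editing the global sitecustomize.py.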