
UnicodeWarning and UnicodeEncodeError issues

Open nemobis opened this issue 10 years ago • 15 comments

Simple incompatibility between old image list and current master, or something more?

Resuming download, using directory eswikiarquitecturacom-20140628-wikidump
[...]
You didn't provide a path for index.php, we try this one: http://es.wikiarquitectura.com/index.php
Checking api.php... http://es.wikiarquitectura.com/api.php
api.php is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
Analysing http://es.wikiarquitectura.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
./dumpgenerator.py:1232: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if filename2 not in listdir:

nemobis avatar Jun 30 '14 21:06 nemobis

Now it reads the image list file as unicode and compares it with the output of os.listdir(), which does not return unicode. I don't think it is serious, but I can check it tomorrow.
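A minimal sketch of one way to avoid the mixed comparison, assuming the image names from the list are already text. In Python 2, os.listdir() returns unicode names when it is given a unicode path (names it cannot decode still come back as byte strings). The helper name missing_images is hypothetical and is not part of dumpgenerator.py.

```python
import os


def missing_images(wanted, images_dir, encoding='utf-8'):
    """Return the names in `wanted` that are not yet in `images_dir`.

    Hypothetical helper: `wanted` is assumed to hold text (unicode) names.
    """
    # Pass a text path so that os.listdir() yields text names as well.
    if isinstance(images_dir, bytes):
        images_dir = images_dir.decode(encoding)
    present = set(os.listdir(images_dir))
    return [name for name in wanted if name not in present]
```

With both sides as text, the membership test no longer depends on an implicit ASCII conversion.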

emijrp avatar Jun 30 '14 21:06 emijrp

Ok. The dump is proceeding, I'll check at the end if some image is missing. (Update: I forgot to count them, there is a big dump at https://archive.org/details/wiki-eswikiarquitecturacom though.)

nemobis avatar Jun 30 '14 21:06 nemobis

Some more despite https://github.com/WikiTeam/wikiteam/pull/124 , on wikihow.com with latest master:

    Downloaded 30 pages
    "Hit" Someone on Pandanda, 0 edits
    "Hog Flip" in Halo, 0 edits
  File "dumpgenerator.py", line 1503, in <module>
    main()
  File "dumpgenerator.py", line 1495, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1241, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "dumpgenerator.py", line 579, in generateXMLDump
    xml = getXMLPage(config=config, title=title, session=session)
  File "dumpgenerator.py", line 512, in getXMLPage
    print ' %s, %d edits' % (title, numberofedits)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 119: ordinal not in range(128)
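For illustration only, a sketch of how a progress line can be made safe to print on an ASCII-only terminal by escaping characters the target encoding lacks. The helper name printable is hypothetical, and this is not necessarily how the error was fixed in dumpgenerator.py.

```python
import sys


def printable(text, encoding=None):
    """Return `text` with characters the target encoding lacks escaped."""
    if encoding is None:
        # Fall back to ASCII when stdout reports no encoding (e.g. a pipe).
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'backslashreplace').decode(encoding)


title = u'Some title \u2013 with an en dash'  # made-up title for the demo
print(printable(u'    %s, %d edits' % (title, 3), 'ascii'))
```

The en dash is printed as an escaped \u2013 sequence instead of raising UnicodeEncodeError.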

nemobis avatar Jul 05 '14 18:07 nemobis

Can you reproduce this error still? The one you mentioned in the last comment has already been fixed. Not sure about the original one.

PiRSquared17 avatar Sep 19 '14 18:09 PiRSquared17

Can't reproduce now either. Though the original comment might have been about an image list produced with one version of dumpgenerator and then used with another, incompatible one.

federico@lakka:~/siilo/wikiteam/wikiteam$ python dumpgenerator.py --api=http://es.wikiarquitectura.com/api.php --xml --namespaces=8 --images  
Checking API... http://es.wikiarquitectura.com/api.php
API is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2014 WikiTeam                                      #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://es.wikiarquitectura.com/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = 8
Excluding titles from namespaces = None
1 namespaces found
    Retrieving titles in the namespace 8
.    5 titles retrieved in the namespace 8
5 page titles loaded
Titles saved at... eswikiarquitecturacom-20140919-titles.txt
Retrieving the XML for every page from "start"
    MediaWiki:Common.css, 8 edits
    MediaWiki:Mainpage, 1 edit
    MediaWiki:Newarticletext, 1 edit
    MediaWiki:Sidebar, 1 edit
    MediaWiki:Sitenotice, 1 edit
XML dump saved at... eswikiarquitecturacom-20140919-history.xml
Retrieving image filenames
....................................................................    Found 33592 images
33592 image names loaded
Image filenames and URLs saved at... eswikiarquitecturacom-20140919-images.txt
Retrieving images from "start"
Creating "./eswikiarquitecturacom-20140919-wikidump/images" directory
    Downloaded 10 images
^CTraceback (most recent call last):
  File "dumpgenerator.py", line 1602, in <module>
    main()
  File "dumpgenerator.py", line 1594, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 1288, in createNewDump
    generateImageDump(config=config, other=other, images=images, session=other['session'])
  File "dumpgenerator.py", line 869, in generateImageDump
    filename), session=session)  # use Image: for backwards compatibility
  File "dumpgenerator.py", line 377, in getXMLFileDesc
    return getXMLPage(config=config, title=title, verbose=False, session=session)
  File "dumpgenerator.py", line 472, in getXMLPage
    xml = getXMLPageCore(params=params, config=config, session=session)
  File "dumpgenerator.py", line 440, in getXMLPageCore
    r = session.post(url=config['index'], data=params, headers=headers)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 498, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/adapters.py", line 327, in send
    timeout=timeout
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 493, in urlopen
    body=body, headers=headers)
  File "/home/users/federico/.local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 319, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
KeyboardInterrupt
federico@lakka:~/siilo/wikiteam/wikiteam$ python dumpgenerator.py --api=http://es.wikiarquitectura.com/api.php --xml --namespaces=8 --images
Checking API... http://es.wikiarquitectura.com/api.php
API is OK
Checking index.php... http://es.wikiarquitectura.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.3.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2014 WikiTeam                                      #
# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing http://es.wikiarquitectura.com/api.php

Warning!: "./eswikiarquitecturacom-20140919-wikidump" path exists
There is a dump in "./eswikiarquitecturacom-20140919-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./eswikiarquitecturacom-20140919-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? y
You have selected: YES
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
17 images were found in the directory from a previous session
Retrieving images from "00 centro kimmel.jpg"
    Downloaded 10 images

nemobis avatar Sep 19 '14 19:09 nemobis

Analysing http://africanspecies.net/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "불활성화 백신"
Retrieving the XML for every page from "불활성화 백신"
./dumpgenerator.py:624: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if title == start: # start downloading from start, included
XML dump saved at... africanspeciesnet-20141127-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
. Found 337 images

nemobis avatar Dec 01 '14 12:12 nemobis

I'm also wondering whether resume works... it would be terrible if the bug makes us "close" incomplete dumps.

Analysing http://wiki.megatec.ru/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Мастер-Web:Установка версии 7.2"
Retrieving the XML for every page from "Мастер-Web:Установка версии 7.2"
./dumpgenerator.py:624: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if title == start: # start downloading from start, included
XML dump saved at... wikimegatecru-20141203-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
........ Found 3722 images
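To make the worry concrete: if the comparison title == start is always treated as unequal, a resume loop never finds its starting point and finishes without downloading anything, which would match the "XML dump saved at..." line appearing right after the warning. A hedged sketch of a resume loop that fails loudly instead, assuming titles and start are both already text; titles_from is a hypothetical name, not a function in dumpgenerator.py.

```python
def titles_from(titles, start):
    """Yield titles beginning at `start` (inclusive).

    Raises ValueError when `start` never matches, rather than silently
    yielding nothing.
    """
    resumed = False
    for title in titles:
        if not resumed and title == start:
            resumed = True
        if resumed:
            yield title
    if not resumed:
        raise ValueError('resume title not found: %r' % (start,))
```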

nemobis avatar Dec 04 '14 11:12 nemobis

Sorry if this is bad etiquette (I'm new), but is there any update on this? Whenever I run python dumpgenerator.py --api=http://ark.gamepedia.com/api.php --xml --curonly --images --delay 5 --resume --path=arkgamepediacom-20150717-wikidump/ I get a UnicodeEncodeError, with the following output:

Analysing http://ark.gamepedia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
195 images were found in the directory from a previous session
Retrieving images from "Campfire.png"
Sleeping... 5 seconds...
Sleeping... 5 seconds...
Sleeping... 5 seconds...
Traceback (most recent call last):
  File "dumpgenerator.py", line 2031, in <module>
    main()
  File "dumpgenerator.py", line 2021, in main
    resumePreviousDump(config=config, other=other)
  File "dumpgenerator.py", line 1745, in resumePreviousDump
    session=other['session'])
  File "dumpgenerator.py", line 1071, in generateImageDump
    imagefile = open(filename3, 'wb')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 53: ordinal not in range(128)

I'm using the most recent dumpgenerator.py as of this writing.

DrDevice avatar Jul 18 '15 07:07 DrDevice

Hello DrDevice. This bug still needs a fix. A workaround: you can remove the image filename from the -images.txt file in the dump directory, and then resume. According to that wiki, it is "Capture d'écran 2015-06-13 11.20.59.png". If you find more errors, remove those filenames too, but I don't see more weird chars in the list.

http://ark.gamepedia.com/index.php?title=Special%3APrefixIndex&prefix=&namespace=6
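The workaround above could be scripted roughly as follows. This is a sketch under assumptions: it treats the -images.txt file as one entry per line and drops every line containing a non-ASCII character, which is cruder than removing a single filename by hand. drop_non_ascii_lines is a hypothetical helper.

```python
def drop_non_ascii_lines(lines):
    """Split text lines into (kept, dropped) by whether they are pure ASCII."""
    kept, dropped = [], []
    for line in lines:
        try:
            line.encode('ascii')
        except UnicodeError:
            # Non-ASCII content: set aside so it can be reviewed later.
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped
```

The lines would be read as text first, for example with io.open(path, encoding='utf-8'), and it seems prudent to keep a backup of the original list so the dropped images can be fetched later.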

emijrp avatar Jul 18 '15 07:07 emijrp

emijrp, thank you very much! That seems to have cleared it up! It's been trucking on for a couple hours now, no errors. Crossing my fingers! :)

DrDevice avatar Jul 18 '15 12:07 DrDevice

This is still an issue. I've tried patches from #279, didn't help.

burner1024 avatar Mar 20 '17 11:03 burner1024

I recently ran into the same issue with a similar message but for another part of the script.

The decode statement at https://github.com/WikiTeam/wikiteam/blob/master/dumpgenerator.py#L1999 was raising an exception, which made the script conclude that the image folder didn't exist and forced the resumed dump to re-download all the images for no good reason. This line should probably be modified to distinguish a non-existent directory from other exceptions.
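A minimal sketch of that distinction, assuming the directory is read with os.listdir(); list_images_dir is a hypothetical name and the real code at that line may be structured differently. Only a missing directory is reported as "not found"; any other error propagates.

```python
import errno
import os


def list_images_dir(images_dir):
    """Return (found, names); found is False only if the directory is missing."""
    try:
        names = os.listdir(images_dir)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return False, []
        raise  # permission problems etc. should not look like "no directory"
    return True, names
```

Decoding errors raised later, while processing the returned names, would then surface as their own exceptions instead of triggering a full re-download.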

Anyways, the exception thrown was:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xxxx' in position YY: ordinal not in range(128)

It turns out this was because Python 2.7 uses 'ascii' as its default encoding, as shown by python -c 'import sys; print(sys.getdefaultencoding())'

This was fixed by modifying /usr/lib/python2.7/sitecustomize.py to add the following lines that force utf8 default encoding in the Python 2.7 environment.

import sys
sys.setdefaultencoding('UTF8')

ouaibe avatar Aug 15 '18 15:08 ouaibe

@ouaibe Thanks for the tip, I thought it must've been a bug in wikiteam. They should be able to set this somewhere themselves, right?

Slider-Whistle avatar Apr 14 '19 01:04 Slider-Whistle

I'd like to pile on and say that I've also stumbled upon this issue or a similar one:

$ python ../wikidump/wikiteam/dumpgenerator.py "https://minecraft-de.gamepedia.com/" --xml --images
[...]
    Downloaded 5600 images
    Downloaded 5610 images
    Downloaded 5620 images
Traceback (most recent call last):
  File "../wikidump/wikiteam/dumpgenerator.py", line 2323, in <module>
    main()
  File "../wikidump/wikiteam/dumpgenerator.py", line 2313, in main
    resumePreviousDump(config=config, other=other)
  File "../wikidump/wikiteam/dumpgenerator.py", line 2030, in resumePreviousDump
    session=other['session'])
  File "../wikidump/wikiteam/dumpgenerator.py", line 1318, in generateImageDump
    text=u'The page "%s" was missing in the wiki (probably deleted)' % (title.decode('utf-8'))
  File "/home/wlhlm/vault/share/mc/wikidump/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 13: ordinal not in range(128)
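On why a decode call ends in an encode error: in Python 2, calling .decode() on a value that is already unicode first encodes it implicitly with the default 'ascii' codec, which fails on non-ASCII characters. A hedged sketch of a guard, assuming the title may arrive as either bytes or text; as_text is a hypothetical helper and the title below is made up.

```python
def as_text(value, encoding='utf-8'):
    """Decode bytes to text; return text unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


title = u'\xe5land'  # already text, as the failing title appears to have been
text = u'The page "%s" was missing in the wiki (probably deleted)' % as_text(title)
```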

Trying to resume, I'm hitting #250, meaning that dumpgenerator.py fails to detect previously downloaded images and starts from the beginning:

$ python ../wikidump/wikiteam/dumpgenerator.py "https://minecraft-de.gamepedia.com/" --xml --images --resume --path minecraft_degamepediacom-20190825-wikidump/
Checking API... https://minecraft-de.gamepedia.com/api.php
API is OK: https://minecraft-de.gamepedia.com/api.php
Checking index.php... https://minecraft-de.gamepedia.com/index.php
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2019 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://minecraft-de.gamepedia.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
XML dump was completed in the previous session
Image list was completed in the previous session
0 images were found in the directory from a previous session
Retrieving images from "start"
    Downloaded 10 images
^C

But, of course, resuming doesn't do a whole lot since it will hit the same UnicodeEncodeError again.

The workaround described by @ouaibe worked. Editing sitecustomize.py and adding sys.setdefaultencoding('UTF8') was unproblematic because I was working in a virtualenv, but I'm not sure how well it would work when editing the global /usr/lib/python2.7/sitecustomize.py, since this can affect other Python scripts.

Python 2.7.16
dumpgenerator.py 080b723334127e7bfff97497a9aea75c97f310d5

wlhlm avatar Aug 25 '19 14:08 wlhlm