yamdwe
latin-1 character encoding error
After upgrading my source (see #31) I was able to import most of the mediawiki, but got stuck here:
Traceback (most recent call last):
File "./yamdwe.py", line 93, in
- Any way to find the "offending character" on the source page?
- Is this a source error or a yamdwe error?
0151 should be a simple 'i'
I think the character it's failing on is ő, found somewhere in a page name. Python 2 and Unicode are a bit weird: the \u escape actually takes hex digits, so it's code point U+0151 (0x0151 in UTF-16), not decimal 151.
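For reference, a small sketch of what that escape denotes. It is written for Python 3 (an assumption; the thread itself runs Python 2.7) and only shows that `\u0151` is a single character outside latin-1's range that takes two bytes in UTF-8.

```python
import unicodedata

ch = "\u0151"  # four hex digits: code point U+0151

# The escape names one character; print its code point and Unicode name.
print(hex(ord(ch)), unicodedata.name(ch))

# latin-1 only covers code points 0..255, so this character cannot be encoded.
try:
    ch.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

utf8_bytes = ch.encode("utf-8")  # two bytes in UTF-8
print(latin1_ok, utf8_bytes)
```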
Can you try the latest update and see if it works OK now?
I really wish mwlib supported Python 3. Unicode is a second-class citizen in Python 2, and I don't know all of the weird "gotchas", so these bugs keep cropping up!
If it still doesn't work, can you please tell me what version of Python you're using? Running python -V will output it.
Pulled from git. We are getting closer! :)
Traceback (most recent call last):
File "./yamdwe.py", line 93, in
In reality they are UTF-8 chars. For example: c3 b6 -> ö, c3 a1 -> á
Looks like it gets the info of utf-8 encoding, but tries to handle it as ascii.
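The byte pairs quoted above can be checked with a short sketch. It is written for Python 3 (an assumption; the thread runs Python 2.7) and decodes the same four bytes once as UTF-8 and once as ASCII to show where the ASCII attempt gives up.

```python
raw = bytes([0xC3, 0xB6, 0xC3, 0xA1])  # the byte pairs quoted above

as_utf8 = raw.decode("utf-8")  # two characters: U+00F6 and U+00E1
print([hex(ord(c)) for c in as_utf8])

# Treating the same bytes as ASCII fails on the first byte >= 0x80.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
print(ascii_error)
```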
Hmm, actually I think I've translated the bug - before it was failing on UTF-16, now it's converting to UTF-8 and then failing on that.
What Python version are you running on? I put the steps to find out in a previous comment.
Asking because if I run Python 2.7 on my machine and type into the interactive prompt:
print "%s" % (u"\u0151")
print "%s" % (u"\u0151".encode("utf-8"))
... they both print ő correctly, so I'm a bit lost about why essentially the same code in yamdwe is raising an exception.
My Python version is 2.7.10.
I ran the two prints and got different results:
Python 2.7.10 (default, Jun 30 2015, 17:20:49)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "%s" % (u"\u0151")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0151' in position 0: ordinal not in range(256)
>>> print "%s" % (u"\u0151".encode("utf-8"))
ő
Oops, I said "Hopefully fixes" in the commit and it auto-closed the issue!
Can you try that? I think I have it figured out now, I managed to reproduce the same behaviour by setting my locale to iso88591 instead of utf-8. It should work now, please let me know.
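The locale dependence can be imitated without touching the real locale. The sketch below is for Python 3 (an assumption; the thread runs Python 2.7) and uses an in-memory stream with a fixed encoding as a stand-in for a terminal. The helper name `try_write` is hypothetical and not part of yamdwe.

```python
import io

def try_write(text, encoding):
    """Write text to an in-memory stream that encodes like a terminal would."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "failed: " + exc.reason
    return stream.buffer.getvalue()

# Same character, two output encodings.
latin1_result = try_write("\u0151", "latin-1")
utf8_result = try_write("\u0151", "utf-8")
print(latin1_result)
print(utf8_result)
```

With the latin-1 stream the write fails in the same way as the print under an ISO-8859-1 locale, while the UTF-8 stream accepts the character.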
This time the conversion got MUCH further! (That is, it solved a lot of problems...)
It then bumped into another character, though, which surprised me given how well the rest went:
Traceback (most recent call last):
File "./yamdwe.py", line 89, in
That's an "ű" from the word "ADATGYŰJTÉS". (It's a small page and the only occurrence.)
Do you think that changing the locale for the time of the conversion would help?
Sorry for the delay in looking at this.
> Do you think that changing the locale for the time of the conversion would help?
Yes, if you set your locale to utf-8 instead of iso8859-1 then these problems should probably go away.
However, it'd be great to have yamdwe work cleanly even with non-Unicode locales. If you don't mind, could you please try the latest revision with your current locale and see if it succeeds?
Angus
I will try the locale change to do the actual work, but I'm open to testing this issue all the way to make yamdwe better :)
After pulling from git, this is what I get:
Traceback (most recent call last):
File "./yamdwe.py", line 89, in <module>
main()
File "./yamdwe.py", line 62, in main
exporter.write_pages(pages)
File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 41, in write_pages
self._convert_page(page)
File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 97, in _convert_page
content = wikicontent.convert_pagecontent(full_title, revision["*"])
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 68, in convert_pagecontent
result = convert(root, context, False)
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
return self.call_internal(lambda f:f, args, kw)
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
result = func_modifier(self.registry[t])(*args, **kw)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 90, in convert
return convert_children(node, context)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
res = convert(child, context, result.endswith("\n"))
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
return self.call_internal(lambda f:f, args, kw)
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
result = func_modifier(self.registry[t])(*args, **kw)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 120, in convert
return result + convert_children(section, context)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
res = convert(child, context, result.endswith("\n"))
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
return self.call_internal(lambda f:f, args, kw)
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
result = func_modifier(self.registry[t])(*args, **kw)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 120, in convert
return result + convert_children(section, context)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
res = convert(child, context, result.endswith("\n"))
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
return self.call_internal(lambda f:f, args, kw)
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
result = func_modifier(self.registry[t])(*args, **kw)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 281, in convert
return convert_children(node, context)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
res = convert(child, context, result.endswith("\n"))
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
return self.call_internal(lambda f:f, args, kw)
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
result = func_modifier(self.registry[t])(*args, **kw)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 204, in convert
converted_list = convert_children(itemlist, context)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
res = convert(child, context, result.endswith("\n"))
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
return self.call_internal(lambda f:f, args, kw)
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
result = func_modifier(self.registry[t])(*args, **kw)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 210, in convert
item_content = convert_children(item, context)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
res = convert(child, context, result.endswith("\n"))
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
return self.call_internal(lambda f:f, args, kw)
File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
result = func_modifier(self.registry[t])(*args, **kw)
File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 180, in convert
return "[[%s]]" % pagename
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 15: ordinal not in range(128)
I tried with the locale set to en_US.UTF-8 and got stuck on the same page, but on a different byte:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)
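One possible reading of the two errors (an assumption, not confirmed in the thread) is that the same character reached the `"[[%s]]" % pagename` line as bytes in the locale's encoding: 0xe1 is á in latin-1, and 0xc3 is the first byte of á in UTF-8. The sketch below, written for Python 3, imitates Python 2's implicit ASCII decode on a made-up page name; `implicit_ascii` and the name itself are hypothetical.

```python
a_acute = "\u00e1"

latin1_bytes = a_acute.encode("latin-1")  # one byte
utf8_bytes = a_acute.encode("utf-8")      # two bytes
print(latin1_bytes, utf8_bytes)

def implicit_ascii(data):
    """Mimic Python 2 decoding a byte string as ASCII when it meets unicode text."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return "byte 0x%02x at position %d" % (data[exc.start], exc.start)

# Hypothetical page name, used only for illustration.
name = "p" + a_acute + "ge"
latin1_report = implicit_ascii(name.encode("latin-1"))
utf8_report = implicit_ascii(name.encode("utf-8"))
print(latin1_report)
print(utf8_report)
```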
I added the following after all the imports, at the beginning of wikicontent.py, and it worked. Hope it helps fix the bug:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
@alfredocambera thank you, it worked! (cc @projectgus)
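For anyone who would rather not change the interpreter-wide default, a more local variant might be to decode byte strings explicitly before formatting. This is only a sketch, assuming the page name arrives either as text or as UTF-8 bytes; `to_text` and `format_link` are hypothetical names and not functions from yamdwe.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def format_link(pagename):
    # Same shape as the failing "[[%s]]" % pagename line, but with the
    # operand converted to text first.
    return "[[%s]]" % to_text(pagename)

# Both a byte string and a text string give the same link text.
from_bytes = format_link(b"p\xc3\xa1ge")
from_text = format_link("p\u00e1ge")
print(from_bytes == from_text)
```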