latin-1 character encoding error

Open Bgs4269 opened this issue 9 years ago • 13 comments

After upgrading my source (see #31) I was able to import most of the mediawiki, but got stuck here:

Traceback (most recent call last):
  File "./yamdwe.py", line 93, in <module>
    main()
  File "./yamdwe.py", line 61, in main
    exporter.write_pages(pages)
  File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 41, in write_pages
    self._convert_page(page)
  File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 80, in _convert_page
    (len(page["revisions"]), page['title']))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0151' in position 46: ordinal not in range(256)

  1. Any way to find the "offending character" on the source page?
  2. Is this a source error or a yamdwe error?
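
One way to locate the offending character, sketched here in Python 3 (the thread itself runs Python 2.7), is to try encoding each character with the codec named in the error. The helper name `find_unencodable` and the sample title are made up for illustration; this is not part of yamdwe.

```python
def find_unencodable(text, encoding="latin-1"):
    """Return (position, character, code point) for each character the codec rejects."""
    problems = []
    for position, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((position, char, "U+%04X" % ord(char)))
    return problems


# Hypothetical page title containing one character outside latin-1.
sample_title = "Erd\u0151 page"
offenders = find_unencodable(sample_title)
print(offenders)
```

Running this over each page title from the export should point at the position reported in the traceback, assuming the title is what is being encoded.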

Bgs4269 Jul 31 '15 09:07

0151 should be a simple 'i'

Bgs4269 Jul 31 '15 09:07

I think the character it's failing on is ő, found somewhere in a page name. Python 2 and Unicode are a bit weird: the \u escape takes hex digits, so this is code point U+0151 (0x0151 in UTF-16).
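
To check what a `\u` escape refers to, the standard `unicodedata` module can name the code point. A small Python 3 sketch (the variable names are only illustrative):

```python
import unicodedata

char = "\u0151"  # the four hex digits after \u give the code point
code_point = ord(char)
name = unicodedata.name(char)

print("U+%04X" % code_point, name)

# latin-1 only covers code points 0..255, so this character cannot be encoded in it.
fits_in_latin1 = code_point <= 0xFF
print(fits_in_latin1)
```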

Can you try the latest update and see if it works OK now?

I really wish mwlib supported Python 3. Unicode is a second-class citizen in Python 2, and I don't know all of the weird "gotchas", so these bugs keep cropping up!

projectgus Aug 02 '15 03:08

If it still doesn't work, can you please tell me what version of python you're using? python -V will output it.

projectgus Aug 02 '15 03:08

Pulled from git. We are getting closer! :)

Traceback (most recent call last):
  File "./yamdwe.py", line 93, in <module>
    main()
  File "./yamdwe.py", line 61, in main
    exporter.write_pages(pages)
  File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 41, in write_pages
    self._convert_page(page)
  File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 80, in _convert_page
    (len(page["revisions"]), page['title'].encode("utf-8")))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

In reality they are utf-8 chars. For example: c3 b6 -> ö c3 a1 -> á

Looks like it gets the info of utf-8 encoding, but tries to handle it as ascii.
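
The byte pairs above can be checked directly. This Python 3 sketch decodes them as UTF-8 and then shows the ASCII codec rejecting the same bytes, which matches the failure in the traceback; the variable names are only for illustration.

```python
pairs = [bytes.fromhex("c3b6"), bytes.fromhex("c3a1")]

# Decoded as UTF-8, each two-byte sequence is a single character.
decoded = [raw.decode("utf-8") for raw in pairs]
print(decoded)

# Decoded as ASCII, the lead byte 0xc3 is out of range (ASCII stops at 0x7f).
try:
    pairs[0].decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc
print(ascii_error)
```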

Bgs4269 Aug 03 '15 11:08

Hmm, actually I think I've translated the bug - before it was failing on UTF-16, now it's converting to UTF-8 and then failing on that.

What Python version are you running on? I put the steps to find out in a previous comment.

Asking because if I run Python 2.7 on my machine and type into the interactive prompt:

print "%s" % (u"\u0151")
print "%s" % (u"\u0151".encode("utf-8"))

... they both print ő correctly, so I'm a bit lost about why essentially the same code in yamdwe is raising an exception.

projectgus Aug 03 '15 22:08

My Python version is 2.7.10.

I ran the two prints and got different results:

Python 2.7.10 (default, Jun 30 2015, 17:20:49)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "%s" % (u"\u0151")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0151' in position 0: ordinal not in range(256)
>>> print "%s" % (u"\u0151".encode("utf-8"))
ő

Bgs4269 Aug 04 '15 07:08

Oops, I said "Hopefully fixes" in the commit and it auto-closed the issue!

Can you try that? I think I have it figured out now, I managed to reproduce the same behaviour by setting my locale to iso88591 instead of utf-8. It should work now, please let me know.
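
The locale dependence can be imitated without changing the system locale. In this Python 3 sketch, a text stream with an explicit encoding stands in for stdout under each locale. `can_write` is a made-up helper, and this only approximates what `print` does in Python 2.

```python
import io


def can_write(text, encoding):
    """Return True if a text stream using `encoding` accepts `text`."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
        return True
    except UnicodeEncodeError:
        return False


# Stand-ins for an ISO 8859-1 locale and a UTF-8 locale.
print(can_write("\u0151", "iso-8859-1"))
print(can_write("\u0151", "utf-8"))
```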

projectgus Aug 04 '15 23:08

This time the conversion got MUCH further! (That is, it solved a lot of problems...)

It bumped into another character, though, which surprised me given how well the rest went:

Traceback (most recent call last):
  File "./yamdwe.py", line 89, in <module>
    main()
  File "./yamdwe.py", line 62, in main
    exporter.write_pages(pages)
  File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 41, in write_pages
    self._convert_page(page)
  File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 115, in _convert_page
    with codecs.open(changespath, "w" if is_first else "a", "utf-8") as f:
  File "/home/bgs/.virtualenvs/yamdwe/lib64/python2.7/codecs.py", line 884, in open
    file = __builtin__.open(filename, mode, buffering)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0171' in position 85: ordinal not in range(256)

That's an "ű" from the word "ADATGYŰJTÉS". (It's a small page and the only occurrence.)
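
This failure is in the file name rather than the file contents: `codecs.open` encodes what is written as UTF-8, but the path itself still has to be encoded for the operating system. A Python 3 sketch of the same character under both codecs (the variable names are illustrative, and the lower-cased word is used only as a sample name):

```python
import sys

name = "adatgy\u0171jt\u00e9s"  # sample file name containing U+0171

# UTF-8 can represent every character in the name.
utf8_name = name.encode("utf-8")

# latin-1 cannot represent U+0171, so encoding the name fails.
failing_char = None
try:
    latin1_name = name.encode("latin-1")
except UnicodeEncodeError as exc:
    latin1_name = None
    failing_char = name[exc.start]

print(sys.getfilesystemencoding())  # the codec this interpreter uses for paths
print(utf8_name, failing_char)
```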

Do you think that changing the locale for the time of the conversion would help?

Bgs4269 Aug 05 '15 08:08

Sorry for the delay in looking at this.

> Do you think that changing the locale for the time of the conversion would help?

Yes, if you set your locale to utf-8 instead of iso8859-1 then these problems should probably go away.

However, it'd be great to have yamdwe work cleanly even with non-Unicode locales. If you don't mind, could you please try the latest revision with your current locale and see if it succeeds?

Angus

projectgus Aug 13 '15 01:08

I will try the locale change to do the actual work, but I'm open to testing this issue all the way to make yamdwe better :)

After pulling from git, this is what I get:

Traceback (most recent call last):
  File "./yamdwe.py", line 89, in <module>
    main()
  File "./yamdwe.py", line 62, in main
    exporter.write_pages(pages)
  File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 41, in write_pages
    self._convert_page(page)
  File "/home/bgs/wiki/yamwde/yamdwe/dokuwiki.py", line 97, in _convert_page
    content = wikicontent.convert_pagecontent(full_title, revision["*"])
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 68, in convert_pagecontent
    result = convert(root, context, False)
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 90, in convert
    return convert_children(node, context)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 120, in convert
    return result + convert_children(section, context)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 120, in convert
    return result + convert_children(section, context)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 281, in convert
    return convert_children(node, context)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 204, in convert
    converted_list = convert_children(itemlist, context)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 210, in convert
    item_content = convert_children(item, context)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 80, in convert_children
    res = convert(child, context, result.endswith("\n"))
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 142, in __call__
    return self.call_internal(lambda f:f, args, kw)
  File "/home/bgs/wiki/yamwde/yamdwe/visitor.py", line 165, in call_internal
    result = func_modifier(self.registry[t])(*args, **kw)
  File "/home/bgs/wiki/yamwde/yamdwe/wikicontent.py", line 180, in convert
    return "[[%s]]" % pagename
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 15: ordinal not in range(128)
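
The last frame formats `pagename` into a string. In Python 2, combining a byte string that contains non-ASCII bytes with a unicode string triggers an implicit ASCII decode, which would explain this error. One common way to avoid that is to decode to text once, before formatting. A Python 3 sketch of the idea follows; the helpers `to_text` and `make_link` are hypothetical and not yamdwe's actual fix.

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to str; pass str through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def make_link(pagename):
    """Build a DokuWiki-style link from a page name given as bytes or str."""
    return "[[%s]]" % to_text(pagename)


# The same made-up page name, once as UTF-8 bytes and once as text.
link_from_bytes = make_link(b"h\xc3\xa1l\xc3\xb3zat")
link_from_text = make_link("h\u00e1l\u00f3zat")
print(link_from_bytes, link_from_text)
```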

Bgs4269 Aug 13 '15 12:08

I tried with the locale set to en_US.UTF-8 and got stuck on the same page, but on a different byte:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)
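
The change from byte 0xe1 to byte 0xc3 at the same position is consistent with the same character being encoded with the locale's codec before formatting. For example, á is the single byte e1 in ISO 8859-1 and the pair c3 a1 in UTF-8. A quick Python 3 check (that the character is á is an assumption; the thread does not say which one it was):

```python
char = "\u00e1"  # á, assumed here for illustration

latin1_bytes = char.encode("latin-1")
utf8_bytes = char.encode("utf-8")

# Print both encodings as hex for comparison with the two error messages.
print(latin1_bytes.hex(), utf8_bytes.hex())
```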

Bgs4269 Aug 13 '15 13:08

I added the following with the imports at the beginning of wikicontent.py, and it worked. Hope it helps fix the bug:

import sys  
reload(sys)  
sys.setdefaultencoding('utf8')
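
Note that this workaround is specific to Python 2, where it changes the codec used for implicit str/unicode conversions for the whole process. In Python 3 the default is already UTF-8 and `sys.setdefaultencoding` does not exist, as this short check illustrates (variable names are illustrative):

```python
import sys

# Codec used for implicit conversions; fixed to UTF-8 in Python 3.
default_encoding = sys.getdefaultencoding()

# The Python 2 setter used in the workaround above is not available in Python 3.
has_setter = hasattr(sys, "setdefaultencoding")

print(default_encoding, has_setter)
```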

alfredocambera Feb 29 '16 22:02

@alfredocambera thank you, it worked! (cc @projectgus)

miniBill Aug 04 '20 09:08