UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: code point not in range(0x110000)
Reported while trying to build datrie on the ppc64 architecture (big-endian) on openSUSE, as per (1). There is no failure on ppc64le, which is little-endian.
(1) https://build.opensuse.org/package/live_build_log/openSUSE:Factory:PowerPC/python-datrie/standard/ppc64
=== extract
[ 43s] __________________________________ test_keys ________
[ 43s]
[ 43s] def test_keys():
[ 43s] trie = _trie()
[ 43s] state = datrie.State(trie)
[ 43s] it = datrie.Iterator(state)
[ 43s]
[ 43s] keys = []
[ 43s] while it.next():
[ 43s] > keys.append(it.key())
[ 43s]
[ 43s] tests/test_iteration.py:85:
[ 43s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[ 43s] src/datrie.pyx:942: in datrie._TrieIterator.key (src/datrie.c:17947)
[ 43s] cpdef unicode key(self):
[ 43s] src/datrie.pyx:945: in datrie._TrieIterator.key (src/datrie.c:17845)
[ 43s] return unicode_from_alpha_char(key)
[ 43s] src/datrie.pyx:1111: in datrie.unicode_from_alpha_char (src/datrie.c:19975)
[ 43s] return c_str[:length].decode('utf_32_le')
[ 43s] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[ 43s]
[ 43s] input = <read-only buffer ptr 0x100070a81b0, size 16 at 0x3fffa166fb30>
[ 43s] errors = 'strict'
[ 43s]
[ 43s] def decode(input, errors='strict'):
[ 43s] > return codecs.utf_32_le_decode(input, errors, True)
[ 43s] E UnicodeDecodeError: 'utf32' codec can't decode bytes in position 0-3: code point not in range(0x110000)
[ 43s]
[ 43s] /usr/lib64/python2.7/encodings/utf_32_le.py:11: UnicodeDecodeError
===
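For context, the byte-order mismatch behind this traceback can be reproduced with a few lines of plain Python. This is only an illustration, not code from datrie; it assumes (as datrie.pyx does) that the key buffer holds 32-bit code points in the machine's native byte order:

# Illustration only (not datrie code): the trie key buffer holds 32-bit
# code points in the machine's native byte order, but it is always
# decoded as little-endian UTF-32.
import struct

# Simulate what a big-endian host puts in memory for the key "ab".
big_endian_buffer = struct.pack(">2I", ord("a"), ord("b"))

# The hard-coded codec misreads the byte order: the first four bytes
# 00 00 00 61 become code point 0x61000000, which is above 0x10FFFF.
try:
    big_endian_buffer.decode("utf_32_le")
except UnicodeDecodeError as exc:
    print(exc)  # same "code point not in range(0x110000)" error as in the log above

# Decoding with the byte order that matches the buffer works.
print(big_endian_buffer.decode("utf_32_be"))  # -> ab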
Hello,
we are seeing the same issue in Debian when building on big-endian platforms (failing on mips, ppc, sparc, etc.): https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=897094 https://buildd.debian.org/status/package.php?p=python-datrie&suite=sid
Hey!
Same issue in Fedora on the s390x architecture (a big-endian platform).
A simple fix would be to check whether the machine is little-endian or big-endian and then decode the string accordingly. Something like:
import sys

if sys.byteorder == "little":
    return c_str[:length].decode('utf_32_le')
else:
    return c_str[:length].decode('utf_32_be')
EDIT: It works fine when using this fix: https://koji.fedoraproject.org/koji/taskinfo?taskID=54903600
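For anyone who wants to try the idea outside the Cython module, here is a small self-contained sketch; the decode_native_utf32 helper is hypothetical and only mirrors the byte-order check proposed above:

import struct
import sys

def decode_native_utf32(buf):
    # Hypothetical helper mirroring the snippet above: assumes buf holds
    # 32-bit code points in the machine's native byte order, like the
    # AlphaChar buffer datrie copies from C.
    codec = 'utf_32_le' if sys.byteorder == "little" else 'utf_32_be'
    return buf.decode(codec)

# Build a native-order UTF-32 buffer for "abc" ("=" means native byte
# order with standard 4-byte unsigned ints) and round-trip it; this
# works on both little-endian and big-endian hosts.
native_buffer = struct.pack("=3I", *(ord(c) for c in "abc"))
assert decode_native_utf32(native_buffer) == "abc"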