simstring
Python Wrapper Raises TypeError for Unicode Objects
If you pass a Unicode object to SimString, the SWIG wrapper raises a TypeError, which may seem a bit opaque. Having read up on the issue, I agree with the SWIG design decision not to accept Unicode objects and to require an explicitly encoded string instead. I am simply opening this issue (which may of course be closed immediately) so that the behaviour is noted and other users have a point of reference if they run into similar problems, given that unicode and str objects can be used pretty much interchangeably in Python.
From the SWIG documentation:
At this time, SWIG provides limited support for Unicode and wide-character
strings (the C wchar_t type). Some languages provide typemaps for wchar_t, but
bear in mind these might not be portable across different operating systems.
This is a delicate topic that is poorly understood by many programmers and not
implemented in a consistent manner across languages. For those scripting
languages that provide Unicode support, Unicode strings are often available in
an 8-bit representation such as UTF-8 that can be mapped to the char * type
(in which case the SWIG interface will probably work). If the program you are
wrapping uses Unicode, there is no guarantee that Unicode characters in the
target language will use the same internal representation (e.g., UCS-2 vs.
UCS-4). You may need to write some special conversion functions.
The exception raised:
Traceback (most recent call last):
  File "./simstringutf.py", line <snip>, in <module>
    reader.retrieve(u_str)
  File "<snip>/simstring-1.0/swig/python/simstring.py", line 159, in retrieve
    def retrieve(*args): return _simstring.reader_retrieve(*args)
TypeError: in method 'reader_retrieve', argument 2 of type 'char const *'
Bourne shell to generate a suitable test database on most GNU/Linux systems:
cat /usr/share/dict/words | simstring --build --unicode --database=words.db
Python code to highlight the issue:
#!/usr/bin/env python
#vim:set encoding=utf-8
from simstring import reader as simstring_reader
from simstring import cosine as simstring_cosine
DB_PATH = 'words.db'
if __name__ == '__main__':
    reader = simstring_reader(DB_PATH)
    reader.measure = simstring_cosine

    # A plain (byte) string holding the UTF-8 encoding of the character;
    # named byte_str to avoid shadowing the built-in str
    byte_str = '鴨'
    # The same character as a unicode object
    u_str = unicode(byte_str, encoding='utf-8')

    # Will succeed
    print 'Trying to query using a plain string...',
    reader.retrieve(byte_str)
    print 'Done!'

    # Will succeed
    print 'Trying to query using a utf-8 encoded unicode string...',
    reader.retrieve(u_str.encode('utf-8'))
    print 'Done!'

    # Will fail with the TypeError shown above
    print 'Trying to query using a unicode string...',
    reader.retrieve(u_str)
    print 'Done!'
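For anyone who lands here with the same problem, one possible workaround is to encode at the call site through a small helper. The following is only a sketch: `to_utf8` and `retrieve_utf8` are hypothetical names, not part of SimString, and it assumes the database was built from UTF-8 encoded input. It is written for the Python 2 wrapper discussed in this issue; the version check is only there so the snippet also runs under Python 3, where I have not checked how the wrapper behaves.

```python
import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
# The conditional keeps the name `unicode` from being evaluated on Python 3.
TEXT_TYPE = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_utf8(query):
    """Return `query` as a UTF-8 byte string, encoding it only if it is text."""
    if isinstance(query, TEXT_TYPE):
        return query.encode('utf-8')
    # Already a byte string: assumed (not verified) to be UTF-8 encoded.
    return query


def retrieve_utf8(reader, query):
    """Hypothetical helper: encode `query` before calling reader.retrieve."""
    return reader.retrieve(to_utf8(query))
```

With this in place, `retrieve_utf8(reader, u_str)` would take the same path as the second (succeeding) query above, while byte strings are passed through untouched.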