
Text encoding

Open singpolyma opened this issue 14 years ago • 4 comments

Since all text in Xapian is UTF-8, strings coming back out of xapian-fu should be tagged as UTF-8 (probably just by calling force_encoding('utf-8') on strings as they come out).

Right now the strings come out marked as local encoding, but are actually utf-8, and this causes some problems.
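A minimal sketch of the symptom in plain Ruby, without xapian-fu: ISO-8859-1 stands in for the "local encoding" here (an assumption, the actual local encoding depends on the environment), and the string is built by hand rather than fetched from Xapian.

```ruby
# Bytes are UTF-8, but the string is tagged with a different encoding.
# ISO-8859-1 is only a stand-in for whatever the local encoding is.
raw = "caf\u00e9".dup.force_encoding("ISO-8859-1")

raw.encoding  # tagged ISO-8859-1, although the bytes are UTF-8
raw.length    # counts the two bytes of the accented letter separately

# force_encoding changes only the tag; the bytes stay as they are.
fixed = raw.dup.force_encoding("UTF-8")

fixed.encoding         # now tagged UTF-8
fixed.length           # the accented letter counts as one character
fixed.valid_encoding?  # the bytes form valid UTF-8
```

Because force_encoding does not touch the bytes, it is cheap and cannot raise a conversion error; it only corrects the label.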

singpolyma avatar Jul 09 '11 21:07 singpolyma

What if you set Encoding.default_external?
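For illustration, a sketch of what Encoding.default_external changes, using a file read as a stand-in; whether strings returned by the Xapian bindings pick up the same default is an assumption here, not something this example verifies. The helper name read_with_default_external is hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a temp file, read them back while
# the given default external encoding is in effect, then restore the old one.
def read_with_default_external(bytes, encoding)
  previous = Encoding.default_external
  Encoding.default_external = encoding
  result = nil
  Tempfile.create("encoding-demo") do |file|
    file.binmode
    file.write(bytes)
    file.flush
    # File.read tags the result with Encoding.default_external.
    result = File.read(file.path)
  end
  result
ensure
  Encoding.default_external = previous
end

utf8_bytes = "caf\u00e9".b

as_latin1 = read_with_default_external(utf8_bytes, Encoding::ISO_8859_1)
as_utf8   = read_with_default_external(utf8_bytes, Encoding::UTF_8)

as_latin1.encoding  # ISO-8859-1, so the text is mislabelled
as_utf8.encoding    # UTF-8, matching the actual bytes
```

Note that this setting is process-wide, so it affects every other library that reads external data.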

djanowski avatar Aug 12 '11 02:08 djanowski

Sure, I can get around it, but the point is that since all of the data is always in fact going to be UTF-8, the library should honour that.

singpolyma avatar Aug 16 '11 00:08 singpolyma

I guess that's right, as long as Xapian always stores/returns UTF-8.

What should we do when storing? Should an exception be raised if the string is not UTF-8?

djanowski avatar Aug 16 '11 15:08 djanowski

I'm not sure how the Xapian bindings handle things, but if they just use the raw bytestream and assume it's UTF-8 (because, yes, Xapian always stores/returns UTF-8), then you should probably call .encode('utf-8') on the string before storing it. If there's a problem, Ruby will throw the exception for you :)
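A rough sketch of that idea on the storing side. The helper to_utf8_for_storage is hypothetical and not part of xapian-fu. One caveat: encode('utf-8') does nothing when the string is already tagged UTF-8, even if its bytes are invalid, so the sketch adds a valid_encoding? check for that case.

```ruby
# Hypothetical helper, not part of xapian-fu: normalise a string to UTF-8
# before handing it to the bindings, raising if that is not possible.
def to_utf8_for_storage(value)
  # Transcodes from the string's tagged encoding; raises
  # Encoding::UndefinedConversionError when a byte has no UTF-8 mapping.
  utf8 = value.encode(Encoding::UTF_8)
  # encode is a no-op for strings already tagged UTF-8, so check the bytes too.
  raise ArgumentError, "invalid UTF-8 byte sequence" unless utf8.valid_encoding?
  utf8
end

# A Latin-1 string is transcoded: the single byte 0xE9 becomes two UTF-8 bytes.
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")
stored = to_utf8_for_storage(latin1)

# Binary data with high bytes cannot be converted, so Ruby raises.
begin
  to_utf8_for_storage("\xFF".b)
rescue Encoding::UndefinedConversionError => e
  binary_error = e
end
```

Raising on store keeps invalid data out of the index, at the cost of pushing the encoding decision onto the caller.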

singpolyma avatar Aug 16 '11 20:08 singpolyma