Unicode/Bytes Handling

Open bwhite opened this issue 13 years ago • 4 comments

I'm the author of Hadoopy and I have a question about your Unicode/Bytes mappings. I've tried to keep my typedbytes implementation byte-compatible with yours and to produce the same Python-side semantics; however, I recently noticed that the current mapping of Python strings (1) can cause problems with binary data and (2) behaves counter-intuitively with unicode strings. I am tempted to change this and wanted to see if I am missing something or if there is a clean solution that maintains compatibility.

The main issue is that unicode is mapped to a string (type code 7), but when it is parsed it comes back as a plain string, when it would make more sense to UTF-8 decode it and return a unicode object. Decoding would make sense because type code 7 is defined to be UTF-8 bytes (http://hadoop.apache.org/mapreduce/docs/r0.22.0/api/index.html?org/apache/hadoop/typedbytes/package-summary.html). However, since plain strings are also mapped to type code 7, some of them may contain arbitrary (non-UTF-8) values, which is presumably why you don't do the decoding (https://github.com/klbostee/typedbytes/blob/master/typedbytes.py#L145 isn't used). You have a Bytes class to differentiate, but I don't think this is necessary.
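To make the round-trip problem concrete, here is a minimal sketch of the behaviour described above. It is written for Python 3, where `bytes` stands in for the Python 2 `str` and `str` stands in for `unicode`. It assumes the wire layout from the Hadoop typed bytes documentation (a one-byte type code, a 4-byte big-endian length, then the payload). The helper names `write_legacy` and `read_legacy` are hypothetical and are not the typedbytes API.

```python
import struct

TYPE_STRING = 7  # documented as UTF-8 encoded text


def write_legacy(value):
    """Serialize text or raw bytes under type code 7, as the issue describes."""
    if isinstance(value, str):
        value = value.encode("utf-8")  # text is silently flattened to bytes
    return struct.pack(">Bi", TYPE_STRING, len(value)) + value


def read_legacy(buf):
    """Parse one type-7 record and return the payload without decoding it."""
    code, length = struct.unpack_from(">Bi", buf, 0)
    if code != TYPE_STRING:
        raise ValueError("unexpected type code %d" % code)
    return buf[5:5 + length]


text = "caf\u00e9"         # text containing a non-ASCII character
binary = b"\xff\xfe\x00"   # not valid UTF-8

roundtrip_text = read_legacy(write_legacy(text))
roundtrip_binary = read_legacy(write_legacy(binary))
```

Text goes in and raw bytes come out, so the reader can no longer tell the two inputs apart, and the binary payload ends up stored under a type code that claims to hold UTF-8.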

Current problems

  1. Unicode and strings cannot be distinguished after a round trip; input/output shouldn't change what the user sees.
  2. The Bytes class will be unnecessary in Python 3 and will cause more confusion there, because the string/bytes distinction will be obviously wrong, whereas now unicode is just silently converted to strings.

My proposed solution is

  1. Make Python strings map to type code 0, as they are not necessarily UTF-8 (which is the source of the problem).
  2. Make unicode map to type code 7, which means that it can be decoded properly.
  3. Make a conversion utility to convert old data from type code 7 to 0. If a UTF-8 decoding error occurs, the message could say that it may be due to this change and provide steps for fixing it. In this conversion it would be possible for some unicode to come back as strings; however, in the worst case this simply preserves the current semantics.
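The three steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3 (`bytes` for the Python 2 `str`, `str` for `unicode`), it assumes the documented layout of a one-byte type code followed by a 4-byte big-endian length, and the names `write_value`, `read_value` and `convert_legacy_record` are hypothetical rather than part of typedbytes or Hadoopy.

```python
import struct

TYPE_BYTES = 0   # arbitrary byte sequence
TYPE_STRING = 7  # UTF-8 encoded text


def write_value(value):
    """Step 1 and 2: raw bytes go to type code 0, text goes to type code 7."""
    if isinstance(value, bytes):
        code, payload = TYPE_BYTES, value
    elif isinstance(value, str):
        code, payload = TYPE_STRING, value.encode("utf-8")
    else:
        raise TypeError("only bytes and str are handled in this sketch")
    return struct.pack(">Bi", code, len(payload)) + payload


def read_value(buf):
    """Return bytes for type code 0 and decoded text for type code 7."""
    code, length = struct.unpack_from(">Bi", buf, 0)
    payload = buf[5:5 + length]
    if code == TYPE_BYTES:
        return payload
    if code == TYPE_STRING:
        try:
            return payload.decode("utf-8")
        except UnicodeDecodeError as exc:
            # Old data may hold binary under type code 7; point at the converter.
            raise ValueError(
                "type code 7 payload is not UTF-8; it may predate the mapping "
                "change, try convert_legacy_record()"
            ) from exc
    raise ValueError("unexpected type code %d" % code)


def convert_legacy_record(buf):
    """Step 3: relabel an old type-7 record as type 0, payload unchanged."""
    code, length = struct.unpack_from(">Bi", buf, 0)
    if code != TYPE_STRING:
        return buf
    return struct.pack(">Bi", TYPE_BYTES, length) + buf[5:5 + length]


# A legacy record that stored binary data under type code 7.
legacy_record = struct.pack(">Bi", TYPE_STRING, 2) + b"\xff\xfe"
converted = convert_legacy_record(legacy_record)
```

After conversion, genuine text that was relabelled as type code 0 comes back as bytes, which matches what a reader gets today.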

Questions

  1. Will this proposed solution work on the Java side? Does Java perform UTF-8 decoding of type code 7 values? (I haven't had a chance to look.)

I have been meaning to fix this but have been hesitant to do it on our side and break compatibility. Also, since you've surely run into this, you probably have an opinion on it.

bwhite avatar Jan 05 '12 23:01 bwhite

I switched my Python strings to type code 0, which still keeps compatibility, but I haven't forced type code 7 to be UTF-8 (it is not automatically decoded), since that could raise errors.

bwhite avatar Jan 05 '12 23:01 bwhite

Hey Brandyn,

It would indeed be more elegant to use unicode strings for typed bytes strings (in which case we could use plain strings for the bytes type), but we tried this before and then decided against it because it had a substantial impact on performance.

Strings are very common on Hadoop, so although I'd be very happy to switch to unicode in theory, I'm not keen to do so unless someone finds a way to adapt ctypedbytes (at least, and preferably typedbytes as well) so that it works at similar speeds with unicode...

Sorry for the late reply, -Klaas

klbostee avatar Jan 07 '12 16:01 klbostee

Do you recall what operations were slow using unicode? Was it decoding utf-8 or general string operations?

bwhite avatar Jan 11 '12 02:01 bwhite

Mostly the decoding, I think, but it could be both, I guess...

-K
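One way to separate the two costs would be a micro-benchmark along these lines. This is only a sketch: the payload, the function names and the iteration count are assumptions, it uses Python 3 (`bytes` versus `str`), and it does not measure ctypedbytes itself.

```python
import timeit

# Hypothetical sample payload: 100 bytes of ASCII text.
payload = ("word " * 20).encode("utf-8")


def keep_bytes(data):
    """Baseline: hand the payload back as raw bytes."""
    return data


def decode_text(data):
    """Candidate: UTF-8 decode the payload on every read."""
    return data.decode("utf-8")


def measure(func, number=100000):
    """Return the total seconds taken by `number` calls of func(payload)."""
    return timeit.timeit(lambda: func(payload), number=number)


if __name__ == "__main__":
    print("keep bytes :", measure(keep_bytes))
    print("decode text:", measure(decode_text))
```

Timing a string operation such as `split()` on both the bytes and the decoded text in the same way would show whether general string handling adds to the gap.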

klbostee avatar Jan 11 '12 08:01 klbostee