opencensus-python

Handle unicode and str in exporters for py2

Open guewen opened this issue 7 years ago • 7 comments

Before this commit, in py2 only byte strings were exported and in py3 only unicode strings were exported.

I'm not sure I'm doing it right, but it's at least an opening for a discussion.

Fixes #273

guewen avatar Aug 23 '18 06:08 guewen

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



googlebot avatar Aug 23 '18 06:08 googlebot

I signed it!

Hey, I signed the CLA :)

guewen avatar Aug 23 '18 07:08 guewen

CLAs look good, thanks!

googlebot avatar Aug 23 '18 07:08 googlebot

I know the tests don't pass on py3, but before refining I'd like confirmation that I understood the goal correctly and that I'm not heading in the wrong direction.

guewen avatar Aug 23 '18 07:08 guewen

Thanks for the PR @guewen. Using the six types to solve the problem of exporting unicode attribute values looks right to me.

In general though this problem is a big can of worms, and this PR makes it clear that we need to be more careful about internal use of the str type.

If I understand correctly: it looks like the library assumes string-valued attribute values are always strs. This is usually a safe assumption, but it means that we can't store non-ASCII characters in python 2.x. This is a problem for code that naively uses unicodes in place of strs since we'll silently fail to export these attributes.
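To make the failure mode concrete, here is a minimal sketch, not the library's actual exporter code. `format_attribute_value_strict`, `format_attribute_value` and `STRING_TYPES` are hypothetical names; `STRING_TYPES` mirrors what `six.string_types` provides and is defined inline only so the snippet has no dependencies.

```python
import sys

# Stand-in for six.string_types: (basestring,) on py2, (str,) on py3.
if sys.version_info[0] == 2:
    STRING_TYPES = (basestring,)  # noqa: F821 (name only exists on py2)
else:
    STRING_TYPES = (str,)


def format_attribute_value_strict(value):
    """Hypothetical exporter helper that only accepts ``str``."""
    if isinstance(value, str):
        return {'string_value': value}
    # On py2 a unicode value falls through here and is silently dropped.
    return None


def format_attribute_value(value):
    """Same hypothetical helper, accepting both str and unicode on py2."""
    if isinstance(value, STRING_TYPES):
        return {'string_value': value}
    return None


# unicode on py2, str on py3
text = b'caf\xc3\xa9'.decode('utf-8')
```

On py2, `format_attribute_value_strict(text)` returns `None` while `format_attribute_value(text)` returns the attribute; on py3 both accept `text`, which is why the problem only shows up under py2.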

So in python 2.x, strs are byte strings with no encoding attached (ASCII is only the default codec for implicit conversions). Decoding a byte string with any valid encoding gets you a unicode... which is not a str:

>>> type(b'abc')
str

>>> type(b'abc'.decode('utf-8'))
unicode

>>> isinstance(b'abc'.decode('utf-8'), str)
False

And in python 3.x strs are effectively python 2.x's unicodes, and byte strings are demoted to bytes with no implicit encoding:

>>> type(b'abc')
bytes

>>> type(b'abc'.decode('utf-8'))
str

We have to support both versions of python, and have to support non-ASCII characters in attribute values. But the spec also says to truncate these strings to 256 bytes without specifying an encoding.

In 2/3 decoding a byte string with any valid encoding gets you a unicode/str, which is stored internally as Unicode code points, using up to 4 bytes per character depending on the python implementation. Among other problems, this means that if we truncated based on that internal size, we might cut a 256 character string down to 64 characters even if it's possible to encode it with ASCII. This is a moot point now since it doesn't look like we're actually truncating these strings, but it does suggest we have to be careful making changes like this that add decode calls where byte strings would otherwise stay byte strings.
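As a rough illustration of why the choice of encoding matters for a byte limit, the sketch below truncates on UTF-8 bytes. UTF-8 is an assumption here, since the spec does not name an encoding, and `truncate_utf8` is a hypothetical helper, not something the library provides.

```python
def truncate_utf8(value, limit=256):
    """Truncate text to at most ``limit`` UTF-8 bytes, on a character boundary."""
    encoded = value.encode('utf-8')
    if len(encoded) <= limit:
        return value
    # The slice may end inside a multi-byte character; 'ignore' drops
    # that trailing partial sequence instead of raising.
    return encoded[:limit].decode('utf-8', 'ignore')


ascii_text = u'a' * 256
print(len(ascii_text.encode('utf-8')))      # 256 bytes: fits the limit
print(len(ascii_text.encode('utf-32-le')))  # 1024 bytes at 4 bytes per character
print(len(truncate_utf8(u'\u00e9' * 200)))  # 128 characters at 2 bytes each
```

The same 256-character ASCII string fits the limit when counted as UTF-8 but would be cut to 64 characters if counted at 4 bytes per character.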

c24t avatar Jan 22 '19 23:01 c24t

Which is all to say: the direction looks good, but there may be some unintended consequences.

c24t avatar Jan 22 '19 23:01 c24t

Thanks for your detailed answer. In particular, I wasn't aware of the 256-byte truncation (I'm new to the subject).

guewen avatar Feb 13 '19 16:02 guewen