tools-python

write_document() should support unicode in addition to str

Open sschuberth opened this issue 8 years ago • 7 comments

See the discussion at https://github.com/nexB/scancode-toolkit/issues/436#issuecomment-270935436.

sschuberth avatar Jan 10 '17 10:01 sschuberth

I suggest that we use unicode throughout and fail if something is not unicode and that we also run a CI test matrix covering Python 2 and 3.

pombredanne avatar Jan 10 '17 11:01 pombredanne
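
A minimal sketch of the "fail if something is not unicode" idea, written for Python 3 where `str` is the text type. The helper name `write_text` is hypothetical and not part of tools-python; on Python 2 the same check would be made against `unicode` instead.

```python
import io


def write_text(out, value):
    """Write value to out, refusing anything that is not text."""
    # On Python 3 `str` is the text type; Python 2 would check `unicode` here.
    if not isinstance(value, str):
        raise TypeError('expected text, got %s' % type(value).__name__)
    out.write(value)


buffer = io.StringIO()
write_text(buffer, '# Document Information\n')

# Byte strings are rejected up front instead of being written silently.
try:
    write_text(buffer, b'# Document Information\n')
except TypeError as error:
    rejected = str(error)
```

Failing early like this keeps a stray byte string from surfacing later as an encoding error somewhere far from the call that produced it.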

I think it is already supported. For instance, in examples/write_tv.py we open the file using the codecs module with utf-8 encoding. So even if we write out.write(u'# Documenting Information\n\n') instead of out.write('# Documenting Information\n\n') in write_document(), it will work.

rtgdk avatar Mar 15 '17 13:03 rtgdk

IIRC I was running into problems with Python 3, where str is what is unicode in Python 2. But I don't remember the details anymore.

sschuberth avatar Mar 15 '17 13:03 sschuberth

@sschuberth yes, str is unicode in Python 3, and Python 2's str is what Python 3 calls bytes. We want unicode throughout here, with the right adapters to work on both 2 and 3, such as here: https://github.com/nexB/license-expression/blob/master/src/license_expression/__init__.py#L36

pombredanne avatar Mar 15 '17 15:03 pombredanne
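
One common shape for such an adapter is sketched below. It is an illustration only, not a copy of the linked license-expression code; the names `text_type` and `to_text` are assumptions made for this example.

```python
# Python 2 has a separate `unicode` type; Python 3 folded it into `str`.
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings with the given encoding."""
    # `bytes` is an alias of `str` on Python 2, so this branch also
    # handles Python 2 byte strings.
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    raise TypeError('expected text or bytes, got %s' % type(value).__name__)
```

With a single alias like `text_type`, the rest of the code can do `isinstance` checks without branching on the interpreter version.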

@rtgdk The point here is to be unicode across the board, e.g. from __future__ import unicode_literals should be used everywhere. No more u'xxx'.

pombredanne avatar Mar 15 '17 15:03 pombredanne

@pombredanne My point was that the codecs module takes care of unicode. Since examples/write_tv.py uses the "utf-8" encoding, even if we pass in Python 3 bytes (Python 2 str), it is automatically encoded into unicode.

    with codecs.open(target, mode='w', encoding='utf-8') as out:
        try:
            write_document(document, out)

But if the user didn't use codecs.open and used open instead, this will not work. For that we can import unicode_literals, which turns the string literals in a module into unicode in Python 2 and doesn't affect str in Python 3. I'll open a PR for that. We can discuss the issue and improvements there.

rtgdk avatar Mar 15 '17 19:03 rtgdk
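
The difference between an encoding-aware handle and a raw one can be sketched as follows. This uses `io.open`, which exists on Python 2.6+ and is the built-in `open` on Python 3. The file name and sample text are made up for illustration, and the failure mode shown is the Python 3 one.

```python
import io
import os
import tempfile

text = 'comment: caf\u00e9'
target = os.path.join(tempfile.mkdtemp(), 'example.spdx')

# A text-mode handle with an explicit encoding accepts text and
# encodes it to UTF-8 on write.
with io.open(target, mode='w', encoding='utf-8') as out:
    out.write(text)

# A binary-mode handle only accepts bytes, so passing text raises
# TypeError on Python 3.
try:
    with io.open(target, mode='ab') as out:
        out.write(text)
    binary_write_failed = False
except TypeError:
    binary_write_failed = True

# Read the raw bytes back to see what actually landed on disk.
with io.open(target, mode='rb') as handle:
    raw = handle.read()
```

Only the first write reaches the file, so `raw` holds the UTF-8 encoding of `text`.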

This has mostly been merged, but some tests are still needed. In particular, the rdf output seems to write bytes, at least on Python 2, while the tv output writes happy unicode.

pombredanne avatar Jul 31 '17 16:07 pombredanne
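
One way such a test could look is sketched here, assuming the writer takes a file-like `out` argument. `io.StringIO` refuses byte strings with a `TypeError`, so it works as a strict sink. Both writer functions are hypothetical stand-ins, not the real tv or rdf writers.

```python
import io


def write_header(out):
    # Hypothetical stand-in for a writer that emits text.
    out.write('# Document Information\n\n')


def write_header_as_bytes(out):
    # Hypothetical stand-in for a writer that emits bytes by mistake.
    out.write(b'# Document Information\n\n')


def emits_text_only(writer):
    """Return True if writer only ever passes text to out.write()."""
    sink = io.StringIO()  # raises TypeError when given a byte string
    try:
        writer(sink)
    except TypeError:
        return False
    return True
```

Running each output format through a check like `emits_text_only` would flag a writer that still produces bytes.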

As we don't support Python 2 anymore, I believe this issue can be closed. Please speak up if it should be reopened.

armintaenzertng avatar Mar 30 '23 08:03 armintaenzertng