
[Bug] UnicodeDecodeError from v.pack or potentially other scripts that call read_command()

Open HuidaeCho opened this issue 4 years ago • 3 comments

Describe the bug I think we need a long-term solution, but for now, I'll try to explain my issues with Korean characters.

Not surprisingly, Microsoft chose to use their own proprietary charset "CP949" (a variant of EUC-KR) for Korean by default. OK, that's fine. GRASS's default charset for Korean is euc-kr (line 1415 in grass79.py). BUT, SQLite only supports UTF-8, so I have to choose between translated Korean messages and correct outputs from v.db.select, etc. To get correct outputs, I switch the codepage to "CP65001" (UTF-8) and set OUTPUT_CHARSET=CP65001 for gettext encoding in etc/env.bat.

When in CP949, aligned printing by G_*aprintf() doesn't work well, but I can read translated messages.

image

But this is a small inconvenience compared to the output of v.db.select because v.db.select prints UTF-8 characters into an EUC-KR terminal.

image

Yes, the underlined characters are broken (UTF-8 characters treated as EUC-KR). Other than alignment and SQLite outputs, everything else seems fine in CP949, including v.pack, whose issue I explain below.

For the above reason, I chose to use CP65001 as my default charset for GRASS (giving up text file compatibility with Windows). Now, v.db.select works great and v.info output is very clean.

image

However, Python still uses CP949 as default (I believe?) and many GRASS Python scripts that invoke read_command() do not provide a means for passing encoding='utf-8' to this function. In other words, many scripts try to print EUC-KR characters into a UTF-8 console, causing my reported issue. An easy but annoying and short-sighted fix is to add encoding='utf-8' to every single read_command() call (e.g., v.pack in my screenshot below).
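The charset mismatch can be reproduced outside GRASS. The following is a minimal sketch, not GRASS code: `try_decode` is a hypothetical helper, and the sample bytes merely stand in for module output captured from a console running in CP65001.

```python
# Sample text and its UTF-8 bytes, standing in for module output
# captured from a console switched to CP65001 (assumption for illustration).
SAMPLE = "한글"
utf8_bytes = SAMPLE.encode("utf-8")


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding with the matching charset recovers the original text.
as_utf8 = try_decode(utf8_bytes, "utf-8")

# Decoding the same bytes as CP949 either fails (None) or yields
# different characters; it cannot recover the original text.
as_cp949 = try_decode(utf8_bytes, "cp949")

print(repr(as_utf8), repr(as_cp949))
```

The same applies in the other direction: CP949 bytes decoded as UTF-8 will not round-trip either, so the decoder has to know which charset the producer actually used.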

Anyway, my request is to add the ability to pass a desired encoding to any functions that use read_command() or other functions that may output translated messages.
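One possible shape for this request is sketched below using only the standard library. Both `read_output` and `pack_info` are hypothetical names invented for illustration; they are not GRASS functions, and the child process simply emits fixed UTF-8 bytes in place of a real module.

```python
import locale
import subprocess
import sys


def read_output(cmd, encoding=None):
    """Hypothetical stand-in for a read_command()-style helper.

    Runs cmd, captures stdout as bytes, and decodes it with the
    caller-supplied encoding, falling back to the locale default.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode(encoding)


def pack_info(cmd, encoding=None):
    """Hypothetical script-level function that forwards the encoding."""
    return read_output(cmd, encoding=encoding).strip()


# Child process that writes the UTF-8 bytes for two Hangul syllables.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'\\xed\\x95\\x9c\\xea\\xb8\\x80')",
]

# The caller states the charset explicitly instead of relying on the locale.
text = pack_info(child, encoding="utf-8")
```

The point of the design is that every layer between the script and the decoder accepts and forwards `encoding`, so a user on a UTF-8 console does not depend on Python's locale default.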

To Reproduce Steps to reproduce the behavior:

  1. Change the locale setting of MS Windows to Korean
  2. Start GRASS
  3. set OUTPUT_CHARSET=CP65001
  4. chcp 65001
  5. v.pack any_vector

Expected behavior No errors.

Screenshots

image

I added encoding='utf-8' to line 199 in etc/python/grass/script/vector.py to fix this issue.

System description (please complete the following information):

  • Operating System: Windows
  • GRASS GIS version: master
  • Codepage: CP65001
  • set OUTPUT_CHARSET=CP65001

Additional context For our records, I have tried set PYTHONIOENCODING=utf8 and/or set PYTHONLEGACYWINDOWSSTDIO=yes (https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING) to no avail.
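As background for the note above, PYTHONIOENCODING only overrides the encoding of Python's own stdin/stdout/stderr; it does not change the locale default used when captured subprocess output is decoded, which may be why it had no effect here. A small diagnostic sketch (the `encoding_report` function is hypothetical, written for this illustration) shows the values involved:

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encodings relevant to this report into a dict."""
    return {
        # Default for open() and text-mode subprocess pipes when no
        # explicit encoding is given.
        "locale_preferred": locale.getpreferredencoding(False),
        # Affected by PYTHONIOENCODING; covers the standard streams only.
        "stdout": sys.stdout.encoding,
        # Environment variables mentioned in this report (None if unset).
        "OUTPUT_CHARSET": os.environ.get("OUTPUT_CHARSET"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


report = encoding_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

Running this inside the GRASS session before and after changing the codepage would show whether the locale default and the console charset disagree.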

HuidaeCho avatar Feb 24 '21 03:02 HuidaeCho

> GRASS's default charset for Korean is euc-kr (line 1415 in grass79.py).

Why is that? Why not UTF-8?

> An easy but annoying and short-sighted fix is to add encoding='utf-8' to every single read_command() call

If we are looking for a quick fix anyway, isn't putting encoding='utf-8' somewhere deep into read_command() implementation a better route?

wenzeslaus avatar Feb 24 '21 04:02 wenzeslaus

> > GRASS's default charset for Korean is euc-kr (line 1415 in grass79.py).
>
> Why is that? Why not UTF-8?

For file contents compatibility with other programs.

> > An easy but annoying and short-sighted fix is to add encoding='utf-8' to every single read_command() call
>
> If we are looking for a quick fix anyway, isn't putting encoding='utf-8' somewhere deep into read_command() implementation a better route?

Yes, that would be easier if other locale users are OK with UTF-8. Maybe I should forget about EUC-KR (CP949) and move to UTF-8 (CP65001) for Korean, if that's possible and supported by GRASS (e.g., read_command()).
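A central default of this kind could look roughly like the sketch below. This is an assumption about one possible design, not how read_command() is implemented: `decode_output` is a hypothetical function that prefers an explicit encoding, then tries UTF-8, and only then falls back to the locale charset.

```python
import locale


def decode_output(data, encoding=None):
    """Hypothetical central decoder, not GRASS's implementation.

    Uses the explicit encoding if given; otherwise tries UTF-8 first
    and falls back to the locale encoding with replacement characters.
    """
    if encoding is not None:
        return data.decode(encoding)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        fallback = locale.getpreferredencoding(False)
        # errors="replace" keeps the caller running on undecodable bytes.
        return data.decode(fallback, errors="replace")
```

Trying UTF-8 first is reasonably safe because text in legacy multibyte charsets such as CP949 rarely forms valid UTF-8, but users who rely on another locale charset would still need the explicit `encoding` argument to get exact results.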

HuidaeCho avatar Feb 24 '21 04:02 HuidaeCho

See also: https://trac.osgeo.org/grass/ticket/3220

marisn avatar Feb 24 '21 06:02 marisn