Support Non-UTF-8 terminal

Open akirayou opened this issue 3 years ago • 9 comments

Summary:

install_cmdstan and related functions fail on a non-UTF-8 terminal, for example on Japanese, Chinese, or Korean Windows.

Description:

The terminal encoding is normally sys.stdin.encoding, but CmdStanPy hardcodes 'utf-8'. Here is a workaround that uses the CMDSTAN_ENCODING environment variable to override the terminal encoding: https://github.com/akirayou/cmdstanpy/commit/db1431c13ee5ab35e0c440d686ed1eb79e0815e3

Additional Information:

To test the above patch, the test code has to be modified like this, because the Windows filesystem is case-insensitive: https://github.com/akirayou/cmdstanpy/commit/883ac4531fa3557a601e3387aa9d5b705610d7e3

Current Version:

https://github.com/stan-dev/cmdstanpy/commit/ecd844e9945ecf153be21ab69112d71862d2e0be

akirayou avatar May 22 '21 05:05 akirayou

Here is the error message:

WARNING:cmdstanpy:CmdStan installation failed.
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\akira\anaconda3\envs\cmdstanpy_test\lib\site-packages\cmdstanpy\utils.py", line 986, in install_cmdstan
    logger.warning(stderr.decode('utf-8').strip())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 839: invalid start byte

akirayou avatar May 22 '21 05:05 akirayou

What if we would use chardet library to detect encoding?

ahartikainen avatar May 22 '21 06:05 ahartikainen

Encoding estimation is just an estimation. I do not recommend it.

For example, "魍魎" (CP932/Japan: means 'goblins') == "敼斢" (CP950/Taiwan: means 'play music yellow') == 0xE9B1E9B2.

Estimating the encoding of a short string is unreliable, so I think a user-defined encoding or the system encoding is better. Also, I don't want to add unnecessary dependencies.
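
The ambiguity can be checked directly. A minimal sketch, assuming a CPython build that ships the standard cp932 and cp950 codecs; the byte string is the one quoted above, and the variable names are only illustrative:

```python
# The same four bytes are valid text in both code pages, so a detector
# has nothing in the data itself to tell the two encodings apart.
raw = bytes.fromhex("E9B1E9B2")

as_cp932 = raw.decode("cp932")  # Japanese Windows code page
as_cp950 = raw.decode("cp950")  # Traditional Chinese Windows code page

# ascii() keeps the output printable on any terminal encoding.
print(ascii(as_cp932), ascii(as_cp950))

# The hardcoded 'utf-8' decode rejects the same bytes outright.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Both code-page decodes succeed and produce different text, while the UTF-8 decode raises, which is the failure mode in the traceback.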

akirayou avatar May 22 '21 07:05 akirayou

We could probably go with system encoding + error handling (.decode(..., errors="replace"))?

ENV var is one option.
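
A minimal sketch of the "system encoding + error handling" idea, not CmdStanPy's actual code. The helper name decode_output is hypothetical, and it assumes that locale.getpreferredencoding(False) is an acceptable stand-in for the terminal encoding:

```python
import locale


def decode_output(raw: bytes) -> str:
    """Decode subprocess output with the system's preferred encoding.

    Bytes that cannot be decoded become U+FFFD instead of raising
    UnicodeDecodeError, so a log call can never crash the caller.
    """
    encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, errors="replace")


# Hypothetical usage in place of stderr.decode('utf-8'):
message = decode_output(b"installation failed \x8e").strip()
```

One caveat: on Windows the preferred encoding is the ANSI code page, which may differ from the code page the console is actually using, so this can still yield replacement characters rather than the original text.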

ahartikainen avatar May 22 '21 07:05 ahartikainen

In my environment (Windows), an ENV variable is needed. Python 3.8.4 on Win10 ignores the encoding setting for Python. If there is another way to get the terminal encoding, that's OK, but I don't know of one.

C:\Users\akira>set PYTHONIOENCODING=cp932
C:\Users\akira>python
Python 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'utf-8'

sys.stdin.encoding is 'utf-8', but I need 'cp932'.

On Ubuntu, sys.stdin.encoding follows the PYTHONIOENCODING environment variable.

$ PYTHONIOENCODING=cp932 python
Python 3.8.3 (default, Jul  2 2020, 16:21:59)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'cp932'

akirayou avatar May 22 '21 08:05 akirayou

Sorry, I solved it myself.

https://github.com/akirayou/cmdstanpy/commit/f6c7191de7f3ce7b2ecc5c160ac279fdc3bc8ec0?branch=f6c7191de7f3ce7b2ecc5c160ac279fdc3bc8ec0&diff=unified

import platform
import sys

TERMINAL_ENCODING = sys.stdin.encoding  # You can set this via the PYTHONIOENCODING environment variable
if platform.system() == "Windows":  # On Windows, get the terminal encoding via the Win32 API
    import win32console  # provided by the pywin32 package
    # Map Windows code page numbers to Python codec names:
    # https://docs.python.org/3/library/codecs.html#standard-encodings
    cp_to_enc={37:"cp037",273:"cp273",424:"cp424",437:"cp437",500:"cp500",720:"cp720",
        737:"cp737",775:"cp775",850:"cp850",852:"cp852",855:"cp855",856:"cp856",
        857:"cp857",858:"cp858",860:"cp860",861:"cp861",862:"cp862",863:"cp863",
        864:"cp864",865:"cp865",866:"cp866",869:"cp869",874:"cp874",875:"cp875",
        932:"cp932",949:"cp949",950:"cp950",1006:"cp1006",1026:"cp1026",1125:"cp1125",
        1140:"cp1140",1250:"cp1250",1251:"cp1251",1252:"cp1252",1253:"cp1253",
        1254:"cp1254",1255:"cp1255",1256:"cp1256",1257:"cp1257",1258:"cp1258",
        936:"gbk",819:"latin_1",1361:"johab",154:"ptcp154",65001:"utf-8",20127:"ascii",
        }
    console_cp = win32console.GetConsoleCP()  # query the console code page once
    if console_cp in cp_to_enc:
        TERMINAL_ENCODING = cp_to_enc[console_cp]
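
For comparison, the same lookup can be sketched with only the standard library. This is a swapped-in variant, not the code from the linked commit: it calls GetConsoleCP through ctypes instead of pywin32 and asks the codecs module whether "cp<number>" is a known codec instead of keeping a table. The function name terminal_encoding is hypothetical:

```python
import codecs
import ctypes
import platform
import sys


def terminal_encoding(default: str = "utf-8") -> str:
    """Best-effort guess of the terminal encoding (illustrative only)."""
    # sys.stdin can be None (e.g. under pythonw), hence getattr.
    encoding = getattr(sys.stdin, "encoding", None) or default
    if platform.system() == "Windows":
        # GetConsoleCP returns 0 on failure.
        code_page = ctypes.windll.kernel32.GetConsoleCP()
        if code_page:
            candidate = f"cp{code_page}"
            try:
                codecs.lookup(candidate)  # is this a codec Python knows?
            except LookupError:
                pass  # unknown code page: keep the fallback
            else:
                encoding = candidate
    return encoding
```

On non-Windows platforms the function simply returns sys.stdin.encoding, which already follows PYTHONIOENCODING as shown in the Ubuntu session above.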

akirayou avatar May 22 '21 09:05 akirayou

Making sure I understand what's going on here: is the fix to get your environment variables set correctly? If so, we should explain and document this for others in the CmdStanPy install docs.

mitzimorris avatar May 23 '21 18:05 mitzimorris

In the previous messages, I proposed 2 types of solutions, one is with ENV, another is without ENV. Both solutions are OK for me. Which one do you like?

1. With ENV (set via CMDSTAN_ENCODING; needs documentation for Windows users). Implementation example: https://github.com/akirayou/cmdstanpy/blob/db1431c13ee5ab35e0c440d686ed1eb79e0815e3/cmdstanpy/utils.py#L32-L34

Documentation example:

For non-English Windows users: you have to set the OS encoding (code page) via the CMDSTAN_ENCODING environment variable, for example CMDSTAN_ENCODING=cp932 in Japan.
All configurable encodings are listed at https://docs.python.org/3/library/codecs.html#standard-encodings .

2. Without ENV (get the OS encoding via the Win32 API; no documentation needed because the encoding is detected automatically). Implementation example: https://github.com/akirayou/cmdstanpy/blob/bugifx-candidate/terminal_encoding/cmdstanpy/utils.py#L33-L47

Documentation example: None
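
Option 1 can be sketched as follows. This is an illustration of the idea, not the code at the linked lines: the helper name resolve_encoding is hypothetical, and validating the name with codecs.lookup is an added assumption so that a typo in CMDSTAN_ENCODING fails early rather than at decode time:

```python
import codecs
import os
import sys


def resolve_encoding(env=os.environ) -> str:
    """Return the encoding named by CMDSTAN_ENCODING, else the stdin encoding."""
    requested = env.get("CMDSTAN_ENCODING")
    if requested:
        codecs.lookup(requested)  # raises LookupError for unknown names
        return requested
    # sys.stdin can be None in some embedded setups, hence getattr.
    return getattr(sys.stdin, "encoding", None) or "utf-8"


# Passing a plain dict stands in for the real environment here.
japanese = resolve_encoding({"CMDSTAN_ENCODING": "cp932"})
fallback = resolve_encoding({})
```

The explicit variable takes priority, which matches the trade-off described above: it needs documentation, but it works even when the console code page cannot be detected.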

akirayou avatar May 24 '21 00:05 akirayou

Are you using PowerShell or CMD.exe? In PowerShell, you should not be using 'set':

C:\Users\akira> $Env:PYTHONIOENCODING="cp932"
C:\Users\akira>python

This may also be relevant:

Changed in version 3.6: On Windows, the encoding specified by this variable is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSSTDIO is also specified. Files and pipes redirected through the standard streams are not affected.

Does this second environment variable work?

WardBrian avatar Sep 01 '21 18:09 WardBrian