cmdstanpy
Support Non-UTF-8 terminal
Summary:
install_cmdstan and related commands fail on a non-UTF-8 terminal, for example on Japanese, Chinese, or Korean Windows.
Description:
The terminal encoding is normally sys.stdin.encoding, but CmdStanPy hardcodes 'utf-8'. Here is a workaround: it uses the CMDSTAN_ENCODING environment variable to override the terminal encoding. https://github.com/akirayou/cmdstanpy/commit/db1431c13ee5ab35e0c440d686ed1eb79e0815e3
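The idea behind the workaround can be sketched as follows. This is a minimal illustration, not the code from the linked commit: the function name resolve_terminal_encoding and its parameters are hypothetical, and only the CMDSTAN_ENCODING variable name comes from the description above.

```python
import os
import sys


def resolve_terminal_encoding(environ=None, default="utf-8"):
    """Pick the encoding used to decode text coming from the terminal.

    Hypothetical helper: an explicit CMDSTAN_ENCODING value wins,
    then sys.stdin.encoding, then the given default.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("CMDSTAN_ENCODING")
    if override:
        return override
    # sys.stdin can be None (e.g. pythonw.exe), so read the attribute defensively.
    return getattr(sys.stdin, "encoding", None) or default
```

With CMDSTAN_ENCODING=cp932 set, the helper returns 'cp932' regardless of what sys.stdin.encoding reports.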
Additional Information:
To test the above patch, you also have to modify the test code like this, because the Windows filesystem is case-insensitive. https://github.com/akirayou/cmdstanpy/commit/883ac4531fa3557a601e3387aa9d5b705610d7e3
Current Version:
https://github.com/stan-dev/cmdstanpy/commit/ecd844e9945ecf153be21ab69112d71862d2e0be
Here is the error message.
WARNING:cmdstanpy:CmdStan installation failed.
Traceback (most recent call last):
File "
What if we used the chardet library to detect the encoding?
Encoding detection is only an estimate, so I do not recommend it.
For example, the bytes 0xE9B1E9B2 decode to "魍魎" in CP932 (Japan; it means 'goblins') and to "敼斢" in CP950 (Taiwan; roughly 'play music yellow').
Estimating the encoding of a short sentence is not easy, so I think a user-defined encoding or the system encoding is better. I also don't want to add unnecessary dependencies.
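The ambiguity described above can be reproduced with the standard library codecs alone. This is a small sketch using the byte sequence quoted in this comment; it prints escaped forms so that it also runs on terminals that cannot display either string.

```python
# The same four bytes are valid text in two different legacy code pages.
raw = bytes.fromhex("E9B1E9B2")

as_cp932 = raw.decode("cp932")  # Japanese Windows code page
as_cp950 = raw.decode("cp950")  # Traditional Chinese Windows code page

# ascii() escapes non-ASCII characters, so printing cannot fail on any console.
print(ascii(as_cp932), ascii(as_cp950))
print("same text?", as_cp932 == as_cp950)
```

Both decodes succeed without error, so a detector has nothing but statistics to choose between them.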
We could probably go with system encoding + error handling (.decode(..., errors="replace"))?
ENV var is one option.
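A rough sketch of the "system encoding + error handling" suggestion, assuming the system encoding is taken from locale.getpreferredencoding. The helper name decode_output is hypothetical and is not part of CmdStanPy.

```python
import locale


def decode_output(raw, encoding=None):
    """Decode subprocess output without raising on undecodable bytes.

    Hypothetical helper: falls back to the locale's preferred encoding
    and substitutes U+FFFD for any byte that cannot be decoded.
    """
    encoding = encoding or locale.getpreferredencoding(False)
    return raw.decode(encoding, errors="replace")
```

The trade-off is that a wrong encoding no longer crashes the installer but can silently produce garbled messages, which is why an explicit override may still be useful.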
In my environment (Windows), an ENV var is needed. Python 3.8.4 on Win10 ignores the encoding setting for Python. If there is another way to get the terminal encoding, that is fine too, but I don't know of one.
C:\Users\akira>set PYTHONIOENCODING=cp932
C:\Users\akira>python
Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'utf-8'
sys.stdin.encoding is 'utf-8', but I need 'cp932'.
On Ubuntu, sys.stdin.encoding can be set with the PYTHONIOENCODING environment variable.
$ PYTHONIOENCODING=cp932 python
Python 3.8.3 (default, Jul 2 2020, 16:21:59)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdin.encoding
'cp932'
Sorry, I solved it myself.
https://github.com/akirayou/cmdstanpy/commit/f6c7191de7f3ce7b2ecc5c160ac279fdc3bc8ec0?branch=f6c7191de7f3ce7b2ecc5c160ac279fdc3bc8ec0&diff=unified
import platform
import sys

TERMINAL_ENCODING = sys.stdin.encoding  # You can set this via the PYTHONIOENCODING environment variable
if platform.system() == "Windows":  # On Windows, get the console code page through the Win32 API
    import win32console  # provided by the pywin32 package
    # Map Windows code page numbers to Python codec names:
    # https://docs.python.org/3/library/codecs.html#standard-encodings
    cp_to_enc = {
        37: "cp037", 273: "cp273", 424: "cp424", 437: "cp437", 500: "cp500", 720: "cp720",
        737: "cp737", 775: "cp775", 850: "cp850", 852: "cp852", 855: "cp855", 856: "cp856",
        857: "cp857", 858: "cp858", 860: "cp860", 861: "cp861", 862: "cp862", 863: "cp863",
        864: "cp864", 865: "cp865", 866: "cp866", 869: "cp869", 874: "cp874", 875: "cp875",
        932: "cp932", 949: "cp949", 950: "cp950", 1006: "cp1006", 1026: "cp1026", 1125: "cp1125",
        1140: "cp1140", 1250: "cp1250", 1251: "cp1251", 1252: "cp1252", 1253: "cp1253",
        1254: "cp1254", 1255: "cp1255", 1256: "cp1256", 1257: "cp1257", 1258: "cp1258",
        936: "gbk", 819: "latin_1", 1361: "johab", 154: "ptcp154", 65001: "utf-8", 20127: "ascii",
    }
    console_cp = win32console.GetConsoleCP()
    if console_cp in cp_to_enc:  # keep sys.stdin.encoding for unknown code pages
        TERMINAL_ENCODING = cp_to_enc[console_cp]
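Since avoiding extra dependencies was a concern earlier in this thread, here is a possible variant that calls the same Win32 function through ctypes instead of pywin32. It is only a sketch: the function names are hypothetical, and it assumes the codec for a code page can be found under the name "cp<number>", which holds for many but not all entries in the table above.

```python
import codecs
import ctypes
import platform
import sys


def code_page_to_encoding(code_page, default="utf-8"):
    """Map a Windows code page number to a Python codec name, if one exists."""
    try:
        return codecs.lookup(f"cp{code_page}").name
    except LookupError:
        # No codec registered under "cp<number>"; keep the caller's default.
        return default


def console_encoding(default="utf-8"):
    """Return the console input encoding, querying Win32 only on Windows."""
    fallback = getattr(sys.stdin, "encoding", None) or default
    if platform.system() != "Windows":
        return fallback
    # GetConsoleCP returns 0 when the call fails (e.g. no console attached).
    code_page = ctypes.windll.kernel32.GetConsoleCP()
    if code_page == 0:
        return fallback
    return code_page_to_encoding(code_page, fallback)
```

On a Japanese console this would be expected to yield 'cp932'; on other platforms it simply returns sys.stdin.encoding.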
Making sure I understand what's going on here: is the fix to get your environment variables set correctly? If so, we should explain and document this for others in the CmdStanPy install docs.
In the previous messages, I proposed two solutions: one with an ENV var, the other without. Both are OK for me. Which one do you prefer?
1. With ENV (set by CMDSTAN_ENCODING; needs documentation for Windows users). Implementation example: https://github.com/akirayou/cmdstanpy/blob/db1431c13ee5ab35e0c440d686ed1eb79e0815e3/cmdstanpy/utils.py#L32-L34
Documentation example:
For non-English Windows users: you have to set the OS encoding (code page) via the CMDSTAN_ENCODING environment variable, for example CMDSTAN_ENCODING=cp932 in Japan.
All configurable encodings are listed at https://docs.python.org/3/library/codecs.html#standard-encodings .
2. Without ENV (get the OS encoding via the Win32 API; needs no documentation because the encoding is detected automatically). Implementation example: https://github.com/akirayou/cmdstanpy/blob/bugifx-candidate/terminal_encoding/cmdstanpy/utils.py#L33-L47
Documentation example: None
Are you using PowerShell or CMD.exe? In PowerShell, you should not be using 'set'.
PS C:\Users\akira> $Env:PYTHONIOENCODING = "cp932"
PS C:\Users\akira> python
This may also be relevant:
Changed in version 3.6: On Windows, the encoding specified by this variable is ignored for interactive console buffers unless PYTHONLEGACYWINDOWSSTDIO is also specified. Files and pipes redirected through the standard streams are not affected.
Does this second environment variable work?
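The last sentence of the quoted note (redirected files and pipes are not affected) can be checked without an interactive console. The sketch below starts a child interpreter with a piped stdin and reports the encoding it sees; the helper name child_stdin_encoding is hypothetical. It does not exercise the interactive console case, which is the one the question above is about.

```python
import os
import subprocess
import sys


def child_stdin_encoding(io_encoding):
    """Report sys.stdin.encoding of a child Python whose stdin is a pipe."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdin.encoding)"],
        input=b"",               # stdin is a pipe, not a console buffer
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    # Codec names are plain ASCII, so this decode is safe for any io_encoding.
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print(child_stdin_encoding("cp932"))
```

If the piped child honours the variable while the interactive prompt does not, that points at the console-buffer behaviour described in the note rather than at the variable being ignored entirely.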