Unicode
Opening this to collect unicode-related TODOs and fixes:

- [x] `_unicode` and `_encode` should be renamed - leading underscore means private --> `decode`, `encode`
- [x] decoder
- [ ] encoder
- [ ] encapsulate into bolt
- [x] `Path.temp` contains the actual code for the `unicodeSafe` hack, breaking the entirety of WB for users with non-ASCII usernames. Uproot that and move it into `unicodeSafe` -> done in 5b2776f0336d3e5b4c228dc5ee6438a66839b92d
- [x] once on py3, see if we can drop the `Path.unicodeSafe` hack -> done in 0f24a929121732a29140769832bb8fa3eb2c9354
- [ ] decoder and encoder need docs and testing -> some rough tests added in 4c1bc4e1220bdc76bf49244e478fafb08f59aeb8
- [ ] hunt down uses of `unicode()` -> replace as needed with `decode()` - also #460
- [ ] Process text to and from Unicode at the I/O boundaries using the ~~codecs~~ ~~io module~~ actually just `open` once we're on py3 - see http://stackoverflow.com/a/19591815/281545 and the sketch after this list -> with the exception of record signatures and possibly cosave etc ones? -> worth thinking about more, some of bosh will probably really want to keep using raw bytes since it's so low-level and its performance matters
- [ ] replace string names of encodings with the codecs constants
- [ ] avoid utf8-sig - the BOM windows thing that only serves to confuse other platforms

Tougher:

- [ ] consider dropping `mbcs` - it always encodes/decodes, but this is probably erroneous - better to blow up
- [ ] investigate what encodings different games (or game localized exes?) support in the various MelString fields - in particular:
  - [ ] author, description
  - [ ] masters:
    - [ ] update plugin txt reading code
    - [ ] update `_refreshBadNames`, `bad_names`, `activeBad` etc
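For the I/O-boundaries item above, a minimal sketch of the py3 endgame (file name and encoding are placeholders):

```python
# py3: open() does the decode on read and the encode on write, so no
# separate codecs/io-module layer is needed at the text I/O boundary.
with open('example.txt', 'r', encoding='utf-8') as ins:
    text = ins.read()        # str (unicode) from here on
with open('example.txt', 'w', encoding='utf-8') as out:
    out.write(text)          # encoded back to bytes at the boundary
```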
Just a small comment on this: `_unicode` and `_encode` were chosen because `encode` and `unicode` are builtins. `decode` is as well, unfortunately.
Thanks - I guessed so, but there is no shadowing in the bolt namespace and I could not think of a better name. I haven't committed on dev yet, so let me know if you can come up with another name.
More important is the encodingOrder -> see for instance https://github.com/wrye-bash/wrye-bash/commit/56058982870530936c763f2d709aa463f44b9c5e and the link to a forums report there - still waiting on feedback, but I guess mbcs should be bubbled up? Use `Path.sys_fs_enc`?
Definitely keep mbcs last there. The reason is that mbcs will never fail to encode/decode: even when characters cannot be represented in it, they're replaced by `?` IIRC. This is a problem because if it's actually a different codepage, the code will never even try the correct codepage, since mbcs threw no errors. So mbcs as a last resort. As for `Path.sys_fs_enc`: no, not really either. That's good when dealing with paths, but the main reason for these `_unicode` and `_encode` functions is working with text in various data fields in plugins, which almost never have anything to do with filenames. We're talking, for example, the DESC field in the TES4 record, things like that. No standard encoding was ever set forth for that (UTF-8 would have been nice, even UTF-16), so we kinda have to guess and check.
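For illustration, a minimal sketch of that guess-and-check approach (the helper name and encoding order are made up, not bolt's actual code):

```python
# Illustrative only - the key point from above: mbcs must come last,
# because it never raises, it silently substitutes '?' for characters
# it cannot represent.
_encoding_order = ('ascii', 'utf8', 'cp1252', 'mbcs')

def guess_decode(byte_str):
    for enc in _encoding_order:
        try:
            return byte_str.decode(enc)
        except (UnicodeDecodeError, LookupError):  # no mbcs off Windows
            continue
    return byte_str.decode('utf8', 'replace')  # last resort
```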
Hmmm thanks:
>>> s=u'汉语 / 漢語'
>>> s.encode('utf8').decode('mbcs').encode('utf8')
'\xc3\xa6\xc2\xb1\xe2\x80\xb0\xc3\xa8\xc2\xaf\xc2\xad / \xc3\xa6\xc2\xbc\xc2\xa2\xc3\xa8\xc2\xaa\xc5\xbe'
>>> s.encode('utf8').decode('utf8').encode('utf8')
'\xe6\xb1\x89\xe8\xaf\xad / \xe6\xbc\xa2\xe8\xaa\x9e'
>>> s='汉语 / 漢語'
>>> s.decode('utf8').encode('utf8').decode('utf8')
u'\u6c49\u8bed / \u6f22\u8a9e'
>>> s.decode('utf8').encode('mbcs').decode('utf8')
u'?? / ??'
>>> s.decode('utf8').encode('mbcs').decode('mbcs')
u'?? / ??'
If it never fails, then probably I should not commit https://github.com/wrye-bash/wrye-bash/commit/56058982870530936c763f2d709aa463f44b9c5e - better let it blow and give me back the `__repr__`.
My goal is to stop having random Unicode tracebacks thrown at me - so I would ideally decode early (so internally I only work with unicode) and encode late (for stdout etc). But apparently there is no way to cater for this in a unified fashion (using decode/encode helpers that use the chardet library).
At least in the liblo etc wrappers I would like to get rid of `_enc()`, `_uni()` - should I use `encode` and `decode` there, or should I try utf-8 and then fall back to `Path.sys_fs_enc`?
I'm not 100% sure, run this by @WrinklyNinja, but I believe libbsa, the boss api, and loot api all expect filenames passed to them to be given as UTF-8 encoded strings, and return the same. So for those (if that's the case), you could just replace `_enc(str)` and `_uni(str)` with `str.encode('utf-8')` and `unicode(str, 'utf-8')` (or `str.decode('utf-8')`). For CBash I honestly am not sure what it uses, so I'd leave those as-is. If I ever get the time, or if someone else slightly familiar with CBash (again, @WrinklyNinja would be the best bet here) can look at the code and see what it's doing, then I could have an answer for you there.
My APIs all expect UTF-8, yes. CBash is as much a mystery to me as everyone else though.
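Given that, the `_enc()`/`_uni()` calls at those boundaries could reduce to plain UTF-8 conversions - a sketch (the helper names are placeholders, not the wrappers' actual API):

```python
# Hypothetical replacements for _enc()/_uni() at the libbsa/boss/loot
# boundary, relying on those APIs speaking UTF-8 as confirmed above.
def to_api(text):
    """unicode -> UTF-8 bytes for the C API."""
    return text.encode('utf-8')

def from_api(raw):
    """UTF-8 bytes from the C API -> unicode."""
    return raw.decode('utf-8')
```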
@Utumno Woooo, I got my first unicode issue! And it's inis, again:
Traceback (most recent call last):
File "bash\bash.pyo", line 206, in main
File "bash\bash.pyo", line 271, in _main
File "bash\bash.pyo", line 392, in _detect_game
File "bash\initialization.pyo", line 159, in init_dirs
File "ConfigParser.pyo", line 305, in read
File "ConfigParser.pyo", line 512, in _read
MissingSectionHeaderError: File contains no section headers.
file: D:\SteamLibrary\steamapps\common\Skyrim Special Edition\Skyrim.ini, line: 1
'\xff\xfe[\x00H\x00A\x00V\x00O\x00K\x00]\x00\r\x00\n'
The ini in question is attached. Note the encoding:
>>> import chardet
>>> chardet.detect(open('Skyrim.ini').read())
{'confidence': 1.0, 'language': '', 'encoding': 'UTF-16'}
Seems like ConfigParser doesn't like that at all. So we'll have to add some detection there. I tried this:
diff --git a/Mopy/bash/initialization.py b/Mopy/bash/initialization.py
index 4ccdf4db2..c99c0fc1f 100644
--- a/Mopy/bash/initialization.py
+++ b/Mopy/bash/initialization.py
@@ -28,7 +28,7 @@
# Local - don't import anything else
from . import env
from .bass import dirs, get_ini_option
-from .bolt import GPath, Path
+from .bolt import GPath, Path, getbestencoding
from .env import get_personal_path, get_local_app_data_path
from .exception import BoltError, NonExistentDriveError
@@ -152,9 +152,12 @@ def init_dirs(bashIni_, personal, localAppData, game_info):
data_oblivion_ini = dirs['app'].join(game_info.iniFiles[0])
game_ini_path = dirs['saveBase'].join(game_info.iniFiles[0])
dirs['mods'] = dirs['app'].join(u'Data')
- if data_oblivion_ini.exists():
+ if data_oblivion_ini.isfile():
oblivionIni = ConfigParser(allow_no_value=True)
- oblivionIni.read(data_oblivion_ini.s)
+ with open(data_oblivion_ini.s, u'rb') as ins:
+ ini_enc = getbestencoding(ins.read())[0]
+ with data_oblivion_ini.open(u'r', encoding=ini_enc) as ins:
+ oblivionIni.readfp(ins)
# is bUseMyGamesDirectory set to 0?
if get_ini_option(oblivionIni, u'bUseMyGamesDirectory') == u'0':
game_ini_path = data_oblivion_ini
And it seems to work fine. But you're the unicode / encoding expert around here, so this is probably not ideal :P (needs to open the file twice for a start...).
Edit: pushed my fix in 38a2d383f87168f0d735ba861adb1cec61484fe7 if you want to take a look.
To avoid reading the file twice I would do something like:
try:
    # first try the default encoding - enough for the vast majority of inis
    with data_oblivion_ini.open(u'r') as ins:
        oblivionIni.readfp(ins)
except (UnicodeDecodeError, MissingSectionHeaderError):
    # MissingSectionHeaderError (from ConfigParser) is what the UTF-16 ini
    # above actually raises when read with the wrong encoding
    with open(data_oblivion_ini.s, u'rb') as ins:
        ini_enc = getbestencoding(ins.read())[0]
    with data_oblivion_ini.open(u'r', encoding=ini_enc) as ins:
        oblivionIni.readfp(ins)
Not the pinnacle of elegance, but better than reading the file twice when for the vast majority of users the second read won't be needed.
That being said, is the game even able to read that ini to begin with? Anyway, Bash should be able to run without the game ini - over the years I have kind of centralized this (and added a corrupted attribute), but maybe it's high time to fix this once and for all, with a try/except that at the end catches yet-unknown exceptions and warns somehow.
That's much better, thanks! Pushed it to nightly.
I have no idea if the game handles it correctly, but the user didn't seem to have problems, so probably? UTF-16 is pretty common in the Windows world after all.
BAIN seems to be completely unable to install files with non-Western characters in the name. Either that or something is messed up on my end, because I have no idea how no one would have noticed this over the years 🤔
Try installing this archive via BAIN: Unicode Test.zip. Depending on how python/pycharm/pipenv/pywin32 is feeling right now, doing so either hard-crashes, shows an error in the bashbugdump, or shows a Windows dialog saying that the filename is 'either too long or invalid' for me.
Noticed this just now when I tried installing FNIS through BAIN, which popped up the 'too long or invalid' dialog for some of FNIS' translation files (since they use such unicode characters):
tools\GenerateFNIS_for_Users\languages\Bahasa Indonesia.txt
tools\GenerateFNIS_for_Users\languages\Deutsch.txt
tools\GenerateFNIS_for_Users\languages\English.txt
tools\GenerateFNIS_for_Users\languages\Español.txt
tools\GenerateFNIS_for_Users\languages\Français.txt
tools\GenerateFNIS_for_Users\languages\Italiano.txt
tools\GenerateFNIS_for_Users\languages\Magyar (Hungarian).txt
tools\GenerateFNIS_for_Users\languages\Norsk Bokmål.txt
tools\GenerateFNIS_for_Users\languages\Polski.txt
tools\GenerateFNIS_for_Users\languages\Português.txt
tools\GenerateFNIS_for_Users\languages\Srpski (Serbian).txt
tools\GenerateFNIS_for_Users\languages\Svenska (Swedish).txt
tools\GenerateFNIS_for_Users\languages\Български (Bulgarian).txt
tools\GenerateFNIS_for_Users\languages\Русский (Russian).txt
tools\GenerateFNIS_for_Users\languages\中文 (Chinese simpl.).txt
tools\GenerateFNIS_for_Users\languages\日本語 (Japanese).txt
tools\GenerateFNIS_for_Users\languages\語言 (Chinese trad.).txt
tools\GenerateFNIS_for_Users\languages\한국어 (Korean).txt
I did some testing re: the point above, and it seems to fail when it gets to SHFileOperation. I tried encoding the source/target that's passed to it, but all encodings except MBCS either caused UnicodeEncodeErrors (e.g. ASCII, obviously) or caused SHFileOperation to fail. MBCS didn't change anything, I still got the Windows dialog saying the filename is 'either invalid or too long'.
Oh, and I can manually extract them to my Data folder using 7zip's file manager just fine, so they're not too long and NTFS seems to handle them fine, it must be something specific to SHFileOperation.
> BAIN seems to be completely unable to install files with non-Western characters in the name. Either that or something is messed up on my end, because I have no idea how no one would have noticed this over the years 🤔
I actually remember having this problem around 11-12 years ago and mentioning it in the Wrye Bash forum thread at the time. Basically, an Oblivion mod I can't remember the name of contained a shortcut with a German name in its Meshes folder, and I had to delete the shortcut to install that mod. IIRC, the reply I got was that fixing this would have to be part of a larger Unicode overhaul.
Well, the larger unicode overhaul did happen: 52515a8ec586844d043ed636d3f260a28b969d1f
Wrye Bash can handle unicode fine, for the most part. It just seems that SHFileOperation really doesn't play well with non-Western characters - e.g. installing täst.esp works fine, but when I add chinese characters, it suddenly doesn't work anymore.
> MBCS didn't change anything, I still got the Windows dialog saying the filename is 'either invalid or too long'.
Probably because the name was not encoded correctly in the first place (so the file was not found) - mbcs won't fail, it will add question marks or something.
SHFileOperation is deprecated - I tried replacing that code with IFileOperation but failed - not even sure that one is exposed in pywin32, I think it was not - plus the API is not any better, if not worse (rant: why is it so difficult to implement an atomic file operation with a clear return value, OS? https://stackoverflow.com/a/29077006/281545) - I gave up -> edit: https://github.com/wrye-bash/wrye-bash/commits/IFileOperation, don't even remember what this was about, see https://stackoverflow.com/questions/16867615/copy-using-the-windows-copy-dialog/19989764#comment62515234_19989764
If it turns out the (probably unmaintained) SHFileOperation is the problem, we would maybe have to add an exception of sorts - or reimplement it (see the sketch after the links below). Dropping it is an option, but yeah, native OS APIs sound like a good idea in this case (recycle bin, support for skip etc?).
edit: see also:
- https://github.com/wrye-bash/wrye-bash/issues/180#issuecomment-259148738
- https://docs.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-shfileoperationa
- https://docs.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-shfileoperationw
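Until that's sorted out, one option would be a plain-Python fallback when the shell operation chokes on a name - a rough sketch only (`shell_copy` stands in for the existing SHFileOperation-based code):

```python
import shutil

# Hypothetical fallback: try the shell first (keeps the recycle bin /
# skip UI), fall back to a dumb Python copy if it fails on a
# non-Western filename. Loses the shell niceties for those files.
def copy_with_fallback(src, dst, shell_copy):
    try:
        shell_copy(src, dst)
    except OSError:
        shutil.copy2(src, dst)
```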
> around 11-12 years ago

@XJDHDR that was back when I was a user of WB myself :P
I don't know what was up with MBCS: in pycharm the encoded string did contain a bunch of question marks, as expected, but Windows itself seemed to somehow still figure out what it meant, since the 'invalid or too long' popup correctly showed the offending filename as Русский (Russian).txt 🤷‍♀️.
@Utumno Dump Translator is heavily broken. After it was mentioned in #273 I got the basic part working again (it can write new translation files). Then I was dumbfounded that it never wrote any translated lines into the resulting file - added a deprint to the except: and got this:
localize.py 257 dump_translator: Error while dumping translation file:
Traceback (most recent call last):
File "bash\localize.py", line 234, in dump_translator
translated = _(stripped)
File "C:\Python27-32\lib\gettext.py", line 480, in ugettext
return unicode(message)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)
No wonder. dump_translator doesn't even begin to worry about encodings. I don't know how this ever worked, and it definitely won't work on py3. Not sure I want to invest the effort of rewriting this implementation now though :/
Whelp, nevermind that. Got it working correctly with German, now to see if it'll work for Chinese/Russian too.
Edit: Works correctly for Chinese and Russian too :)
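For reference, a rough sketch of the missing decode step (hypothetical - the actual fix in localize.py may differ; it assumes the input file is UTF-8):

```python
import gettext
import io

# The point: read the file as *text*, so _() receives unicode instead
# of bytes that gettext would implicitly try to decode as ASCII.
_ = gettext.NullTranslations().ugettext  # stand-in for the installed translator
with io.open(u'source.txt', u'r', encoding=u'utf-8') as ins:
    for line in ins:
        stripped = line.strip()
        translated = _(stripped)  # unicode in -> no UnicodeDecodeError
```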
@Utumno Nasty traceback:
Traceback (most recent call last):
File "bash\balt.pyo", line 1827, in __Execute
File "bash\balt.pyo", line 890, in _conversation_wrapper
File "bash\basher\installers_links.pyo", line 195, in Execute
File "bash\bosh\bain.pyo", line 2667, in bain_anneal
File "bash\bosh\bain.pyo", line 2625, in _remove_restore
File "bash\bosh\bain.pyo", line 1711, in irefresh
File "bash\ini_files.pyo", line 694, in setBsaRedirection
File "bash\ini_files.pyo", line 644, in saveSetting
File "bash\ini_files.pyo", line 299, in saveSettings
File "bash\ini_files.pyo", line 279, in _open_for_writing
IOError: [Errno 2] No such file or directory: u'C:\Users\Захар\Documents\My Games\Oblivion\Oblivion.ini_unicode_safe.tmp'
Seems to happen because the username is non-ASCII, so the path doesn't encode as ASCII. Path.temp doesn't know that, so it tries to just replace the non-ASCII chars with escape sequences. That obviously doesn't work because the part that isn't ASCII is the username, not the filename.
The obvious solution would be to set `unicodeSafe=False` in `_open_for_writing`, but that's impossible because `temp` is a property - why does it even have that parameter?? 🤔
Edit: maybe we should drop the unicodeSafe part of `temp` altogether - we already have a `unicodeSafe` function we can use for the (hopefully) few cases where that behavior is actually needed (i.e. subprocess calls). So we could move the unicode-safe code in there and call `my_path.temp.unicodeSafe()` in those places.
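A rough sketch of what such a standalone helper could look like (purely illustrative - the real bolt code and the inf-232-unicodeSafe branch may differ):

```python
import os
import tempfile

# Illustrative only: return the path unchanged if it survives an ASCII
# round-trip, otherwise hand out an ASCII-only temp alias for legacy
# consumers (subprocess calls etc); callers copy content to/from it.
def unicode_safe(path):
    try:
        path.encode('ascii')
        return path  # already safe, no hack needed
    except UnicodeEncodeError:
        fd, alias = tempfile.mkstemp(suffix=os.path.splitext(path)[1])
        os.close(fd)
        return alias
```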
> maybe we should drop the unicodeSafe part of temp altogether

Looks like a better approach overall - py3 would handle that more easily, you think?
I'll put together a quick POC and let the user who reported that test it. Only thing I'm afraid of is that this might turn into yet another game of whack-a-mole where more places that need unicodeSafe pop up.
Branch is up at inf-232-unicodeSafe.
@wrye-bash/bashers The os.environ traceback isn't fixed after all:
Traceback (most recent call last):
File "bash\bash.pyo", line 263, in main
File "bash\bash.pyo", line 358, in _main
File "zipextimporter.pyo", line 74, in load_module
File "bash\basher\__init__.pyo", line 122, in <module>
File "bash\env\common.pyo", line 82, in set_env_var
File "encodings\mbcs.pyo", line 21, in decode
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1099-1103: ordinal not in range(128)
Full bashbugdump: BashBugDump.log.txt
Hehe:
elif isinstance(env_value, unicode):
    # Expected, but os.environ uses bytes
    env_value = env_value.decode(_fsencoding)
shouldn't that be encode? :P
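Presumably the one-line fix is just flipping the call (a sketch; `_fsencoding` as in the snippet above):

```python
env_value = env_value.encode(_fsencoding)  # os.environ on py2 stores bytes
```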
Apparently py2 ignores the encoding argument when you call decode on a unicode object - it first tries to *encode* it back to bytes using ASCII (hence the UnicodeEncodeError), then decode that:
PS C:\Users\Infernio> py -2
Python 2.7.18 (v2.7.18:8d21aa21f2, Apr 20 2020, 13:25:05) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'äbc'.decode('mbcs')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\mbcs.py", line 21, in decode
return mbcs_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
>>> u'äbc'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
>>> u'äbc'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
Py3 on the other hand:
PS C:\Users\Infernio> py -3
Python 3.9.1 (tags/v3.9.1:1e5d33e, Dec 7 2020, 17:08:21) [MSC v.1927 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'äbc'.decode('mbcs')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> 'äbc'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> 'äbc'.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
Why??? Who on earth thought that py2 behavior was a good idea?
Well, it really makes no sense to decode a unicode string - on py3 the call blows up (AttributeError: 'str' object has no attribute 'decode'). As to why: all this was thought convenient at the time, till people realized it caused more problems than it solved. Anyway, since calling decode on unicode makes no sense, the python devs first encode it (with ASCII, the default) and then decode.
decode vs encode used to confuse the hell out of me in my early days - seems pretty trivial now (and much simpler on py3)
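The whole mental model fits in three lines (py3 syntax; the bytes are the UTF-8 encoding of u'汉语' from the session above):

```python
raw = b'\xe6\xb1\x89\xe8\xaf\xad'    # UTF-8 bytes for '汉语'
text = raw.decode('utf-8')           # decode: bytes -> str ("to unicode")
assert text.encode('utf-8') == raw   # encode: str -> bytes ("to bytes")
```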
Ah yes, the tried and true: decode an already decoded string 🤦 I hate typos.
EDIT: I SO cannot wait for the py3 upgrade. Python 2 being "helpful" and saying: oh, you wanted to decode this unicode string? You must have meant to give me bytes! Here, let me encode that for you - but I'm going to do it using the ASCII codec, because clearly that's the right thing and you didn't just accidentally call the wrong function!
Like, if there isn't a 100% correct way to convert unicode to bytes in all circumstances, you shouldn't just implicitly do it for me. UGH!! That's only my biggest complaint about py2, of course.