Windows encoding errors; utf-8 vs. wchar_t

Open dale-wahl opened this issue 4 years ago • 0 comments

I noticed that some analytics will not work natively on Windows because of the Windows default encoding (wchar_t). For example, a file that is opened without an explicit encoding and then passed to csv's DictWriter is written in the default locale encoding. This can cause a UnicodeEncodeError whenever a row contains characters that encoding cannot represent.
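To make the failure easier to reproduce on any platform, here is a minimal sketch that forces a non-UTF-8 encoding on the output file. The choice of cp1252 is an assumption standing in for whatever the Windows locale reports, and `write_rows` and the column names are hypothetical, not 4CAT code.

```python
import csv
import io


def write_rows(rows, encoding):
    """Write rows through csv.DictWriter into an in-memory file using the given encoding."""
    buffer = io.BytesIO()
    text = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    writer = csv.DictWriter(text, fieldnames=["id", "body"])
    writer.writeheader()
    writer.writerows(rows)
    text.flush()
    text.detach()  # keep the underlying buffer open after the wrapper goes away
    return buffer.getvalue()


rows = [{"id": 1, "body": "日本語"}]

# cp1252 (assumed stand-in for the Windows locale encoding) cannot encode these characters
try:
    write_rows(rows, "cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

# the same rows are written without error when the encoding is UTF-8
utf8_bytes = write_rows(rows, "utf-8")
```

The error comes from the text layer that encodes the output, not from DictWriter itself, which is why it only shows up for rows containing characters outside the target encoding.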

I am not sure how to resolve this systematically yet. Adding encoding='utf-8' to the affected processors resolved my issue, but I think that solution is likely to cause further problems (for example, whenever we later read from the same file and 4CAT falls back to the Windows default encoding).
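The per-processor fix only holds up if the read side names the same encoding as the write side. A minimal sketch of that pairing follows; `write_csv`, `read_csv`, `FIELDNAMES` and the file name are hypothetical and not taken from 4CAT.

```python
import csv
import os
import tempfile

FIELDNAMES = ["id", "body"]  # hypothetical column names


def write_csv(path, rows):
    """Write rows as UTF-8, regardless of the platform's default encoding."""
    # newline="" is what the csv module documentation recommends for csv files
    with open(path, "w", encoding="utf-8", newline="") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read rows back with the same explicit encoding used for writing."""
    with open(path, "r", encoding="utf-8", newline="") as infile:
        return list(csv.DictReader(infile))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    write_csv(path, [{"id": "1", "body": "naïve 日本語"}])
    result = read_csv(path)
```

The downside is the one described above: every open() call that touches these files has to carry the argument, and a single missed call falls back to the platform default.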

I did find a perhaps more all-encompassing solution:

# add this to config.py
import os

if os.name == "nt":
    import _locale
    # keep a reference to the original lookup, then always report utf8 as the encoding
    _locale._gdl_bak = _locale._getdefaultlocale
    _locale._getdefaultlocale = (lambda *args: (_locale._gdl_bak()[0], 'utf8'))

This sets the default encoding to utf-8 if running on Windows. It is a more comprehensive solution, though there could be problems if a user attempts to upload a document to 4CAT or 4CAT otherwise tries to access a file that was encoded in something else. I'm testing it at the moment to see if I run into any issues.
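One possible way to soften the problem with files encoded in something else is to try UTF-8 first and fall back to a second encoding when decoding fails. This is only a sketch under assumptions: `read_text_with_fallback` is a hypothetical helper, cp1252 is an assumed fallback, and a fallback that decodes without error can still produce the wrong characters.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate encoding that decodes the file."""
    with open(path, "rb") as infile:
        raw = infile.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode " + path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "upload.txt")
    # simulate an uploaded file that was saved in cp1252 rather than UTF-8
    with open(path, "wb") as outfile:
        outfile.write("café".encode("cp1252"))
    text, used = read_text_with_fallback(path)
```

Reading the raw bytes first keeps the decision in one place, so the rest of the code only ever sees decoded text plus the name of the encoding that worked.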

dale-wahl, Jul 22 '21 09:07