JSON API: Encoding?

Open enkore opened this issue 7 years ago • 13 comments

From #2249

Or we fix IO encoding to UTF-8 (irrespective of locale) in JSON mode, which probably makes more sense and is less error prone for downstream developers anyway.

I.e., --log-json → stderr text is always UTF-8, stdin text (so not "borg create -") is always UTF-8 (prompts, passwords), env-vars are read as UTF-8 (ditto).

--json → stdout text is always UTF-8
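A minimal sketch of what pinning the text streams to UTF-8 could look like in Python. The helper names `utf8_text_layer` and `force_utf8_stdio` are hypothetical and only illustrate the idea; this is not borg's actual implementation.

```python
import io
import sys


def utf8_text_layer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8.

    The locale is not consulted because the encoding is given explicitly.
    """
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


def force_utf8_stdio():
    """Hypothetical helper: make stdout/stderr text UTF-8 regardless of locale."""
    sys.stdout = utf8_text_layer(sys.stdout.buffer)
    sys.stderr = utf8_text_layer(sys.stderr.buffer)
```

Writing `"bäckup"` through such a wrapper yields the bytes `b"b\xc3\xa4ckup"` even when the process runs under a C or latin1 locale.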

enkore avatar Mar 08 '17 16:03 enkore

Attached to 1.1 milestone.

enkore avatar Aug 06 '17 00:08 enkore

As this is a quite fundamental change, guess it should get some testing, in rc2.

Guess we won't have problems if the json is written to a file, not so sure about when it gets output on the console. Or piped and processed badly by other side.

ThomasWaldmann avatar Aug 15 '17 17:08 ThomasWaldmann

Guess we won't have problems if the json is written to a file

When you write text (strings) to stdout/stderr, then they are encoded to bytes using an encoding guessed by Python. That's independent of whether stdout is connected to a TTY/terminal or redirected to a file.
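To see which encoding Python picked in a given environment, something like the following can be run there. It is only a diagnostic sketch; `describe_stdio_encodings` is a made-up name.

```python
import locale
import sys


def describe_stdio_encodings():
    """Report the encodings Python chose for text I/O in this process."""
    return {
        # Encoding used when str objects are written to stdout/stderr.
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
        # Encoding derived from the locale settings.
        "locale_preferred": locale.getpreferredencoding(False),
    }
```

Running it under different `LANG`/`LC_ALL` settings shows how the result depends on the environment of the caller.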

enkore avatar Aug 16 '17 19:08 enkore

Yes, but you suggested to always use utf-8.

So how does e.g. a cygwin console or latin1/ascii console react when you output utf-8 on it? Guess we could live with funny characters, but it shouldn't crash or hang.

Or when doing borg --json | othertool (and othertool guesses encoding), it might guess wrong when utf-8 is not the native system / fs encoding? If othertool is specialized on borg, it would use the right encoding, but if not, could it be told to use utf-8?
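For illustration, a consumer that does not guess could read the raw bytes from the pipe and decode them explicitly. This sketch assumes the producer really emits UTF-8 (which is exactly what is being proposed here); `parse_json_output` is a hypothetical helper.

```python
import json


def parse_json_output(raw: bytes):
    """Decode bytes read from a pipe as UTF-8, then parse them as JSON.

    Decoding explicitly avoids depending on the consumer's own locale.
    """
    return json.loads(raw.decode("utf-8"))
```

A tool written in Python would obtain `raw` by reading the child's stdout in binary mode, for example via `subprocess.run(..., stdout=subprocess.PIPE).stdout`.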

ThomasWaldmann avatar Aug 16 '17 20:08 ThomasWaldmann

Depends on #2925;

Note: borg list already uses UTF-8 regardless of system preference (via safe_encode), but only for listing archive contents.

Yes, but you suggested to always use utf-8.

The alternative is to make step one of using --json: "Replicate the way Python guesses encodings [which changes over Python releases]." i.e. "Use Python.". That's not acceptable.

enkore avatar Aug 18 '17 14:08 enkore

Well, a completely ASCII-safe way could be to use unicode-escape, so that all non-ASCII characters are escaped as \u...
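With Python's json module this kind of escaping is what `ensure_ascii=True` (the default) does. A small sketch, with `to_ascii_json` as a hypothetical wrapper name:

```python
import json


def to_ascii_json(obj):
    """Serialize to JSON using only ASCII; non-ASCII characters become \\uXXXX."""
    return json.dumps(obj, ensure_ascii=True)


text = to_ascii_json({"path": "ä"})
# text is the str: {"path": "\u00e4"}  (a literal backslash followed by u00e4)
```

The result can be encoded with any ASCII-compatible codec, and a JSON parser on the other side turns the escapes back into the original characters.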

RonnyPfannschmidt avatar Aug 18 '17 14:08 RonnyPfannschmidt

Just as an idea: why not skip support for latin1 and other non-unicode terminals now?

latin1 symbols inside utf8 will look the same on a latin1 terminal, I suppose. Other symbols will not be readable anyway; there is no good solution for this. And there is iconv for those who need 8-bit encodings and know what they are doing.

knutov avatar Aug 25 '17 00:08 knutov

@enkore still working / do you still want to work on this?

ThomasWaldmann avatar Aug 27 '17 12:08 ThomasWaldmann

I prepared an initial, functionally incomplete patch that I was completely dissatisfied with. I've been working to fix this for good by changing most of these interactions (os.environ, input/yes, get_passphrase, ...) to go through an iosys class that determines the encoding (from Python) and decodes stuff. But this is still incomplete and touches many of the more annoying parts of the code, so it may be reasonable to just go forward with rc2 and perhaps even 1.1.0 without having this resolved yet. On most (Linux/BSD) systems it will "mostly just work", because UTF-8 is a very widespread locale codeset and typically assumed. (OpenBSD has an especially good grip on things here for a Unix, because they only support UTF-8 and 7-bit ASCII.) In this case it may be best to add a short note in the docs to say that the encoding will be finalized to UTF-8 later.

This will fall apart on Linux when no locale is configured (because Python will fall back to 7-bit ASCII), or when glibc thinks no locale is configured or considers the configuration invalid (e.g. partial or missing locale files). And of course with every locale that is not UTF-8.
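As a rough illustration of the environment-variable part only: Python decodes os.environ with the filesystem encoding, and `os.fsencode` can recover the original bytes so they can be decoded as UTF-8 instead. The helper `getenv_utf8` below is hypothetical and is not the patch described above.

```python
import os


def getenv_utf8(name, default=None, environ=None):
    """Read an environment variable and interpret its raw bytes as UTF-8.

    os.fsencode() reverses the decoding Python applied when it built
    os.environ, which gives back the bytes the parent process passed in.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value is None:
        return default
    raw = os.fsencode(value)
    return raw.decode("utf-8")  # raises UnicodeDecodeError on non-UTF-8 input
```

Whether invalid UTF-8 should raise, be replaced, or be passed through is a design decision this sketch leaves open.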

enkore avatar Aug 27 '17 15:08 enkore

OK, so let's have some docs now and the fix later.

ThomasWaldmann avatar Aug 27 '17 19:08 ThomasWaldmann

From https://docs.python.org/3.4/library/sys.html#sys.stdin / sys.stdout / sys.stderr:

The character encoding is platform-dependent.
Under Windows, if the stream is interactive (that is, if its isatty() method returns
True), the console codepage is used, otherwise the ANSI code page.
Under other platforms, the locale encoding is used (see locale.getpreferredencoding()).

Under all platforms though, you can override this value by setting the
PYTHONIOENCODING environment variable before starting Python.

More recent docs:

  • https://docs.python.org/3.8/library/sys.html#sys.stdin
  • https://docs.python.org/3.10/library/sys.html#sys.stdin
  • https://docs.python.org/3.10/c-api/init_config.html#c.PyConfig.stdio_encoding
  • https://docs.python.org/3.8/library/os.html#os.environ refers to getfilesystemencoding:
  • https://docs.python.org/3.8/library/sys.html#sys.getfilesystemencoding

ThomasWaldmann avatar Sep 18 '17 03:09 ThomasWaldmann

OK, so let's have some docs now and the fix for 1.1.1 or so.

The docs were added on Sep 9, 2017 in #3019 (document utf-8 locale requirement for json mode). It looks like you forgot to remove the "documentation" label.

hexagonrecursion avatar Dec 07 '21 10:12 hexagonrecursion

Thanks for the hint, I removed the documentation label.

ThomasWaldmann avatar Dec 07 '21 10:12 ThomasWaldmann

For stdin/stdout/stderr and JSON emitted on stdout (see frontends.rst), guess we could extend #3019 and just point there from the docs, so users invoking borg can adjust their environment variables if they do not use a locale with utf-8 encoding already:

https://docs.python.org/3.8/library/sys.html#sys.stdin reads:

Under all platforms, you can override the character encoding by setting the PYTHONIOENCODING environment variable before starting Python or by using the new -X utf8 command line option and PYTHONUTF8 environment variable. However, for the Windows console, this only applies when PYTHONLEGACYWINDOWSSTDIO is also set.

ThomasWaldmann avatar Jan 29 '23 17:01 ThomasWaldmann

Hmm, did I miss something or can this "fix" just be recommending to use PYTHONIOENCODING=utf-8 if one expects (JSON) streams to be in utf-8 on a legacy OS installation that does not use a utf-8 locale already?

PYTHONUTF8 seems way too intrusive and influences how a lot of stuff works - this could break stuff that worked before.
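The effect of PYTHONIOENCODING on a child Python process can be checked with a small sketch like this one. `child_stdout_encoding` is a hypothetical helper, and the child here is a plain Python interpreter, not borg.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Start a child Python and return the stdout encoding it reports."""
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a known state
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout.decode("ascii").strip()
```

Calling `child_stdout_encoding("utf-8")` reports a UTF-8 stdout for the child regardless of the locale, while calling it without an argument shows what the child would pick on its own.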

ThomasWaldmann avatar Jan 29 '23 18:01 ThomasWaldmann