borg icon indicating copy to clipboard operation
borg copied to clipboard

json: Output not guaranteed to be valid UTF-8

Open sophie-h opened this issue 3 years ago • 3 comments

The JSON standard requires strings to be valid UTF-8. There are a bunch of JSON API fields that can contain invalid strings.

  • The b-prefixed fields "intentionally" output invalid strings
  • borg info --json seems to output everything without removing invalid strings parts (for example name or command_line)

Maybe it would be good if borg info and borg list are based on the same metadata source to not do the escape etc. at two points in code.

My current suggestion for 1.2:

  • All field names not prefixed use remove surrogates for their contents
  • We deprecate the use of b-prefixed fields and do not add new ones to borg info. Instead, the borg info --json output has a slightly incompatible change to remove surrogates.
  • As an incompatible change, barchive is no longer contained in borg list for archives per default. Not using the field but having it in the output could already break JSON parsers. It can be added by --fomat {barchive}.
  • We add something like base64_ prefixed fields, probably for everything string. I'm not sure, but I wouldn't be surprised if fields like hostname can also be invalid UTF-8 in theory.

I'm also fine with dropping the b-prefixed fields. Not sure if that breaks anything.

sophie-h avatar Jan 22 '22 17:01 sophie-h

related: #2273

ThomasWaldmann avatar Jan 22 '22 17:01 ThomasWaldmann

Docs say

--json: Note: JSON can only represent text. A “barchive” key is therefore not available.

Even though it appears by default

We add something like base64_ prefixed fields, probably for everything string. I'm not sure, but I wouldn't be surprised if fields like hostname can also be invalid UTF-8 in theory.

iirc Borg uses the resolver to reverse 127.0.0.1 or something like that, so hypothetically it should be ASCII, though some other ways to determine the hostname involve parsing /etc/hosts or sticking /etc/hostname in front of the search domain from /etc/resolv.conf -- the latter two don't cause hangs during network problems.

So... yeah... just offering a b64 field for everything that could be not-strictly-text is a good idea.

The b-fields would only work with a receiver that also supports Python-style surrogate escaping. Likely never used.

enkore avatar Jan 24 '22 20:01 enkore

@ThomasWaldmann I would suggest removing the b-fields from 2.0 because they can deliver invalid JSON. Potentially add base64 or something like that instead.

sophie-h avatar Jul 17 '22 15:07 sophie-h

Actually, for 2.0, maybe it makes sense to just not allow binary values for things like comments, etc.

  • Reject command execution if the value is not valid Unicode
  • Strip invalid utf8 from output if it is caused by bit flip or whatever

sophie-h avatar Oct 07 '22 00:10 sophie-h

https://github.com/borgbackup/borg/pull/7197 barchive, bcomment, bpath were all removed.

  • barchive and bcomment is not needed any more because borg2 will validate these more strictly, so we won't have binary stuff in there. this includes borg transfer rejecting to transfer such archives.
  • bpath was just removed for now and a better solution needs to get implemented. as unix allows non-unicode / binary stuff in paths, we can not avoid that.

ThomasWaldmann avatar Dec 15 '22 18:12 ThomasWaldmann

List of affected Item attributes:

  • path is surrogate-escaped str (s-e-str). paths can have all sorts of weird binary stuff on unix, sometimes old files with non-unicode encodings, samba shares especially.
  • source is s-e-str (same issue as with path, just for symlink targets ehrm "sources")
  • user is s-e-str
  • group is s-e-str
  • acl_* are all bytes
  • hlid is bytes
  • chunks / chunks_healthy is [(bytes, int), ...]
  • xattrs is {bytes: bytes, ...}

List of affected Archive attributes:

  • name is str (solved in borg2, validated unicode)
  • comment is str (solved in borg2, validated unicode)
  • hostname is s-e-str
  • username is s-e-str
  • cmdline, recreate_cmdline is [s-e-str, ...]

ThomasWaldmann avatar Dec 15 '22 18:12 ThomasWaldmann

@sophie-h please check PR #7232. ^^^

ThomasWaldmann avatar Dec 29 '22 00:12 ThomasWaldmann

considering there was no feedback, i hope everyone is happy with it. will merge the PR soon.

ThomasWaldmann avatar Jan 16 '23 16:01 ThomasWaldmann