usql icon indicating copy to clipboard operation
usql copied to clipboard

Control characters in text field breaking usql output

Open trantor opened this issue 1 year ago • 6 comments

Hello @kenshaw . I've been dealing with control characters that are present in a text field of a database I need to access. As I can see in this code https://github.com/xo/tblfmt/blame/1af8a162785fd2d26eddb90fbd8ad9d407b3408d/fmt.go#L389 instead of being outputted literally and then, for instance, properly encoded in JSON output, they are rendered as, for instance \x1c for the U+001C character. Apart from behaving differently from every other tool I've used on the database in question (they all output the literal character), it then completely breaks the JSON output, by putting in it the illegal \x sequence. The other options in the switch block of the aforementioned code do not seem much better. Is there a way to just get the raw data in the output? I can't find one in the documentation. There is also the fact that the various output format, JSON for instance, will encode characters in different ways (e.g. the surrogate pairs JSON uses for characters outside the BMP). Also, silently modifying the contents of the data in an arbitrary way without the user being aware of it does seem an approach prone to nasty surprises for the user.

trantor avatar Dec 30 '24 14:12 trantor

The formatter can be fixed, fairly easily, to accommodate this for JSON output. It would be helpful if you could share an example of the row(s) with the bad JSON data, and what other tools do. For the most part, it should be simply changing \x to \u escape codes.

kenshaw avatar Jan 01 '25 23:01 kenshaw

I've pushed a change to tblfmt that should fix this issue for you. Can you please locally update tblfmt in the usql repo, and test against your data set:

$ git clone https://github.com/xo/usql.git && cd usql
$ go get github.com/xo/[email protected]
$ ./build.sh -b && ./usql

Let me know if this fixes all the issues for you. I'll tag a change with this shortly.

kenshaw avatar Jan 02 '25 00:01 kenshaw

@kenshaw I had done much the same as a fix to have a working output. I will try your change as soon as I can manage. However I am wondering about the approach that doesn't differentiate between JSON output and others and does the encoding before choosing the output format.. I've also noticed that characters outside BMP are rendered as-is and there is no option to force ASCII-only output for JSON.

trantor avatar Jan 02 '25 17:01 trantor

@kenshaw I've checked out the JSON output and it seems to be correct now. What I don't understand is how one can get the output text as-is, i.e. raw without usql tampering with it. Also, it's unclear to me what the rationale is to display \x, \u or \U escapes for character according specific unicode ranges, especially with no mention whatsoever of this choice anywhere in the documentation. And what happens with other machine-readable outputs that are not JSON?

trantor avatar Jan 06 '25 00:01 trantor

JSON is a specific encoding standard -- IIRC, it's quite old (predating UTF-8), not friendly to non ASCII characters, and has some non-intuitive encoding requirements.

If you want "raw" character codes from your database, it would probably be better to use the database's actual client, and other tools to encode it to JSON. Alternately, you might want to try the CSV output.

kenshaw avatar Jan 06 '25 07:01 kenshaw

@kenshaw I think I might not have expressed myself properly. When I talked about raw output, I wasn't talking about JSON output. I was talking about the text output. I noticed how the CSV output prints out the data as they actually are, which is good news. However I don't understand the choice of those 3 different \x, \u or \U escapes in the text (and html) output (in what situation are those three used/useful?). Also, I think html output should have character entities at least for non-printable characters. As an aside, it would be nice to have JSON output with a choice whether to encode all non-ASCII characters through \u escapes or not. It's fine as it is now, mind you, since one can always process it with a tool like jq downstream, but it could be useful to have the choice in some situations.

trantor avatar Jan 06 '25 19:01 trantor