vscode_rainbow_csv icon indicating copy to clipboard operation
vscode_rainbow_csv copied to clipboard

RBQL (JavaScript) always triple-quotes quoted string results

Open anthroid opened this issue 3 years ago • 4 comments

I have a data set I need to manipulate that is in double-quoted CSV format:

"foo A1","foo B1","foo C1"
"foo A2","foo B2","foo C2"

If I do a basic RBQL query, the output comes back without quotes:

SELECT a1, a2, a3

> foo A1,foo B1,foo C1
> foo A2,foo B2,foo C2

I need to return this manipulated data set with the quotes still intact, so I tried putting them back:

SELECT '"' + a1 + '"', '"' + a2 + '"', '"' + a3 + '"'

> """foo A1""","""foo B1""","""foo C1"""
> """foo A2""","""foo B2""","""foo C2"""

SELECT
a1.replace(/^(\w)/, '"$1').replace(/(\w)$/, '$1"'),
a2.replace(/^(\w)/, '"$1').replace(/(\w)$/, '$1"'),
a3.replace(/^(\w)/, '"$1').replace(/(\w)$/, '$1"')

> """foo A1""","""foo B1""","""foo C1"""
> """foo A2""","""foo B2""","""foo C2"""

I also tried creating a UDF:

function quote(value) {
    return '"' + value + '"';
}

// Verify it works:
console.log("Result: " + quote("foo A1"))
> Result: "foo A1"

// Try in RBQL:
SELECT quote(a1), quote(a2), quote(a3)

> """foo A1""","""foo B1""","""foo C1"""
> """foo A2""","""foo B2""","""foo C2"""

It seems the RBQL engine is replacing any instance of " within the resulting value string with """. I can only get it to produce output without quotes, or with triple quotes. Is there any way around this?

If it's a parsing-level issue that's difficult to work around on a query or JS level (which would be understandable), could there be an option? Something like:

  • [ ] Enable double-quoted output values

I can fix the data afterward of course, but it would be much better to have RBQL either handle the quotes or add them if an option is selected.

anthroid avatar Apr 09 '22 22:04 anthroid

I agree this can be solved by adding one or two configuration options i.e. "don't auto-quote output columns (list of columns)" and/or "always escape output fields (list of columns to always escape)". The reason it works this way is the ambiguity with CSV format itself. E.g. it is not clear whether the double quotes in this sample are part of the data or are added just for escaping:

XF1,93%
NP-76,85%
STK-NP-88,94%
"Super foobar",91%
R2D2,33%
"Mega fizzbuzz",55%

By looking at the file we can guess that in this case, double quotes are probably part of the data. On the other hand if we had just these 2 lines sampled from the file:

"Super foobar",91%
"Mega fizzbuzz",55%

We would probably assume that the double quotes were added just for escaping. To avoid this ambiguity when writing output to CSV files RBQL always assumes double quotes to be part of the data and therefore additionally escapes the double quotes (adds single double quotes around the field and duplicates all double-quotes inside the field resulting in triple double quotes on both sides).

The only current workaround is to set the output format as TSV instead of CSV. But adding a configuration option seems like a great proposal, thanks!

mechatroner avatar Apr 10 '22 04:04 mechatroner

Maybe a 3-way option?

Escape double-quoted output fields: Never|Always|Automatically

I can see how the '(list of columns)' part would be nice, but it might be cumbersome to have to specify the individual (or all) columns for each query.

Regarding the option 'Enable double-quoted output values', the idea wasn't to adjust how RBQL is interpreting the values internally, but rather to request that it add quotes to all output values. This would avoid the need to do things like:

SELECT '"' + a1 + '"'

even in cases where the original data set did not contain quotes, but you want or need them in the resulting CSV output.

anthroid avatar Apr 10 '22 06:04 anthroid

The 3-way option sounds great! Although "Always" without column numbers won't be useful in many practical cases with some columns being integers/numbers I think it is very untypical to quote numeric columns while quoting all string columns just in case seems to be more widespread. Besides having this option in the interface (GUI and CLI) it is also possible to make it part of the query - RBQL already allows to add with (headers) to indicate that the first line should be handled as a header, but it is also possible to implement with (alwaysquote) or with (alwaysquote 2,3) (always quote only columns 2 and 3) or with (neverquote). The default mode would still be the current one: with (autoquote).

mechatroner avatar Apr 10 '22 16:04 mechatroner

That sounds awesome and would certainly be a flexible way to control this behavior!

anthroid avatar Apr 10 '22 22:04 anthroid