Differentiating null value and empty string in CSV output
In CSV output, right now SuperDB represents null values and empty strings the same. Other systems render these differently, and we might want to think about doing something similar.
Details
Repro is with super commit 2b862a5.
Given input data.json:
{"ThisISNull":null, "ThisIsEmpty":"", "ThisIString":"baz", "TheWordNullInString": "NULL"}
When rendering this as CSV, super currently shows nothing for both the null and empty string values.
$ super -version
Version: v1.18.0-449-g2b862a5e0
$ super -f csv data.json
ThisISNull,ThisIsEmpty,ThisIString,TheWordNullInString
,,baz,NULL
Whereas some SQL-based systems show:
$ duckdb --version
v1.2.2 7c039464e4
$ duckdb -csv -c "SELECT * FROM data.json"
ThisISNull,ThisIsEmpty,ThisIString,TheWordNullInString
NULL,"",baz,NULL
$ clickhouse --version
ClickHouse local version 25.4.2.31 (official build).
$ clickhouse --query "SELECT * FROM 'data.json'" --format csv
\N,"","baz","NULL"
Apparently in the ClickHouse case the "\N" is their way of trying to differentiate between what had been a NULL value in the table vs. a string that happens to contain the text "NULL". They have other settings that will vary this behavior, but that showed what they do by default.
CSV is notoriously difficult because it means distilling detailed data down to a very limited format, so I respect there's probably no one correct way to go here. However, the fact that other systems have chosen to offer different outputs in this situation suggests that users could benefit from SuperDB doing the same.
I happened to bump into this because running the ClickBench query 16 produces NULL values for some systems but the "nothing" for SuperDB, so my benchmark scripts that "diffs" CSV-based query outputs to make sure they're the same picked up on this as a material difference. I can put a workaround in my diff script to treat these as equivalent, but a change to SuperDB itself may also be justified.
In reviewing this issue, @nwt correctly pointed out that the systems that have been around longer offer knobs that could effectively have them do the same null treatment we showed above with SuperDB, e.g.:
$ clickhouse --query "SELECT * FROM 'data.json' SETTINGS format_csv_null_representation = ''" --format csv
,"","baz","NULL"
$ duckdb -nullvalue "" -csv -c "SELECT * FROM data.json"
ThisISNull,ThisIsEmpty,ThisIString,TheWordNullInString
,"",baz,NULL
...and that at some point we'll surely add similar knobs so users could do the same in reverse and make us match the behavior of tools like DuckDB or ClickHouse if that's what they prefer. Fair enough.
Personally, I'm of the opinion that default behaviors still matter since in my experience many users lack the patience to dig for knobs. Therefore, independent of the topic of knobs, I feel like some eventual review of our default behaviors here might be justified, with my opinion being what ClickHouse did above is perhaps the most sensible default. But I'm speaking as but one user there.