duckdb-web
duckdb-web copied to clipboard
Pitfalls of JSON auto-detection underdocumented
The documentation at https://duckdb.org/docs/data/json/overview and https://duckdb.org/docs/extensions/json#json-table-functions offer insight on how to use "auto" import to load JSON data into DuckDB.
This feature can be a blessing - or a curse:
I just ran into a scenario where a specific set of single-object JSON files, when loaded via read_json_auto, would cause very confusing data output way down a great many view layers.
The ultimate root cause here was that:
a) DuckDB is incredibly good at making "some sense" of data, fit it into a table, and then just operate on that
b) apparently depending on the specific order of files being read, the ingested data type may be unexpected (in my case: JSON instead of string/varchar)
Because DuckDB is so good at eating anything (ducks are omnivores, after all!), any surprises from data import auto-magic will only be detected somewhat late. The recourse is obvious: specify a fixed struct column format (the documentation of which is also a tad on the short side), see below.
I believe DuckDB could benefit a lot from an expansion on when and how (not) to rely on JSON auto-detection.
This example statement imports some specific Avro encoded JSON documents into DuckDB:
create table rubberduck as
select * from read_json('all-my_*.json', format='unstructured', columns = { value: '
struct(
id string,
moreid struct(
string string
),
something_complex struct(
f1 struct(
string string
),
f2 struct(
string string
),
f3 struct(
string string
),
f4 struct(
string string
),
f5 string
),
buyer string,
beware string,
of string,
yellow struct(
string string
),
ducks struct(
string string
)
)
'
});