
Parquet: Using empty maps to work around unsupported empty records

philrz opened this issue 11 months ago

We're aware of this limitation of Parquet:

$ echo '{"foo": {}}' | super -f parquet -o data.parquet -
parquetio: unsupported type: empty record

We've noticed that other tools avoid this problem by using an empty map to represent this data, and it seems we might use the same trick.

Details

Repro is with super commit cf318c7.

I was reminded of this limitation recently while working with the JSONBench test data. I planned to test with all permutations of our supported input formats, but since Parquet is columnar with one schema per file, I thought to try running fuse on the data first.

$ wget https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_0001.json.gz

$ super -version
Version: cf318c79c

$ super -f parquet -o file_0001.parquet -c 'fuse' file_0001.json.gz
parquetio: unsupported type: empty record

Using the same data prep technique as the super command benchmarks, we find DuckDB avoids this problem if we take a round trip through a unified table, i.e., something like its equivalent of fuse.

$ duckdb --version
v1.3.2 (Ossivalis) 0b83e5d2f6

$ duckdb file_0001.db -c "CREATE TABLE file_0001 AS FROM read_json('file_0001.json.gz', union_by_name=true)"
100% ▕████████████████████████████████████████████████████████████▏ 

$ duckdb file_0001.db -c "COPY (FROM file_0001) TO 'file_0001.parquet'"
100% ▕████████████████████████████████████████████████████████████▏ 

$ SUPER_VAM=1 super -c 'count()' file_0001.parquet
1000000::uint64

If we take the same round trip with our original simplified data, we can see that DuckDB stores such an empty record in a table as a map type, and the map type is maintained when the table is written out as Parquet.

$ echo '{"foo": {}}' > data.json

$ duckdb data.db -c "CREATE TABLE data AS FROM read_json('data.json', union_by_name=true)"

$ duckdb data.db -c "DESC data;"
┌─────────────┬────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │    column_type     │  null   │   key   │ default │  extra  │
│   varchar   │      varchar       │ varchar │ varchar │ varchar │ varchar │
├─────────────┼────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ foo         │ MAP(VARCHAR, JSON) │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴────────────────────┴─────────┴─────────┴─────────┴─────────┘

$ duckdb data.db -c "COPY (FROM data) TO 'data.parquet'"

$ super -S data.parquet 
{
    foo: |{
    }|::|{string:bytes}|
}

And of course, our JSON writer already outputs empty maps as empty JSON objects, so if we treated the data this way it would survive a round trip back to its original JSON form unmodified.

$ super -f json data.parquet 
{
    "foo": {}
}

I'll also point out that for a non-empty record, DuckDB does not use a map type, so this does feel like a special case.

$ cat data-non-empty-record.json
{"foo": {"bar": "baz"}}

$ duckdb non-empty.db -c "CREATE TABLE nonempty AS FROM read_json('data-non-empty-record.json', union_by_name=true)"

$ duckdb non-empty.db -c "DESC nonempty;"
┌─────────────┬─────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │     column_type     │  null   │   key   │ default │  extra  │
│   varchar   │       varchar       │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ foo         │ STRUCT(bar VARCHAR) │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────────────┴─────────┴─────────┴─────────┴─────────┘

philrz avatar Mar 07 '25 17:03 philrz