Vector query of SuperDB-generated Parquet is much slower
A query performed in vector runtime has been observed to run much slower if against a Parquet file that was output by super itself when compared to querying against a Parquet file output by DuckDB. Both Parquet files are based on the same original CSV data.
Details
Repro is with super commit 8e86ac0 and DuckDB v1.2.0 5f5512b827.
The repro example below uses the mgbench bench1 CSV data and a simplified version of the bench1/q2 query.
To attempt to transform the CSV into data in what I'd expect to be equivalent Parquet from both tools, I use the following shaper.spq with SuperDB:
type bench1 = {
log_time:time,
machine_name:string,
machine_group:string,
cpu_idle:float32,
cpu_nice:float32,
cpu_system:float32,
cpu_user:float32,
cpu_wio:float32,
disk_free:float32,
disk_total:float32,
part_max_used:float32,
load_fifteen:float32,
load_five:float32,
load_one:float32,
mem_buffers:float32,
mem_cached:float32,
mem_free:float32,
mem_shared:float32,
swap_free:float32,
bytes_in:float32,
bytes_out:float32
}
shape(this, bench1)
And the following ddl_duckdb.sql with DuckDB:
CREATE TABLE bench1 (
log_time TIMESTAMP NOT NULL,
machine_name VARCHAR(25) NOT NULL,
machine_group VARCHAR(15) NOT NULL,
cpu_idle FLOAT,
cpu_nice FLOAT,
cpu_system FLOAT,
cpu_user FLOAT,
cpu_wio FLOAT,
disk_free FLOAT,
disk_total FLOAT,
part_max_used FLOAT,
load_fifteen FLOAT,
load_five FLOAT,
load_one FLOAT,
mem_buffers FLOAT,
mem_cached FLOAT,
mem_free FLOAT,
mem_shared FLOAT,
swap_free FLOAT,
bytes_in FLOAT,
bytes_out FLOAT
);
Creating a Parquet file with super:
$ super -version
Version: v1.18.0-268-g8e86ac0f
$ super -f parquet -o bench1-super.parquet -I shaper.spq bench1.csv
$ ls -l bench1-super.parquet
-rw-r--r-- 1 phil staff 992646437 Feb 6 13:10 bench1-super.parquet
Creating a Parquet file with DuckDB:
$ duckdb -version
v1.2.0 5f5512b827
$ duckdb tmp.db < ddl_duckdb.sql
$ duckdb tmp.db 'COPY bench1 FROM "bench1.csv"'
$ duckdb tmp.db -c 'COPY (FROM bench1) TO "bench1-duckdb.parquet"'
$ ls -l bench1-duckdb.parquet
-rw-r--r-- 1 phil staff 779327374 Feb 6 13:14 bench1-duckdb.parquet
Aside from the ~27% difference in file size, the only other difference I observe is that the file from DuckDB has a named type arrow_timestamp_us for the log_time field, but all other types are the same between the two.
$ super -Z -c 'head 1 | yield typeof(this)' bench1-super.parquet
<{
log_time: time,
machine_name: string,
machine_group: string,
cpu_idle: float32,
cpu_nice: float32,
cpu_system: float32,
cpu_user: float32,
cpu_wio: float32,
disk_free: float32,
disk_total: float32,
part_max_used: float32,
load_fifteen: float32,
load_five: float32,
load_one: float32,
mem_buffers: float32,
mem_cached: float32,
mem_free: float32,
mem_shared: float32,
swap_free: float32,
bytes_in: float32,
bytes_out: float32
}>
$ super -Z -c 'head 1 | yield typeof(this)' bench1-duckdb.parquet
<{
log_time: arrow_timestamp_us=time,
machine_name: string,
machine_group: string,
cpu_idle: float32,
cpu_nice: float32,
cpu_system: float32,
cpu_user: float32,
cpu_wio: float32,
disk_free: float32,
disk_total: float32,
part_max_used: float32,
load_fifteen: float32,
load_five: float32,
load_one: float32,
mem_buffers: float32,
mem_cached: float32,
mem_free: float32,
mem_shared: float32,
swap_free: float32,
bytes_in: float32,
bytes_out: float32
}>
Running a query on my Intel-based Macbook in the vector runtime against both Parquet files, we see the completion time against the one that came from SuperDB is ~4x slower.
$ SUPER_VAM=1 time super -c "from 'bench1-super.parquet'
| where
(grep(cslab*, machine_name) or grep (mslab*, machine_name))
and load_one is null
and log_time >= 2017-01-10T00:00:00Z" > /dev/null
20.80 real 84.20 user 11.10 sys
$ SUPER_VAM=1 time super -c "from 'bench1-duckdb.parquet'
| where
(grep(cslab*, machine_name) or grep (mslab*, machine_name))
and load_one is null
and log_time >= 2017-01-10T00:00:00Z" > /dev/null
5.32 real 13.83 user 1.37 sys