Vector query of SuperDB-generated Parquet is much slower

Open philrz opened this issue 1 year ago • 0 comments

A query performed in vector runtime has been observed to run much slower if against a Parquet file that was output by super itself when compared to querying against a Parquet file output by DuckDB. Both Parquet files are based on the same original CSV data.

Details

Repro is with super commit 8e86ac0 and DuckDB v1.2.0 5f5512b827.

The repro example below uses the mgbench bench1 CSV data and a simplified version of the bench1/q2 query.

To attempt to transform the CSV into data in what I'd expect to be equivalent Parquet from both tools, I use the following shaper.spq with SuperDB:

type bench1 = {
  log_time:time,
  machine_name:string,
  machine_group:string,
  cpu_idle:float32,
  cpu_nice:float32,
  cpu_system:float32,
  cpu_user:float32,
  cpu_wio:float32,
  disk_free:float32,
  disk_total:float32,
  part_max_used:float32,
  load_fifteen:float32,
  load_five:float32,
  load_one:float32,
  mem_buffers:float32,
  mem_cached:float32,
  mem_free:float32,
  mem_shared:float32,
  swap_free:float32,
  bytes_in:float32,
  bytes_out:float32
}
shape(this, bench1)

And the following ddl_duckdb.sql with DuckDB:

CREATE TABLE bench1 (
  log_time      TIMESTAMP NOT NULL,
  machine_name  VARCHAR(25) NOT NULL,
  machine_group VARCHAR(15) NOT NULL,
  cpu_idle      FLOAT,
  cpu_nice      FLOAT,
  cpu_system    FLOAT,
  cpu_user      FLOAT,
  cpu_wio       FLOAT,
  disk_free     FLOAT,
  disk_total    FLOAT,
  part_max_used FLOAT,
  load_fifteen  FLOAT,
  load_five     FLOAT,
  load_one      FLOAT,
  mem_buffers   FLOAT,
  mem_cached    FLOAT,
  mem_free      FLOAT,
  mem_shared    FLOAT,
  swap_free     FLOAT,
  bytes_in      FLOAT,
  bytes_out     FLOAT
);

Creating a Parquet file with super:

$ super -version
Version: v1.18.0-268-g8e86ac0f

$ super -f parquet -o bench1-super.parquet -I shaper.spq bench1.csv 

$ ls -l bench1-super.parquet
-rw-r--r--  1 phil  staff  992646437 Feb  6 13:10 bench1-super.parquet

Creating a Parquet file with DuckDB:

$ duckdb -version
v1.2.0 5f5512b827

$ duckdb tmp.db < ddl_duckdb.sql
$ duckdb tmp.db 'COPY bench1 FROM "bench1.csv"'
$ duckdb tmp.db -c 'COPY (FROM bench1) TO "bench1-duckdb.parquet"'

$ ls -l bench1-duckdb.parquet 
-rw-r--r--  1 phil  staff  779327374 Feb  6 13:14 bench1-duckdb.parquet

Aside from the ~27% difference in file size, the only other difference I observe is that the file from DuckDB has a named type arrow_timestamp_us for the log_time field, but all other types are the same between the two.

$ super -Z -c 'head 1 | yield typeof(this)' bench1-super.parquet
<{
    log_time: time,
    machine_name: string,
    machine_group: string,
    cpu_idle: float32,
    cpu_nice: float32,
    cpu_system: float32,
    cpu_user: float32,
    cpu_wio: float32,
    disk_free: float32,
    disk_total: float32,
    part_max_used: float32,
    load_fifteen: float32,
    load_five: float32,
    load_one: float32,
    mem_buffers: float32,
    mem_cached: float32,
    mem_free: float32,
    mem_shared: float32,
    swap_free: float32,
    bytes_in: float32,
    bytes_out: float32
}>

$ super -Z -c 'head 1 | yield typeof(this)' bench1-duckdb.parquet 
<{
    log_time: arrow_timestamp_us=time,
    machine_name: string,
    machine_group: string,
    cpu_idle: float32,
    cpu_nice: float32,
    cpu_system: float32,
    cpu_user: float32,
    cpu_wio: float32,
    disk_free: float32,
    disk_total: float32,
    part_max_used: float32,
    load_fifteen: float32,
    load_five: float32,
    load_one: float32,
    mem_buffers: float32,
    mem_cached: float32,
    mem_free: float32,
    mem_shared: float32,
    swap_free: float32,
    bytes_in: float32,
    bytes_out: float32
}>

Running a query on my Intel-based Macbook in the vector runtime against both Parquet files, we see the completion time against the one that came from SuperDB is ~4x slower.

$ SUPER_VAM=1 time super -c "from 'bench1-super.parquet'
| where
    (grep(cslab*, machine_name) or grep (mslab*, machine_name))
    and load_one is null
    and log_time >= 2017-01-10T00:00:00Z" > /dev/null
       20.80 real        84.20 user        11.10 sys

$ SUPER_VAM=1 time super -c "from 'bench1-duckdb.parquet'
| where
    (grep(cslab*, machine_name) or grep (mslab*, machine_name))
    and load_one is null
    and log_time >= 2017-01-10T00:00:00Z" > /dev/null
        5.32 real        13.83 user         1.37 sys

Feb 06 '25 21:02 philrz