roapi
roapi copied to clipboard
Support partition pruning
Background
assume we have
table/year=2022/month=03/day=20/log.parquet
table/year=2022/month=03/day=21/log.parquet
consider the query
select count(1) from table where year = '2022' and month = '03' and day = '20'
Actual
The above query will scan all parquet files, instead of only one (necessary)
currently, roapi supports reading all parquet files in a directory
tables:
- name: "blogs"
uri: "table/"
option:
format: "parquet"
use_memory_table: false
schema:
# columns: [] # can ignore if table source support schema infer, e.g. csv, parquet, etc
Expect
The above query will scan only one parquet file
however, since partition column is not in parquet schema, one idea is to improve roapi config to
tables:
- name: "blogs"
uri: "table/"
option:
format: "parquet"
use_memory_table: false
schema:
# columns: [] # can ignore if table source support schema infer, e.g. csv, parquet, etc
partitions:
- name: "year"
data_type: "Utf8"
- name: "month"
data_type: "Utf8"
- name: "day"
data_type: "Utf8"
Reference
- https://github.com/apache/arrow-datafusion/issues/2061 (2022, Datafusion has file re-structure at late 2022, just for reference)
- https://github.com/roapi/roapi/blob/51e01ef968bb289c9ab70b7d0a4e2c8ae68b88f1/columnq/src/table/mod.rs#L50-L54
- https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/datasource/listing/table.rs#L318-L324
Definitely a good feature to add :+1:
@jychen7 shouldn't such partition directories be inferred automatically as Spark does instead of manually supplying them?
@chitralverma Yes, it is a good idea to support auto-detect partitions. On the other hand, it may be also reasonable to declare partitions manually for non-hive style partitions. E.g.
table/2022/03/20/log.parquet
table/2022/03/21/log.parquet
We also have some tables that are created outside of spark with non-hive style partitions, so being able to provide a custom partition scheme would be very useful to us.