roapi icon indicating copy to clipboard operation
roapi copied to clipboard

Support partition pruning

Open jychen7 opened this issue 2 years ago • 4 comments

Background

assume we have

table/year=2022/month=03/day=20/log.parquet
table/year=2022/month=03/day=21/log.parquet

consider the query

select count(1) from table where year = '2022' and month = '03' and day = '20'

Actual

The above query will scan all parquet files, instead of only one (necessary)

currently, roapi supports reading all parquet files in a directory

tables:
  - name: "blogs"
    uri: "table/"
    option:
      format: "parquet"
      use_memory_table: false
    schema:
       # columns: [] # can ignore if table source support schema infer, e.g. csv, parquet, etc

Expect

The above query will scan only one parquet file

however, since partition column is not in parquet schema, one idea is to improve roapi config to

tables:
  - name: "blogs"
    uri: "table/"
    option:
      format: "parquet"
      use_memory_table: false
    schema:
      # columns: [] # can ignore if table source support schema infer, e.g. csv, parquet, etc
      partitions:
        - name: "year"
           data_type: "Utf8"
        - name: "month"
           data_type: "Utf8"
        - name: "day"
           data_type: "Utf8"

Reference

  1. https://github.com/apache/arrow-datafusion/issues/2061 (2022, Datafusion has file re-structure at late 2022, just for reference)
  2. https://github.com/roapi/roapi/blob/51e01ef968bb289c9ab70b7d0a4e2c8ae68b88f1/columnq/src/table/mod.rs#L50-L54
  3. https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/datasource/listing/table.rs#L318-L324

jychen7 avatar Mar 04 '23 14:03 jychen7

Definitely a good feature to add :+1:

houqp avatar Mar 06 '23 08:03 houqp

@jychen7 shouldn't such partition directories be inferred automatically as Spark does instead of manually supplying them?

chitralverma avatar Apr 24 '23 16:04 chitralverma

@chitralverma Yes, it is a good idea to support auto-detect partitions. On the other hand, it may be also reasonable to declare partitions manually for non-hive style partitions. E.g.

table/2022/03/20/log.parquet
table/2022/03/21/log.parquet

jychen7 avatar Jun 05 '23 02:06 jychen7

We also have some tables that are created outside of spark with non-hive style partitions, so being able to provide a custom partition scheme would be very useful to us.

houqp avatar Jun 05 '23 04:06 houqp