dask-sql
[BUG] read_parquet filters don't seem to be applied as passed in CREATE TABLE statements
I'm trying to use CREATE TABLE ... WITH (... filters=[...]) on a Parquet dataset to get row-group filtering based on the filters supplied in the CREATE TABLE statement, but the filters don't seem to be passed through to the Dask read_parquet call successfully.
Apologies, I don't have a minimal reproducer here; I haven't yet figured out how to create multiple row groups in a Parquet file without a "real" dataset, and I may just not be passing the filters argument correctly.
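(In case it helps anyone put a reproducer together, something like the following should, I think, produce a single Parquet file containing several row groups via pyarrow's row_group_size argument; the column name 'dom', the value range, and the output path are made up for illustration.)

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# 31,000 sorted rows -> with row_group_size=5000, dom == 31 only lands in the
# final row group, so filters=[[('dom', '==', 31)]] should be able to prune the rest
df = pd.DataFrame({'dom': sorted(list(range(1, 32)) * 1000)})
pq.write_table(pa.Table.from_pandas(df), '/tmp/taxi_synthetic.parquet', row_group_size=5000)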
import dask.dataframe as dd
from dask_sql import Context

c = Context()

# reading with filters directly through Dask prunes the row groups as expected
sorted_filtered_ddf = dd.read_parquet(
    '/data/taxi_sorted',
    split_row_groups=True,
    filters=[[('dom', '==', 31)]],
)
sorted_filtered_ddf.npartitions
# 8

# registering that DataFrame with dask-sql also works as expected
c.create_table('taxi_sorted_filtered', sorted_filtered_ddf)
c.sql("select * from taxi_sorted_filtered where dom = 31").npartitions
# 8
# this does not
c.sql("""
    CREATE OR REPLACE TABLE taxi_sorted_filtered WITH (
        location = '/data/taxi_sorted/*.parquet',
        gpu = True,
        split_row_groups = False,
        filters = ARRAY [ ARRAY [
            ('dom', '==', 31)
        ]]
    )
""")
c.sql("select * from taxi_sorted_filtered where dom = 31").npartitions
# 231
I'm expecting to see only 8 partitions in the DataFrame returned by the last SELECT; instead I see the partition count of the full dataset (231).
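As a possible workaround, registering the table from Python rather than SQL might behave like the first snippet; this assumes Context.create_table forwards extra keyword arguments such as filters and split_row_groups on to dd.read_parquet, which I haven't verified.

# workaround sketch (unverified assumption: create_table forwards extra
# keyword arguments like filters/split_row_groups to dd.read_parquet)
c.create_table(
    'taxi_sorted_filtered',
    '/data/taxi_sorted/*.parquet',
    format='parquet',
    gpu=True,
    split_row_groups=True,
    filters=[[('dom', '==', 31)]],
)
c.sql("select * from taxi_sorted_filtered where dom = 31").npartitions
# hoping for 8 here, matching the direct dd.read_parquet call above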
Related to #375.