Parquet: implement efficient attribute and spatial filtering for datasets opened with ArrowDataset
This applies to Parquet datasets made of multiple files opened from a directory name, or to a single Parquet file opened with PARQUET:/path/to/my.parquet. (When a single .parquet file is opened without the PARQUET: prefix, OGR already decides on its own which row groups to select, based on statistics.)
This uses arrow::dataset::ScanBuilder::Filter() to translate OGR spatial and attribute filters down to the Arrow execution engine.
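For readers unfamiliar with the idea, here is a minimal sketch of what a bounding-box filter amounts to once it is pushed down to row-group level. It is plain Python, not the Arrow or GDAL API: the `BBox` class, the `select_row_groups` helper and the row-group extents are all hypothetical, and only the `-spat` window is taken from the runs below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: BBox, b: BBox) -> bool:
    """Return True when the two rectangles share at least one point."""
    return (
        a.xmin <= b.xmax
        and a.xmax >= b.xmin
        and a.ymin <= b.ymax
        and a.ymax >= b.ymin
    )


def select_row_groups(row_group_extents, spatial_filter):
    """Return the indices of row groups that may contain matching features."""
    return [
        i
        for i, extent in enumerate(row_group_extents)
        if intersects(extent, spatial_filter)
    ]


# Hypothetical extents, as could be derived from the min/max statistics
# of a bounding box column (assumed values, not read from the real file).
row_groups = [
    BBox(1700000, 5400000, 1760000, 5500000),
    BBox(1750000, 5800000, 1920000, 5910000),
    BBox(1800000, 5540000, 1830000, 5550000),
]

# Same window as "-spat 1818654 5546189 1818655 5546190".
spat = BBox(1818654, 5546189, 1818655, 5546190)
selected = select_row_groups(row_groups, spat)
print(selected)
```

Only the row groups returned here would need to be decoded. The same four comparisons, expressed on the bounding box fields, are the kind of predicate a scanner-level filter can evaluate without OGR touching each feature.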
- On a Parquet 1.0 WKB file, without a geometry bounding box column:
- Without ArrowDataset, selecting a significant number of features:
$ time ogrinfo nz-building-outlines.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147
real 0m1,905s
user 0m2,128s
sys 0m0,328s
- With ArrowDataset, selecting a significant number of features:
$ time ogrinfo PARQUET:nz-building-outlines.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147
real 0m1,974s
user 0m2,297s
sys 0m1,033s
- Without ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch nz-building-outlines.parquet -spat 1750445 5812014 1912866 5906677
real 0m1,587s
user 0m1,737s
sys 0m0,363s
- With ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch PARQUET:nz-building-outlines.parquet -spat 1750445 5812014 1912866 5906677
real 0m1,489s
user 0m1,599s
sys 0m1,019s
- Without ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo nz-building-outlines.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1
real 0m1,304s
user 0m1,605s
sys 0m0,337s
- With ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo PARQUET:nz-building-outlines.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1
real 0m1,463s
user 0m1,597s
sys 0m0,989s
- Without ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo nz-building-outlines.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1
real 0m1,063s
user 0m1,277s
sys 0m0,311s
- With ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo PARQUET:nz-building-outlines.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1
real 0m1,508s
user 0m1,289s
sys 0m0,969s
- On a Parquet 1.1 WKB file, with a geometry bounding box column, and geometries sorted with an R-tree:
- Without ArrowDataset, selecting a significant number of features:
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147
real 0m0,995s
user 0m1,054s
sys 0m0,181s
- With ArrowDataset, selecting a significant number of features:
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147
real 0m0,842s
user 0m1,237s
sys 0m0,298s
- Without ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677
real 0m0,640s
user 0m0,671s
sys 0m0,225s
- With ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677
real 0m0,375s
user 0m0,771s
sys 0m0,301s
- Without ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1
real 0m0,310s
user 0m0,322s
sys 0m0,147s
- With ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1
real 0m0,210s
user 0m0,304s
sys 0m0,145s
- Without ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1
real 0m0,911s
user 0m1,267s
sys 0m0,321s
- With ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1
real 0m0,570s
user 0m1,339s
sys 0m0,622s
So the results are mixed: in some cases performance is (slightly) worse with ArrowDataset, in others it is up to about 40% faster in wall-clock time. All of this is with 4 threads.
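To make the comparison easier to read, the relative change in wall-clock time can be computed from the "real" values above. This is only a convenience sketch: the labels and the `time_reduction_percent` helper are made up for illustration, and the numbers are copied from the runs above with the decimal comma replaced by a dot.

```python
# Wall-clock ("real") times in seconds, copied from the runs above,
# as (without ArrowDataset, with ArrowDataset).
timings = {
    "1.0 file, ogrinfo, many features": (1.905, 1.974),
    "1.0 file, batch reading, many features": (1.587, 1.489),
    "1.0 file, ogrinfo, 1 feature by bbox": (1.304, 1.463),
    "1.0 file, ogrinfo, 1 feature by attribute": (1.063, 1.508),
    "1.1 file, ogrinfo, many features": (0.995, 0.842),
    "1.1 file, batch reading, many features": (0.640, 0.375),
    "1.1 file, ogrinfo, 1 feature by bbox": (0.310, 0.210),
    "1.1 file, ogrinfo, 1 feature by attribute": (0.911, 0.570),
}


def time_reduction_percent(without: float, with_dataset: float) -> float:
    """Percentage of wall-clock time saved with ArrowDataset (negative = slower)."""
    return (without - with_dataset) / without * 100.0


for label, (without, with_dataset) in timings.items():
    change = time_reduction_percent(without, with_dataset)
    print(f"{label}: {change:+.1f}%")
```

These are single runs, so small differences are probably within noise.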
FYI @jorisvandenbossche @paleolimbot @kylebarron
coverage: 69.131% (+0.02%) from 69.108% when pulling 1dc79e374287e5224acf89c3a02825f692e92487 on rouault:parquet_dataset_enhancements into 9b9d3e35deba4c41e5b07558817a0569c937490d on OSGeo:master.
@paleolimbot Thanks for the review
Fixes https://github.com/OSGeo/gdal/issues/8263