Parquet: implement efficient attribute and spatial filtering for datasets opened with ArrowDataset

Open rouault opened this issue 1 year ago • 3 comments

This applies to Parquet datasets made of multiple files opened from a directory name, or to a single Parquet file opened with PARQUET:/path/to/my.parquet. (When a single .parquet file is opened without the PARQUET: prefix, OGR already decides manually which row groups to select based on statistics.)

This uses arrow::dataset::ScanBuilder::Filter() to translate OGR spatial and attribute filters down to the Arrow execution engine.
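To make the idea of statistics-based pruning concrete, here is a toy sketch in plain Python. It is neither GDAL nor Arrow code: `RowGroupStats`, the `select_row_groups` helper and the row-group extents are all hypothetical, and only the query values (`building_id = 2295742` and the two `-spat` windows) are taken from the benchmarks in this description. It only shows the general principle that a filter can discard whole row groups when their min/max statistics cannot match.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics (not a GDAL or Arrow type)."""
    id_min: int
    id_max: int
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def may_contain_id(rg: RowGroupStats, value: int) -> bool:
    # An equality filter can only match if the value lies within [min, max].
    return rg.id_min <= value <= rg.id_max


def may_intersect(rg: RowGroupStats, minx: float, miny: float,
                  maxx: float, maxy: float) -> bool:
    # Standard rectangle intersection test between the row-group extent
    # and the spatial filter window.
    return (rg.xmin <= maxx and rg.xmax >= minx and
            rg.ymin <= maxy and rg.ymax >= miny)


def select_row_groups(row_groups: List[RowGroupStats],
                      predicate: Callable[[RowGroupStats], bool]) -> List[int]:
    # Keep the indices of row groups that cannot be ruled out; rows inside
    # them still need to be tested individually afterwards.
    return [i for i, rg in enumerate(row_groups) if predicate(rg)]


# Made-up statistics for three row groups (illustrative values only).
ROW_GROUPS = [
    RowGroupStats(1, 1_000_000, 1_000_000.0, 4_700_000.0, 1_500_000.0, 5_300_000.0),
    RowGroupStats(1_000_001, 2_000_000, 1_500_000.0, 5_300_000.0, 1_850_000.0, 5_600_000.0),
    RowGroupStats(2_000_001, 3_000_000, 1_700_000.0, 5_600_000.0, 2_100_000.0, 6_200_000.0),
]

by_id = select_row_groups(ROW_GROUPS, lambda rg: may_contain_id(rg, 2295742))
by_small_bbox = select_row_groups(
    ROW_GROUPS, lambda rg: may_intersect(rg, 1818654, 5546189, 1818655, 5546190))
by_large_bbox = select_row_groups(
    ROW_GROUPS, lambda rg: may_intersect(rg, 1750445, 5812014, 1912866, 5906677))

print(by_id, by_small_bbox, by_large_bbox)
```

With these made-up extents each query keeps a single row group out of three. How much a real file benefits depends on how well its rows are clustered, which is presumably why the spatially sorted file in the second series of runs prunes better.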

  1. On a Parquet 1.0 WKB file, without a geometry bounding box column:
  • Without ArrowDataset, selecting a significant number of features:
$ time ogrinfo nz-building-outlines.parquet  -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real	0m1,905s
user	0m2,128s
sys	0m0,328s
  • With ArrowDataset, selecting a significant number of features:
$ time ogrinfo PARQUET:nz-building-outlines.parquet  -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real	0m1,974s
user	0m2,297s
sys	0m1,033s
  • Without ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch nz-building-outlines.parquet  -spat 1750445 5812014 1912866 5906677

real	0m1,587s
user	0m1,737s
sys	0m0,363s
  • With ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch PARQUET:nz-building-outlines.parquet  -spat 1750445 5812014 1912866 5906677

real	0m1,489s
user	0m1,599s
sys	0m1,019s
  • Without ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo nz-building-outlines.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m1,304s
user	0m1,605s
sys	0m0,337s
  • With ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo PARQUET:nz-building-outlines.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m1,463s
user	0m1,597s
sys	0m0,989s
  • Without ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo nz-building-outlines.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m1,063s
user	0m1,277s
sys	0m0,311s
  • With ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo PARQUET:nz-building-outlines.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m1,508s
user	0m1,289s
sys	0m0,969s
  2. On a Parquet 1.1 WKB file, with a geometry bounding box column, and geometries sorted with an RTree:
  • Without ArrowDataset, selecting a significant number of features:
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real	0m0,995s
user	0m1,054s
sys	0m0,181s
  • With ArrowDataset, selecting a significant number of features:
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677 -al -so -json -noextent | jq .layers[0].featureCount
471147

real	0m0,842s
user	0m1,237s
sys	0m0,298s
  • Without ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677

real	0m0,640s
user	0m0,671s
sys	0m0,225s
  • With ArrowDataset, selecting a significant number of features, using ArrowArray batch reading:
$ time bench_ogr_batch PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1750445 5812014 1912866 5906677

real	0m0,375s
user	0m0,771s
sys	0m0,301s
  • Without ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m0,310s
user	0m0,322s
sys	0m0,147s
  • With ArrowDataset, selecting just 1 feature by bbox
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -spat 1818654 5546189 1818655 5546190 -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m0,210s
user	0m0,304s
sys	0m0,145s
  • Without ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo nz-building-outlines_with_spi_sorted.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m0,911s
user	0m1,267s
sys	0m0,321s

  • With ArrowDataset, selecting just 1 feature by attribute filter
$ time ogrinfo PARQUET:nz-building-outlines_with_spi_sorted.parquet -where "building_id = 2295742" -ro -al -so -json -noextent | jq .layers[0].featureCount
1

real	0m0,570s
user	0m1,339s
sys	0m0,622s

So the results are mixed: they range from cases where performance is (slightly) worse with ArrowDataset to cases where it is about 40% faster. All of these runs used 4 threads.
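For readers who want to compare the runs side by side, the following short Python sketch computes the relative change in wall-clock ("real") time for each pair of runs. The timings are copied from the transcripts above; the helper names `parse_real` and `change_percent` and the run labels are illustrative. Each timing is a single measurement, so the percentages are indicative only.

```python
def parse_real(value: str) -> float:
    """Parse a `time` 'real' value such as '0m1,905s' (decimal comma) into seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds.replace(",", "."))


def change_percent(without: float, with_dataset: float) -> float:
    """Relative change in elapsed time; negative means ArrowDataset was faster."""
    return (with_dataset - without) / without * 100.0


# (label, real time without ArrowDataset, real time with ArrowDataset)
RUNS = [
    ("1.0 file, ogrinfo, large bbox", "0m1,905s", "0m1,974s"),
    ("1.0 file, batch read, large bbox", "0m1,587s", "0m1,489s"),
    ("1.0 file, ogrinfo, 1 feature by bbox", "0m1,304s", "0m1,463s"),
    ("1.0 file, ogrinfo, 1 feature by attribute", "0m1,063s", "0m1,508s"),
    ("1.1 sorted file, ogrinfo, large bbox", "0m0,995s", "0m0,842s"),
    ("1.1 sorted file, batch read, large bbox", "0m0,640s", "0m0,375s"),
    ("1.1 sorted file, ogrinfo, 1 feature by bbox", "0m0,310s", "0m0,210s"),
    ("1.1 sorted file, ogrinfo, 1 feature by attribute", "0m0,911s", "0m0,570s"),
]

for label, without, with_dataset in RUNS:
    delta = change_percent(parse_real(without), parse_real(with_dataset))
    print(f"{label}: {delta:+.1f}%")
```

The largest gain is the batch read on the sorted 1.1 file, which goes from 0,640s to 0,375s, roughly 41% less elapsed time.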

FYI @jorisvandenbossche @paleolimbot @kylebarron

rouault avatar May 14 '24 20:05 rouault

Coverage Status

coverage: 69.131% (+0.02%) from 69.108% when pulling 1dc79e374287e5224acf89c3a02825f692e92487 on rouault:parquet_dataset_enhancements into 9b9d3e35deba4c41e5b07558817a0569c937490d on OSGeo:master.

coveralls avatar May 15 '24 01:05 coveralls

@paleolimbot Thanks for the review

rouault avatar May 15 '24 15:05 rouault

Fixes https://github.com/OSGeo/gdal/issues/8263

rouault avatar May 16 '24 22:05 rouault