Add a tool to check if row groups `.min` / `.max` are strictly increasing within a parquet file
Add a tool to check whether the row group `.min` / `.max` statistics for a particular column (e.g. `url_surtkey`) are strictly increasing within a particular parquet file or collection of parquet files. See the README for more information and limitations. In particular, this does not check whether the rows are sorted, only that the row group min/max values within a single parquet file are strictly increasing. The tool is intended to help check for #12.
- [x] Initial implementation
- [x] Unit tests
- [x] GitHub workflow
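For readers skimming the PR, the check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names `first_out_of_order` and `row_group_ranges` are hypothetical, and it assumes "strictly increasing" means each row group's min is strictly greater than the previous row group's max. The second function reads the statistics through pyarrow's parquet metadata and skips row groups that carry no statistics.

```python
from typing import Iterable, List, Optional, Tuple

Range = Tuple[object, object]  # (min, max) statistics of one row group


def first_out_of_order(ranges: Iterable[Range]) -> Optional[int]:
    """Return the index of the first row group whose min is not strictly
    greater than the previous row group's max, or None if all are in order.
    (Assumed interpretation of "strictly increasing"; hypothetical helper.)"""
    prev_max = None
    for i, (lo, hi) in enumerate(ranges):
        if prev_max is not None and not lo > prev_max:
            return i
        prev_max = hi
    return None


def row_group_ranges(path: str, column: str) -> List[Range]:
    """Collect per-row-group (min, max) statistics for one column."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    meta = pq.ParquetFile(path).metadata
    ranges: List[Range] = []
    for rg in range(meta.num_row_groups):
        group = meta.row_group(rg)
        for c in range(group.num_columns):
            chunk = group.column(c)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            # Statistics may be absent for a row group; this sketch skips
            # such row groups instead of failing.
            if stats is not None and stats.has_min_max:
                ranges.append((stats.min, stats.max))
    return ranges
```

With a local file (the name `example.parquet` is a placeholder), `first_out_of_order(row_group_ranges("example.parquet", "url_surtkey"))` would return `None` when the row groups are in order and the offending row group index otherwise. Only metadata is read, so nothing is learned about whether the rows inside a row group are sorted.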
TBD: is it expected that urls are sorted in between the parquet files, ie should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?
I have updated the title and description to better correspond with what the tool does.
TBD: is it expected that urls are sorted in between the parquet files, ie should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?
Determined: sorting across parquet files is not intended, i.e. part-00001.max may be out of order w.r.t. part-00002.min
@damian0815 This is waiting on @sebastian-nagel to re-review with your changes, correct?
@jenenglish that's correct yes
This is a little late, but if you want to support local files, s3, and https, please use the smart_open package. Don't roll your own.
... and to contradict myself, turns out that fsspec is a better choice than smart_open. @damian0815 I think this is almost ready to ship if you make these few minor changes.
@sebastian-nagel thank you for the example of overly-long values causing problems! For this particular situation I'm happy to ignore the lack of statistics, as long as it's rare.
@damian0815 this PR is ready for a revision