Add a tool to check if row groups `.min` / `.max` are strictly increasing within a parquet file

Open damian0815 opened this issue 2 months ago • 7 comments

Add a tool to check whether the row group `.min` / `.max` statistics for a particular column (e.g. `url_surtkey`) are strictly increasing within a particular parquet file or collection of parquet files. See the README for more information and limitations. In particular, this does not check whether the rows are sorted, only that the row group min/max values within a single parquet file are strictly increasing. The tool is intended to help check for #12.

  • [x] Initial implementation
  • [x] Unit tests
  • [x] GitHub workflow

damian0815 avatar Oct 29 '25 13:10 damian0815
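
For readers who have not looked at the tool itself, the check described above could be sketched roughly as follows. This is not the PR's implementation: the function names `row_group_min_max` and `first_violation` are hypothetical, and "strictly increasing" is read here as "each row group's min and max are both greater than the previous row group's", which is an assumption. The sketch also assumes `pyarrow` is installed and simply raises when a row group has no statistics.

```python
from typing import Any, List, Optional, Sequence, Tuple


def row_group_min_max(path: str, column: str) -> List[Tuple[Any, Any]]:
    """Collect (min, max) for `column` from each row group's footer metadata."""
    # Third-party dependency, imported lazily so the checker below stays pure Python.
    import pyarrow.parquet as pq

    metadata = pq.ParquetFile(path).metadata
    result = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            if chunk.path_in_schema != column:
                continue
            stats = chunk.statistics
            if stats is None or not stats.has_min_max:
                # Assumption: treat missing statistics as an error in this sketch.
                raise ValueError(f"row group {rg_index} has no min/max for {column!r}")
            result.append((stats.min, stats.max))
            break
    return result


def first_violation(stats: Sequence[Tuple[Any, Any]]) -> Optional[int]:
    """Return the index of the first row group that breaks the ordering, else None."""
    for i in range(1, len(stats)):
        prev_min, prev_max = stats[i - 1]
        cur_min, cur_max = stats[i]
        # Assumed reading of "strictly increasing": both min and max must grow.
        if cur_min <= prev_min or cur_max <= prev_max:
            return i
    return None
```

Only footer metadata is read, so this says nothing about whether the rows inside each row group are sorted, which matches the limitation noted in the description.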

TBD: is it expected that URLs are sorted across parquet files, i.e. should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?

damian0815 avatar Oct 29 '25 14:10 damian0815

I have updated the title and description to better correspond with what the tool does.

damian0815 avatar Oct 31 '25 14:10 damian0815

> TBD: is it expected that URLs are sorted across parquet files, i.e. should max in part-00001-....gz.parquet always be <= min in part-00002-....gz.parquet?

Determined: this is not intended, i.e. the max of part-00001 may be out of order with respect to the min of part-00002.

damian0815 avatar Oct 31 '25 14:10 damian0815

@damian0815 This is waiting on @sebastian-nagel to re-review with your changes, correct?

jenenglish avatar Nov 12 '25 18:11 jenenglish

@jenenglish that's correct yes

damian0815 avatar Nov 13 '25 22:11 damian0815

This is a little late, but, if you want to support local files, s3, and https, please use the smart_open package. Don't roll your own.

wumpus avatar Nov 20 '25 17:11 wumpus

... and to contradict myself, turns out that fsspec is a better choice than smart_open. @damian0815 I think this is almost ready to ship if you make these few minor changes.

wumpus avatar Dec 11 '25 18:12 wumpus
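
To illustrate the fsspec suggestion, here is a minimal sketch of reading parquet footer metadata through a single code path for local paths, `s3://` and `https://` URLs. The helper name `open_parquet_metadata` is hypothetical and not taken from the PR. It assumes `fsspec` and `pyarrow` are installed; remote schemes typically need an extra backend as well (for example `s3fs` for S3).

```python
def open_parquet_metadata(url: str, **storage_options):
    """Return the parquet footer metadata for a local path or remote URL."""
    # Both imports are third-party; kept inside the function so that importing
    # this module does not require them.
    import fsspec
    import pyarrow.parquet as pq

    # fsspec picks the filesystem from the URL scheme; storage_options are
    # forwarded to that backend (e.g. anon=True for anonymous S3 access).
    with fsspec.open(url, mode="rb", **storage_options) as handle:
        # ParquetFile parses the footer on construction, so the metadata
        # remains usable after the handle is closed.
        return pq.ParquetFile(handle).metadata
```

A call might look like `open_parquet_metadata("s3://some-bucket/some-file.parquet", anon=True)`, where the bucket and key are placeholders.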

@sebastian-nagel thank you for the example of overly-long values causing problems! For this particular situation I'm happy to ignore the lack of statistics, as long as it's rare.

wumpus avatar Dec 22 '25 04:12 wumpus
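
One way to ignore missing statistics "as long as it's rare" is to skip row groups without min/max, count them, and fail only when they exceed a threshold. The sketch below is an assumption about how that could look, not what the PR does: the name `check_ignoring_missing`, the `max_missing_fraction` parameter and its default of 1% are all hypothetical. Row groups without statistics are represented as `None`.

```python
from typing import Any, Optional, Sequence, Tuple


def check_ignoring_missing(
    stats: Sequence[Optional[Tuple[Any, Any]]],
    max_missing_fraction: float = 0.01,  # hypothetical threshold for "rare"
) -> Tuple[bool, int]:
    """Check ordering over row groups that have statistics; report how many lack them."""
    present = [s for s in stats if s is not None]
    missing = len(stats) - len(present)

    # Compare each row group with statistics against the previous one that has them.
    ordered = all(
        cur[0] > prev[0] and cur[1] > prev[1]
        for prev, cur in zip(present, present[1:])
    )

    # Fail when missing statistics are more common than the allowed fraction.
    too_many_missing = len(stats) > 0 and missing / len(stats) > max_missing_fraction
    return ordered and not too_many_missing, missing
```

Returning the missing count alongside the verdict lets a caller log how often statistics were absent instead of silently skipping them.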

@damian0815 this PR is ready for a revision

wumpus avatar Dec 22 '25 04:12 wumpus