
Perhaps omit file-level statistics from footer

Open gatesn opened this issue 8 months ago • 2 comments

If we introduce #2990, then we may be able to serve file-level statistics from a layout, rather than having a special segment in the footer...

gatesn avatar Apr 17 '25 15:04 gatesn

We should leave them in for now since they're very useful.

But I do think they're under-specified as to whether they cover only top-level columns or also include nested columns.

gatesn avatar May 22 '25 15:05 gatesn

Currently they take one of two shapes. For a struct: [["col_a_max", "col_a_min", ...], ..., ["col_z_max", ...]]. For a non-struct: ["max", "min", ...].

I think maybe we should use

StructArray([ "self_stat_$": StructArray(["min", "max", ...]), "stat_$col_a": ..., ..., "stat_$col_z": ... ])

Here every field name is prefixed with an identifier, "stat_$" (maybe this should be shorter), and there is a special identifier, "self_stat_$", for the container's own stats. Since every field is prefixed with either "stat_$" or "self_stat_$", there cannot be naming clashes.
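
To make the naming argument concrete, here is a rough sketch of the scheme in Python. It is not Vortex code: plain dicts stand in for StructArray fields, and `stats_layout`, `STAT_PREFIX` and `SELF_STAT_KEY` are hypothetical names used only for illustration.

```python
# Hypothetical sketch: plain dicts stand in for StructArray fields.
STAT_PREFIX = "stat_$"         # prefix for per-column stats (assumed spelling)
SELF_STAT_KEY = "self_stat_$"  # reserved field for the container's own stats


def stats_layout(self_stats, column_stats):
    """Build the proposed field layout for a container's statistics.

    self_stats: stats of the container itself, e.g. {"min": ..., "max": ...}
    column_stats: mapping of child column name -> that column's stats
    """
    layout = {SELF_STAT_KEY: dict(self_stats)}
    for name, stats in column_stats.items():
        # Every column field starts with "stat_$", so it can never equal
        # the reserved "self_stat_$" field, whatever the column is called.
        layout[STAT_PREFIX + name] = stats
    return layout


# Even a column literally named "self_stat_$" gets its own distinct field.
layout = stats_layout(
    {"null_count": 0},
    {"col_a": {"min": 1, "max": 9}, "self_stat_$": {"min": 0, "max": 3}},
)
print(sorted(layout))
```

The point is only that the prefix is applied unconditionally, so distinct column names map to distinct field names and none of them can collide with the container's field.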

joseph-isaacs avatar May 22 '25 16:05 joseph-isaacs

I think your comment refers to https://github.com/vortex-data/vortex/issues/1835 @joseph-isaacs. That said, I don't think it's a wire-break since the zoned layout can later choose to support stats for struct dtypes (currently we defer to the stats accumulator that will say there are no stats for structs).

In terms of the file-level stats flatbuffer though, we should define whether this means root fields, or nested fields.

gatesn avatar May 29 '25 07:05 gatesn