iceberg-python

Enable stats collection for nested fields and use write.metadata.metrics.max-inferred-column-defaults to control stats growth

Open greenlaw opened this issue 2 months ago • 0 comments

Feature Request / Improvement

I recently discovered that full stats collection (i.e. lower_bounds/upper_bounds) is explicitly disabled in PyIceberg for nested (i.e. struct child) fields.

This change was made in this PR and specifically this commit.

It seems this change may have been made to limit the number of fields for which stats are collected when full stats collection is the default. However, after discussion, it seems that adding support for the write.metadata.metrics.max-inferred-column-defaults table property would be the preferred way to control stats growth. If that is implemented, re-enabling stats collection for nested fields should be a non-issue.

Stats collection for nested struct fields matters for schemas like GeoParquet, which store key primitive fields (in this case the bounding box values xmin, ymin, xmax, ymax) inside structs.
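To illustrate the idea, here is a minimal sketch of how a column cap could bound stats growth while still reaching nested fields. It is not PyIceberg code: the dict-based schema, the function name columns_with_inferred_metrics, and the breadth-first selection order are all assumptions made for this example.

```python
from collections import deque

# Hypothetical toy schema (not PyIceberg's Schema class): each key is a field
# name, each value is either a primitive type name or a nested dict (a struct).
SCHEMA = {
    "id": "long",
    "geometry": "binary",
    "bbox": {"xmin": "double", "ymin": "double", "xmax": "double", "ymax": "double"},
    "name": "string",
}


def columns_with_inferred_metrics(schema, max_inferred_columns):
    """Return dotted paths of the first N primitive fields, nested ones included.

    Fields are visited breadth-first, so top-level columns are picked before
    struct children. That ordering is an assumption of this sketch.
    """
    selected = []
    queue = deque([((), schema)])
    while queue and len(selected) < max_inferred_columns:
        prefix, struct = queue.popleft()
        for name, field_type in struct.items():
            path = prefix + (name,)
            if isinstance(field_type, dict):
                # Struct: defer its children until the current level is done.
                queue.append((path, field_type))
            elif len(selected) < max_inferred_columns:
                selected.append(".".join(path))
    return selected


if __name__ == "__main__":
    # With a cap of 5, only some of the bbox children receive stats.
    print(columns_with_inferred_metrics(SCHEMA, 5))
    # ['id', 'geometry', 'name', 'bbox.xmin', 'bbox.ymin']
```

With a cap large enough to cover the schema, all four bbox fields are selected, which is the case the GeoParquet use case needs. With a small cap, struct children are the first to be dropped under this ordering, so the limit still bounds metadata size for very wide or deeply nested schemas.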

See also this slack thread for discussion.

greenlaw, Nov 04 '25 17:11