Column footprints and capabilities in segmentMetadata query

Open gianm opened this issue 2 years ago • 3 comments

For understanding and optimizing data footprint, it's valuable to know the footprint (in bytes) and capabilities (dictionary encoded, has index, etc) of each column on a per-segment basis and also on an aggregated basis across an entire datasource.

Today, the main way people get footprint information is by reading meta.smoosh files. A meta.smoosh file is a CSV descriptor that is part of the physical segment package, and it looks like the following. Each row (other than the first) represents a column part and has four fields: column name, column part number, start offset, and end offset. The footprint in bytes of a column is the fourth field minus the third (end offset minus start offset).

v1,2147483647,1
__time,0,0,5466
channel,0,5727,8512
isRobot,0,5466,5727
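
For reference, the manual calculation described above can be sketched in a few lines of Python. This is only an illustration, not part of Druid: the function name `column_footprints` is hypothetical, the sketch assumes the four-field row layout shown in the sample, and it sums rows that share a name in case a column is split across several parts. A real meta.smoosh may also list entries that are not columns, which this sketch does not filter out.

```python
from collections import defaultdict


def column_footprints(meta_smoosh_text):
    """Return {name: bytes} computed as (end offset - start offset) per row.

    Hypothetical helper for illustration; assumes each data row is
    "name,part,start,end" as in the sample above.
    """
    footprints = defaultdict(int)
    lines = meta_smoosh_text.strip().splitlines()
    for line in lines[1:]:  # skip the first row, e.g. "v1,2147483647,1"
        if not line.strip():
            continue
        # Split from the right so the last three fields are always numeric.
        name, _part, start, end = line.rsplit(",", 3)
        footprints[name] += int(end) - int(start)
    return dict(footprints)


sample = """v1,2147483647,1
__time,0,0,5466
channel,0,5727,8512
isRobot,0,5466,5727
"""

for name, size in sorted(column_footprints(sample).items()):
    print(f"{name}: {size} bytes")
```

For the sample above, this gives 5466 bytes for __time, 2785 for channel, and 261 for isRobot.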

Instead of needing to dig through meta.smoosh files, we'd like users to be able to use segmentMetadata queries (https://druid.apache.org/docs/latest/querying/segmentmetadataquery/) to retrieve the column footprints. For merge: false, users would get column footprints from each segment individually. For merge: true, users would get total footprint for each column across all segments.
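
To make the intended merge semantics concrete, here is a rough Python sketch. The query body uses the existing segmentMetadata fields (queryType, dataSource, intervals, merge); everything about footprints is an assumption, since the feature does not exist yet. In particular, the per-segment `{column: bytes}` dictionaries, the helper names, and the datasource name "wikipedia" are hypothetical stand-ins for whatever the response would eventually carry.

```python
from collections import Counter


def build_segment_metadata_query(datasource, intervals, merge):
    """Build a segmentMetadata query body (existing fields only)."""
    return {
        "queryType": "segmentMetadata",
        "dataSource": datasource,
        "intervals": intervals,
        "merge": merge,
    }


def merge_footprints(per_segment_footprints):
    """Sum per-column footprints across segments.

    Hypothetical: models what merge: true would report, given the
    per-segment {column: bytes} maps that merge: false would return.
    """
    total = Counter()
    for footprints in per_segment_footprints:
        total.update(footprints)  # adds byte counts column by column
    return dict(total)


query = build_segment_metadata_query(
    "wikipedia", ["2023-01-01/2024-01-01"], merge=True
)

# Made-up per-segment results, as merge: false might return them.
segments = [
    {"__time": 5466, "channel": 2785, "isRobot": 261},
    {"__time": 4000, "channel": 1500},
]
print(merge_footprints(segments))
```

A column missing from some segments (isRobot in the second segment here) simply contributes nothing for that segment, which seems like the natural behavior for the merged total.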

On the capabilities side, certain capabilities, like hasMultipleValues, are exposed in segmentMetadata queries today. However, they aren't all exposed. We should expose all of them, so people can see which columns have dictionaries, which dictionaries are sorted, etc.

As a follow-up to this feature, we could add column footprint and capability to system tables in SQL. System tables are built on top of segmentMetadata queries, so adding the fields to segmentMetadata is the first step. (Note that adding to segmentMetadata is also valuable by itself, since users can issue segmentMetadata queries directly. The SQL would be merely a convenience.)

gianm, Nov 16 '23 20:11

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot], Aug 23 '24 00:08

I think this would be satisfied by #16132 once I get back to it, hopefully before too long, but I have a few things I'm planning on doing first.

clintropolis, Aug 23 '24 01:08

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot], May 31 '25 00:05