Feature Request: Add Approximate Mode/Frequent Items Support Using DataSketches’ FrequentItemsSketch
Hi team
I’m using the excellent DuckDB datasketches extension for large-scale analytics use cases. One common requirement in our datasets is to compute mode() (the most frequent item) per group, but DuckDB’s built-in exact mode() function leads to high memory usage, or even out-of-memory (OOM) failures, when applied to large, high-cardinality datasets.
Feature Request
Please consider adding support for approximate mode estimation using FrequentItemsSketch from Apache DataSketches.
Why is this useful?
- mode() is commonly needed in aggregations over grouped data, e.g.:
  SELECT x, y, mode(z) FROM table GROUP BY x, y;
- On large datasets (e.g., 30M+ rows, 1K+ groups), the exact mode() leads to memory exhaustion.
- Approximate mode with bounded error would be a great tradeoff and fits well into the sketch philosophy.
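To illustrate the kind of tradeoff meant by "bounded error", here is a minimal Python sketch of the classic Misra-Gries frequent-items summary. This is not the DataSketches FrequentItemsSketch implementation and not anything the extension currently exposes; the names misra_gries, approx_mode, and k are made up for this example. With at most k - 1 counters over a stream of n items, each retained count is underestimated by at most n / k, and any item occurring more than n / k times is guaranteed to stay in the summary.

```python
def misra_gries(stream, k):
    """Summarize a stream using at most k - 1 counters (Misra-Gries).

    Each returned count underestimates the true frequency by at most
    len(stream) / k. Hypothetical helper for illustration only.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def approx_mode(stream, k):
    """Return the candidate with the largest retained count, or None."""
    counters = misra_gries(stream, k)
    if not counters:
        return None
    return max(counters, key=counters.get)


# Usage: "a" occurs 6 times out of 10, well above n / k = 10 / 3.
stream = ["a"] * 6 + ["b", "c", "d", "e"]
print(approx_mode(stream, k=3))  # a
```

Memory stays fixed at k - 1 counters per group regardless of cardinality, which is the property that would help in the GROUP BY case above. The answer is only reliable when the true mode is frequent enough to exceed the n / k error bound.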
References
- Frequent Items Sketches documentation
- Open issue with exact mode
Hi @chitralverma,
Thanks for the thoughtful feature request and for your kind words about the DuckDB datasketches extension — we’re glad to hear it’s proving useful for your large-scale analytics workflows.
We agree that approximate mode estimation via FrequentItemsSketch would be a valuable addition, especially for high-cardinality use cases where exact mode() is impractical. That said, we want to be transparent that this extension is just one part of a larger roadmap, and at the moment, this particular feature is not at the top of our current priorities.
However, if this functionality is urgent for your team or organization, we do offer paid consulting and development services. This helps us prioritize specific features like this one and accelerates their delivery. If you’re interested in exploring that option, feel free to reach out to us at [email protected].
Thanks again for engaging with the project — and we hope to continue improving it with input like yours.
Best, The Query.Farm Team