datapusher-plus icon indicating copy to clipboard operation
datapusher-plus copied to clipboard

Low cardinality auto indexing

Open EricSoroos opened this issue 8 months ago • 5 comments

Describe the bug

Auto index generation on cardinality is backwards.

https://github.com/dathere/datapusher-plus/blob/master/datapusher/jobs.py#L1728

    # if a column's cardinality <= AUTO_INDEX_THRESHOLD, create an index for that column

Low cardinality indexes on postgresql aren't useful, because the planner will almost never choose them.

The planner will choose a table scan instead of an index scan if it is likely that many/most pages will be read. When you have a low cardinality index, eg a bool, the planner is going to assume that 1/2 of rows will be hit, and therefore essentially all pages. Even if you've got a distribution that's really skewed, the planner will generally not choose that index. (if you do have a known skewed distribution, you can do a partial/multicolumn index where the condition is the rare case, but that's not really applicable for a general purpose tool. )

This should be reversed, and set the default threshold to ~10 at least.

EricSoroos avatar Jun 10 '24 18:06 EricSoroos