noisepage
Port Peloton Optimizer Statistics Code
It is time for us to start bringing over @allisonwang's amazing stats code into the new system. Here is an assessment of what needs to happen. I am categorizing the files into three groups:
- Files that we can bring over with only code-style changes.
- Files that we can bring over but with major modifications to their underlying storage.
- Files that we will defer bringing over for now.
The last category of files is blocked on classes that we need to port from the optimizer first (e.g., GroupExpression, Memo).
The plan is to port the approximation/estimation data structures first. We will then port over the old StatsStorage classes but temporarily update them to either use in-memory data structures or switch to the SQLTable storage back-end. We will hold off on porting any of the collection/calculator/sampling code for now.
We need to make sure that we bring over the testing code as well.
Let's push directly to the statistics branch on the main repo.
Straight Copy
- [x] optimizer/stats/count_min_sketch.h
- [x] optimizer/stats/histogram.h
- [X] optimizer/stats/hyperloglog.h <-- Have to bring over third_party/libcount
- [ ] optimizer/stats/selectivity.h
- [X] optimizer/stats/top_k_elements.h
- [x] optimizer/stats/value_condition.h
Major Modifications
- [ ] optimizer/stats/column_stats.h
- [ ] optimizer/stats/stats_storage.h
- [ ] optimizer/stats/stats_util.h
- [ ] optimizer/stats/table_stats.h
- [ ] optimizer/stats/tuple_sample.h
Deferred
These are some notes to help understand the important points of the code in stats_storage:
CreateStatsTableInCatalog
Creates a table of column stats in the catalog
InsertOrUpdateTableStats
Can add/update column stats in a table. Iterates through each column stat in the table stats:
- gets cardinality of column stat
- gets frac_null of column stat (fraction of null values over total # of values)
- gets an array of the most common <value, freq> pairs (numeric values only); this requires getting the top k elements
- gets an array of boundary points for the histogram of the column
- converts <value, freq> pairs into pairs of strings (separated by commas), then separates those pairs into one string containing all the values and one string containing all the frequencies
- converts histogram points into a string (separated by commas)
- gets column name of column stat
- gets a bool indicating whether the column stat has an index
- calls InsertOrUpdateColumnStats with above values
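The per-column steps above can be sketched roughly as follows. This is a minimal illustration, not the Peloton code: the ColumnStatsRow struct, its field names, and the JoinWithCommas and BuildColumnStatsRow helpers are all hypothetical, and the sketch assumes the <value, freq> pairs and histogram bounds are already available as doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row layout for one column's stats. The field names are
// illustrative and are not taken from the Peloton catalog schema.
struct ColumnStatsRow {
  std::string column_name;
  std::size_t cardinality;
  double frac_null;               // fraction of null values over total # of values
  std::string most_common_vals;   // comma-separated values
  std::string most_common_freqs;  // comma-separated frequencies
  std::string histogram_bounds;   // comma-separated boundary points
  bool has_index;
};

// Hypothetical helper: joins the elements of a vector with commas.
template <typename T>
std::string JoinWithCommas(const std::vector<T> &values) {
  std::ostringstream os;
  for (std::size_t i = 0; i < values.size(); i++) {
    if (i != 0) os << ",";
    os << values[i];
  }
  return os.str();
}

// Hypothetical helper: gathers the per-column values described in the notes
// and serializes the arrays into strings, ready to be passed on to something
// like InsertOrUpdateColumnStats.
ColumnStatsRow BuildColumnStatsRow(const std::string &column_name, std::size_t cardinality, double frac_null,
                                   const std::vector<std::pair<double, double>> &most_common,
                                   const std::vector<double> &histogram_bounds, bool has_index) {
  assert(frac_null >= 0.0 && frac_null <= 1.0);
  // Split the <value, freq> pairs into one list of values and one of frequencies.
  std::vector<double> vals;
  std::vector<double> freqs;
  for (const auto &entry : most_common) {
    vals.push_back(entry.first);
    freqs.push_back(entry.second);
  }
  return ColumnStatsRow{column_name,
                        cardinality,
                        frac_null,
                        JoinWithCommas(vals),
                        JoinWithCommas(freqs),
                        JoinWithCommas(histogram_bounds),
                        has_index};
}
```

Storing the arrays as comma-separated strings keeps the stats row flat, at the cost of parsing the strings again whenever the stats are read back.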
InsertOrUpdateColumnStats
Updates or adds column stat in the catalog table
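If the ported StatsStorage temporarily uses in-memory data structures, the insert-or-update behavior might look something like the sketch below. Everything here is an assumption for illustration: the InMemoryStatsStorage class, the StoredColumnStats struct, and the (table id, column id) key are hypothetical and do not reflect the actual Peloton or new-system interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical subset of the per-column stats; not the real schema.
struct StoredColumnStats {
  std::size_t cardinality;
  double frac_null;
  bool has_index;
};

// Hypothetical in-memory stand-in for the catalog-backed stats table.
class InMemoryStatsStorage {
 public:
  // Adds the stats if the column has no entry yet, otherwise overwrites the
  // existing entry. Returns true on insert and false on update.
  bool InsertOrUpdateColumnStats(std::uint32_t table_id, std::uint32_t column_id, const StoredColumnStats &stats) {
    assert(stats.frac_null >= 0.0 && stats.frac_null <= 1.0);
    const auto key = std::make_pair(table_id, column_id);
    auto it = stats_.find(key);
    if (it == stats_.end()) {
      stats_.emplace(key, stats);
      return true;
    }
    it->second = stats;
    return false;
  }

  // Returns a pointer to the stored stats, or nullptr if none exist.
  const StoredColumnStats *GetColumnStats(std::uint32_t table_id, std::uint32_t column_id) const {
    auto it = stats_.find(std::make_pair(table_id, column_id));
    return it == stats_.end() ? nullptr : &it->second;
  }

  std::size_t Size() const { return stats_.size(); }

 private:
  // Keyed by (table id, column id); the key choice is an assumption.
  std::map<std::pair<std::uint32_t, std::uint32_t>, StoredColumnStats> stats_;
};
```

A map like this is only a placeholder; moving to the SQLTable storage back-end would replace the map lookup with a table scan or index lookup behind the same interface.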