sybil
[vet] make a list of weaknesses in sybil
there are some places in sybil that i think could use improvement. this task is to list them and then improve them
accuracy:
- combining histograms between leaf nodes might lose accuracy, because the leaf nodes are not necessarily using histograms with the same size buckets.
- the "auto" histogram bucket size is based on the extents of all data in a column. if you use "auto" and filter to a subset with a much smaller range, most values land in only a few buckets and the histogram will be inaccurate. workaround: set the hist bucket size manually or use a log hist
- a large group by might have intermediate results pruned out during aggregation. the pruning limit is 1000 internal rows for a group of block specs (typically 4 - 8 blocks)
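
a minimal sketch of the "auto" bucket problem above, in go. the `hist` type, `newHist`, `medianOfSubset` and the 100-bucket count are all hypothetical names and numbers made up for illustration; this is not sybil's actual histogram code, just a fixed-width histogram that sizes its buckets from the extents it is given.

```go
package main

import "fmt"

// hist is a hypothetical fixed-width histogram (not sybil's real type).
type hist struct {
	min     int64
	width   int64
	buckets []int64
	count   int64
}

// newHist sizes buckets from the given extents, mimicking an "auto" bucket size.
func newHist(min, max int64, numBuckets int) *hist {
	width := (max - min) / int64(numBuckets)
	if width < 1 {
		width = 1
	}
	return &hist{min: min, width: width, buckets: make([]int64, numBuckets+1)}
}

// add counts v in its bucket, clamping out-of-range values to the edge buckets.
func (h *hist) add(v int64) {
	idx := (v - h.min) / h.width
	if idx < 0 {
		idx = 0
	}
	if last := int64(len(h.buckets) - 1); idx > last {
		idx = last
	}
	h.buckets[idx]++
	h.count++
}

// percentile returns the midpoint of the bucket that holds the p-th percentile.
func (h *hist) percentile(p float64) int64 {
	target := int64(p / 100 * float64(h.count))
	var seen int64
	for i, c := range h.buckets {
		seen += c
		if seen > target {
			return h.min + int64(i)*h.width + h.width/2
		}
	}
	return h.min
}

// medianOfSubset inserts the values 0..100 (the "filtered subset") into a
// histogram whose buckets were sized from the extents colMin..colMax.
func medianOfSubset(colMin, colMax int64) int64 {
	h := newHist(colMin, colMax, 100)
	for v := int64(0); v <= 100; v++ {
		h.add(v)
	}
	return h.percentile(50)
}

func main() {
	fmt.Println("buckets sized from full column extents:", medianOfSubset(0, 1000000))
	fmt.Println("buckets sized for the subset:", medianOfSubset(0, 100))
}
```

with buckets sized from a 0..1,000,000 column, every value in the subset falls into the first bucket and the reported median is that bucket's midpoint (5000). with buckets sized for the subset itself the median comes out as 50. the same effect is why merging leaf histograms with different bucket sizes can lose accuracy: the coarser bucketing wins.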
safety:
- writing a new block of data involves loading all data from the unfinished block and then re-saving it all (instead of appending). this is easier / safer with gob, but maybe not as fast
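
a rough sketch of the load-everything-then-re-save pattern described above, using `encoding/gob`. the `block` struct, file name and helper names are hypothetical, and the write-to-temp-file-then-rename step is my assumption about one way to keep the rewrite safe; the note above does not say sybil does this.

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
	"path/filepath"
)

// block is a hypothetical stand-in for an unfinished block of records.
type block struct {
	Records []int64
}

// loadBlock decodes the whole block; a missing file is treated as an empty block.
func loadBlock(path string) (block, error) {
	var b block
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return b, nil
	}
	if err != nil {
		return b, err
	}
	defer f.Close()
	err = gob.NewDecoder(f).Decode(&b)
	return b, err
}

// saveBlock re-encodes the whole block into a temp file, then renames it over
// the old file (assumption: this is what makes a full rewrite the safer option).
func saveBlock(path string, b block) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), "block-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename has succeeded
	if err := gob.NewEncoder(tmp).Encode(b); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// appendRecords loads all existing records, appends the new ones, and re-saves
// everything. the cost grows with the size of the unfinished block.
func appendRecords(path string, recs ...int64) (int, error) {
	b, err := loadBlock(path)
	if err != nil {
		return 0, err
	}
	b.Records = append(b.Records, recs...)
	if err := saveBlock(path, b); err != nil {
		return 0, err
	}
	return len(b.Records), nil
}

// demo appends twice to a block in a scratch directory and returns the final count.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "blockdemo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "unfinished.gob")
	if _, err := appendRecords(path, 1, 2, 3); err != nil {
		return 0, err
	}
	return appendRecords(path, 4, 5)
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```

the trade-off: every write pays for decoding and re-encoding the full block, but there is never a half-appended gob stream on disk. a true append would be cheaper per write but needs its own handling for partial writes.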
memory:
- a large group by with log hists can blow memory up
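
a back-of-envelope sketch of why this blows up, in go. the function name and all the numbers are illustrative assumptions (one int64 counter per bucket, no map or slice overhead), not measurements from sybil.

```go
package main

import "fmt"

// estimateHistBytes is a hypothetical estimator: it assumes every group keeps
// histsPerGroup histograms and every histogram keeps bucketsPerHist int64
// counters. real overhead (maps, slice headers, group keys) comes on top.
func estimateHistBytes(groups, histsPerGroup, bucketsPerHist int64) int64 {
	const bytesPerBucket = 8
	return groups * histsPerGroup * bucketsPerHist * bytesPerBucket
}

func main() {
	// illustrative numbers only: 1M groups, 1 hist each, 1000 buckets per hist
	fmt.Println(estimateHistBytes(1000000, 1, 1000), "bytes")
}
```

under these assumptions a group by with one million groups and a 1000-bucket hist per group needs about 8 GB for bucket counters alone, so memory scales with groups times buckets rather than with the number of rows.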