[vet] make a list of weaknesses in sybil

Open okayzed opened this issue 6 years ago • 0 comments

there are some places that i think could use improvement. this task is to list them and then improve them

accuracy:

  • combining histograms between leaf nodes might lose accuracy, because the leaf nodes are not necessarily using histograms with the same size buckets.
  • the "auto" histogram size is based on the extents of all data in a column. if you use "auto" and filter to a subset with a smaller range, the histogram will be inaccurate. to work around this, set the hist bucket size manually or use a log hist
  • a large group by might have intermediate results pruned out during aggregation. the pruning limit is 1000 internal rows for a group of block specs (typically 4 - 8 blocks)
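
to make the first two accuracy points concrete, here is a minimal sketch. it assumes a simple fixed-width histogram; `Hist`, `NewHist`, `AddN`, `MergeInto` and `Percentile` are hypothetical names for illustration, not sybil's actual types or functions. it shows how buckets sized for a column's full extent (0 to 1000) give a poor median for a filtered subset (0 to 9), and how merging a fine histogram into a coarse one throws the resolution away.

```go
package main

import "fmt"

// Hist is a hypothetical fixed-width histogram (not sybil's real type).
type Hist struct {
	Min, Width int64
	Counts     []int64
}

// NewHist sizes the buckets from the given extents, like an "auto" setting would.
func NewHist(min, max int64, buckets int) *Hist {
	width := (max - min + int64(buckets) - 1) / int64(buckets) // round up
	if width < 1 {
		width = 1
	}
	return &Hist{Min: min, Width: width, Counts: make([]int64, buckets)}
}

// AddN records n occurrences of v, clamping out-of-range values to the edge buckets.
func (h *Hist) AddN(v, n int64) {
	i := int((v - h.Min) / h.Width)
	if i < 0 {
		i = 0
	}
	if i >= len(h.Counts) {
		i = len(h.Counts) - 1
	}
	h.Counts[i] += n
}

// MergeInto re-buckets src into dst using each source bucket's midpoint.
// If dst has wider buckets than src, the finer detail in src is lost.
func MergeInto(dst, src *Hist) {
	for i, c := range src.Counts {
		if c > 0 {
			dst.AddN(src.Min+int64(i)*src.Width+src.Width/2, c)
		}
	}
}

// Percentile returns the midpoint of the bucket that holds the p-th percentile.
func (h *Hist) Percentile(p int64) int64 {
	var total int64
	for _, c := range h.Counts {
		total += c
	}
	if total == 0 {
		return h.Min
	}
	target := (total*p + 99) / 100
	if target < 1 {
		target = 1
	}
	var seen int64
	for i, c := range h.Counts {
		seen += c
		if seen >= target {
			return h.Min + int64(i)*h.Width + h.Width/2
		}
	}
	return h.Min + int64(len(h.Counts))*h.Width
}

func main() {
	wide := NewHist(0, 1000, 10) // buckets sized for the whole column
	narrow := NewHist(0, 10, 10) // buckets sized for the filtered subset
	for v := int64(0); v < 10; v++ {
		wide.AddN(v, 1)
		narrow.AddN(v, 1)
	}
	fmt.Println("p50 with column-wide buckets:", wide.Percentile(50))
	fmt.Println("p50 with subset-sized buckets:", narrow.Percentile(50))

	merged := NewHist(0, 1000, 10)
	MergeInto(merged, narrow) // fine buckets collapse into one coarse bucket
	fmt.Println("p50 after merging into coarse buckets:", merged.Percentile(50))
}
```

with the values 0 through 9, the column-wide histogram puts everything into its first bucket, so any percentile comes back as that bucket's midpoint. a manually sized bucket (or a log hist) avoids this because the bucket width matches the range of the filtered data.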

safety:

  • writing a new block of data involves loading all data from the unfinished block and then re-saving it all (instead of appending). this is easier / safer with gob, but maybe not as fast
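
the load-and-resave pattern looks roughly like the sketch below. this is an illustration using `encoding/gob` on an in-memory byte slice, not sybil's actual block format; `Record`, `appendByRewrite` and `decodeBlock` are made-up names. the block is kept as a single gob-encoded value, so adding records means decoding everything, appending in memory, and encoding the whole thing again.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Record is a hypothetical row type (not sybil's real record layout).
type Record struct {
	Time int64
	Host string
}

// decodeBlock reads every record out of a gob-encoded block.
// An empty block decodes to no records.
func decodeBlock(stored []byte) ([]Record, error) {
	if len(stored) == 0 {
		return nil, nil
	}
	var records []Record
	err := gob.NewDecoder(bytes.NewReader(stored)).Decode(&records)
	return records, err
}

// appendByRewrite loads the whole unfinished block, appends the new records
// in memory, and re-encodes everything as one fresh gob value.
// The cost grows with the size of the block, not with the size of extra.
func appendByRewrite(stored []byte, extra []Record) ([]byte, error) {
	records, err := decodeBlock(stored)
	if err != nil {
		return nil, err
	}
	records = append(records, extra...)

	var out bytes.Buffer
	if err := gob.NewEncoder(&out).Encode(records); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

func main() {
	block, err := appendByRewrite(nil, []Record{{Time: 1, Host: "a"}, {Time: 2, Host: "b"}})
	if err != nil {
		panic(err)
	}
	block, err = appendByRewrite(block, []Record{{Time: 3, Host: "c"}})
	if err != nil {
		panic(err)
	}
	records, err := decodeBlock(block)
	if err != nil {
		panic(err)
	}
	fmt.Println("records in block:", len(records))
}
```

the tradeoff is the one described above: the stored block is always one complete gob value that a single decoder can read back, at the price of rewriting the whole block on every write.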

memory:

  • a large group by with log hists can blow memory up

okayzed • May 24 '18 02:05