approx/accurate vs. sampled/not, intuition?

Open jowens opened this issue 11 years ago • 2 comments

Something that would give me a little better intuition about what I'm querying could be put in a little 2x2 explanatory chart:

select count from table
select approx_count from table
select count from sampled_table
select approx_count from sampled_table

Are all four of these meaningful? What do these queries actually mean? If you had table and sampled_table at the top and count and approx_count on the left, and put in one sentence for each at their intersection in the chart, I think that would aid understanding.

Aug 30 '13 18:08 jowens

This is great, thanks. Before I add this, the answer is:

select count from table: The number of rows in table, just like normal SQL.
select approx_count from table: Mostly meaningless. As an aside: with the right blinkdb.sample.size and blinkdb.dataset.size, this could be meaningful. BlinkDB will assume that table is itself a random sample from some larger table, and give you error bars accordingly. In the limit (as the dataset size goes to infinity), this gives you error bars as if your dataset were a random sample from "nature". But generally, don't use this.
select count from sampled_table: sampled_table is implemented as a normal table that is a random subset of table, so you can run whatever queries you want on it. In this case you are getting its size, which should be approximately 1% of table's size, assuming you used samplewith 0.01 to create it.
select approx_count from sampled_table: Approximate the value of the query select count from table, using the data in sampled_table.
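
To make the four cases concrete, here is a rough Python sketch of the idea, not BlinkDB's actual implementation. It assumes the sample is drawn by keeping each row independently with probability 0.01 (Bernoulli sampling) and uses a simple normal-approximation error bar. The helper name approx_count, the 100,000-row toy table, and the z=1.96 multiplier are all made up for illustration.

```python
import math
import random


def approx_count(sample_rows, sampling_rate, predicate=lambda row: True, z=1.96):
    """Estimate how many rows of the full table match `predicate`,
    given only a uniform random sample taken at `sampling_rate`.

    Returns (estimate, margin), where margin is a rough 95% error bar.
    Assumption: each row of the full table was kept independently with
    probability `sampling_rate` (Bernoulli sampling).
    """
    matches = sum(1 for row in sample_rows if predicate(row))
    # Scale the sample count back up to the size of the full table.
    estimate = matches / sampling_rate
    # Under Bernoulli sampling, Var(matches) = M * p * (1 - p), where M is the
    # true matching count; plugging in M ~ matches / p gives this standard error.
    std_err = math.sqrt(matches * (1 - sampling_rate)) / sampling_rate
    return estimate, z * std_err


random.seed(0)
table = list(range(100_000))  # toy stand-in for `table`
rate = 0.01                   # plays the role of `samplewith 0.01`
sampled_table = [row for row in table if random.random() < rate]

exact = len(table)                # select count from table
sample_size = len(sampled_table)  # select count from sampled_table
estimate, margin = approx_count(sampled_table, rate)  # select approx_count from sampled_table

print("count(table)                =", exact)
print("count(sampled_table)        =", sample_size)
print("approx_count(sampled_table) = %.0f +/- %.0f" % (estimate, margin))
```

The remaining case, approx_count on the unsampled table, corresponds to calling approx_count(table, 1.0): the estimate equals the exact count and the error bar collapses to zero, which is one way to see why that query is "mostly meaningless" unless you tell the system that table is itself a sample of something larger.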

Aug 30 '13 19:08 henryem

Perfectly sensible, and I think helpful. Good luck on your Markdown!

Aug 30 '13 19:08 jowens
