approx/accurate vs. sampled/not, intuition?
Something that would give me a little better intuition about what I'm querying could be put in a little 2x2 explanatory chart:
select count from table
select approx_count from table
select count from sampled_table
select approx_count from sampled_table
Are all four of these meaningful? What do these queries actually mean? If you had table and sampled_table at the top and count and approx_count on the left, and put in one sentence for each at their intersection in the chart, I think that would aid understanding.
This is great, thanks. Before I add this, the answer is:
- `select count from table`: The number of rows in `table`, just like normal SQL.
- `select approx_count from table`: Mostly meaningless. As an aside: With the right `blinkdb.sample.size` and `blinkdb.dataset.size`, this could be meaningful - BlinkDB will assume that `table` is itself a random sample from some larger table, and give you error bars accordingly. In the limit (as the dataset size goes to infinity), this gives you error bars as if your dataset were a random sample from "nature". But generally, don't use this.
- `select count from sampled_table`: `sampled_table` is implemented as a normal table that is a random subset of `table`, so you can run whatever queries you want on it. In this case you are getting its size, which should be approximately 1% of `table`'s size, assuming you used `samplewith 0.01` to create it.
- `select approx_count from sampled_table`: Approximate the value of the query `select count from table`, using the data in `sampled_table`.
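To make the distinction concrete, here is a small Python simulation of three of the four cases. It is only a sketch and not how BlinkDB works internally. It assumes the sample is drawn by keeping each row independently with probability 0.01, and that the approximate count is the sample size scaled up by the sampling rate, with a binomial standard error as the error bar. The function name `four_cells` and its parameters are made up for illustration.

```python
import math
import random


def four_cells(table_size=100_000, rate=0.01, seed=0):
    """Simulate count / approx_count on a table and on a random sample of it.

    Assumption (not taken from BlinkDB): each row is kept independently
    with probability `rate`, i.e. Bernoulli sampling.
    """
    rng = random.Random(seed)
    table = range(table_size)
    sampled_table = [row for row in table if rng.random() < rate]

    # select count from table: the exact row count of the full table.
    count_from_table = len(table)

    # select count from sampled_table: the exact row count of the sample
    # itself, roughly rate * table_size.
    count_from_sample = len(sampled_table)

    # select approx_count from sampled_table: estimate the full table's
    # row count by scaling the sample size back up by the sampling rate.
    approx_count = count_from_sample / rate

    # Assumed error bar: standard error of the scaled binomial estimate.
    std_error = math.sqrt(count_from_sample * (1 - rate)) / rate

    return {
        "count_from_table": count_from_table,
        "count_from_sample": count_from_sample,
        "approx_count_from_sample": approx_count,
        "approx_count_std_error": std_error,
    }


if __name__ == "__main__":
    for name, value in four_cells().items():
        print(f"{name}: {value}")
```

The fourth case, `select approx_count from table`, is left out on purpose: without telling the system that `table` is a sample of something larger, there is nothing to scale up to.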
Perfectly sensible, and I think helpful. Good luck on your Markdown!