`USING SAMPLE` does not have the same precedence as `LIMIT`
What happens?
USING SAMPLE currently has a different precedence/syntax than LIMIT with respect to WHERE and GROUP BY clauses, which I believe can only be unintentional: USING SAMPLE is currently semantically applied right after the FROM clause, but cannot syntactically be placed there.
If the current semantic precedence is indeed intentional, I think it would be best to adapt the syntactic precedence by allowing (or demanding?) that USING SAMPLE be placed right after the FROM clause.
To Reproduce
SELECT
i
FROM
range(10) df(i)
WHERE i > 8
USING SAMPLE 1
returns no result rows most of the time because the sample is applied before the WHERE, while
SELECT
i
FROM
range(10) df(i)
WHERE i > 8
LIMIT 1
always returns one row because the LIMIT is applied after the WHERE.
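Until the precedence question is settled, the LIMIT-like behavior can be obtained by pushing the filter into a subquery, so the sample is drawn from the filtered rows (a sketch against the same range(10) relation as above):

```sql
-- Filter first, then sample: the inner query yields the single row i = 9,
-- so the outer USING SAMPLE always has a row to pick from.
SELECT i
FROM (
    SELECT i FROM range(10) df(i) WHERE i > 8
)
USING SAMPLE 1;
```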
Similarly,
SELECT
sum(i)
FROM
range(100) df(i)
GROUP BY ()
USING SAMPLE 1
gives random outputs, because the SAMPLE is applied before the GROUP BY, while
SELECT
sum(i)
FROM
range(100) df(i)
GROUP BY ()
LIMIT 1
produces the sum over all rows (not the sum of the first row) because the LIMIT is applied after the GROUP BY:
┌────────┐
│ sum(i) │
│ int128 │
├────────┤
│ 4950 │
└────────┘
Finally, USING SAMPLE is applied before the DISTINCT, as repeated execution of the following query shows:
SELECT DISTINCT range FROM (VALUES (0), (0), (1), (1)) df(range) USING SAMPLE 2
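As with the WHERE examples, the sample can be forced after the DISTINCT by wrapping it in a subquery (a sketch reusing the VALUES list above):

```sql
-- DISTINCT first, then sample: the inner query always yields {0, 1},
-- so USING SAMPLE 2 returns both distinct values every time.
SELECT range
FROM (
    SELECT DISTINCT range FROM (VALUES (0), (0), (1), (1)) df(range)
)
USING SAMPLE 2;
```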
OS:
Linux
DuckDB Version:
'0.10.1-dev717'
DuckDB Client:
Python
Full Name:
Soeren Wolfers
Affiliation:
G-Research
Have you tried this on the latest nightly build?
I have tested with a nightly build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] Yes, I have
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
Still an issue on 1.0.0
@szarnyasg This ticket has been open for a while and might be easily resolvable as "not planned". Conversely, if the team agrees the precedence should be changed, it's probably an easy fix but one that becomes harder to make the more versions it stays as is.
Hi @soerenwolfers, that's a good point, thanks for bringing this up! Looking at our internal discussions, resolving this is not planned at the moment, so I'm closing this issue as such.
@szarnyasg I'm very sorry for the confusion, but after I wrote the comment above, I went through the issue again and noticed something even worse than what I described originally: the precedence changes depending on whether JOINs are present. While the original issue I reported could be accepted as a quirk, I strongly feel that the amended issue description is really not acceptable as is. Could you reopen / bring this up again in the internal discussions?
@szarnyasg Ping, just in case you didn't see my last comment.
@soerenwolfers, thanks we just took a look. This is an optimizer bug. If you disable the optimizer, the precedence is consistent. We'll fix the optimizer.
Thanks. Any appetite to align semantics and syntactics without joins?
@szarnyasg I really think this isn't just an optimizer issue.
As shown below, even with PRAGMA disable_optimizer, a JOIN ... ON behaves semantically differently from a CROSS JOIN ... WHERE (because the implicit WHERE in the former is given a different precedence with respect to USING SAMPLE than the explicit WHERE in the latter), which I think runs counter to what anybody would expect from SQL.
Also, note that for the semantics of the CROSS JOIN (which you describe as the correct, bug-free behavior), the entire cross product has to be materialized before the sample can be applied to it. That is super expensive and makes it infeasible to use USING SAMPLE in large JOIN queries.
I maintain that the only semantically sane position for USING SAMPLE to be applied is its current syntactic position. For everything else, people should use TABLESAMPLE.
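The division of labor suggested here would look roughly as follows (a sketch; it assumes TABLESAMPLE accepts the same row-count sample specification as USING SAMPLE, and the second query shows the proposed semantics, not current behavior):

```sql
-- Sampling before the WHERE: attached to the table reference itself.
SELECT i FROM df TABLESAMPLE 10 ROWS WHERE i > 8;

-- Sampling after the WHERE (the proposal): USING SAMPLE keeps its
-- current syntactic position next to LIMIT and is applied like LIMIT.
SELECT i FROM df WHERE i > 8 USING SAMPLE 10 ROWS;
```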
The following query first applies the filter (i.e., the implicit WHERE clause), then the sample:
PRAGMA disable_optimizer;
CREATE OR REPLACE TABLE df AS (SELECT range AS i FROM range(100_000));
CREATE OR REPLACE TABLE wtf AS (SELECT range AS i FROM range(100_000));
EXPLAIN ANALYZE
SELECT
*
FROM
df
JOIN
wtf
ON df.i = wtf.i
USING SAMPLE 1
...
┌─────────────┴─────────────┐
│ RESERVOIR_SAMPLE │
│ ──────────────────── │
│ Sample Size: 1 rows │
│ │
│ 1 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ Conditions: i = i ├──────────────┐
│ │ │
│ 100000 Rows │ │
│ (0.01s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ TABLE_SCAN ││ TABLE_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ df ││ wtf │
│ ││ │
│ 100000 Rows ││ 100000 Rows │
│ (0.00s) ││ (0.00s) │
└───────────────────────────┘└───────────────────────────┘
whereas the following query performs the sample first and then the filter (i.e., the explicit WHERE clause):
PRAGMA disable_optimizer;
CREATE OR REPLACE TABLE df AS (SELECT range AS i FROM range(100_000));
CREATE OR REPLACE TABLE wtf AS (SELECT range AS i FROM range(100_000));
EXPLAIN ANALYZE
SELECT
*
FROM
df
CROSS JOIN
wtf
WHERE df.i = wtf.i
USING SAMPLE 1
...
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ (i = i) │
│ │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ RESERVOIR_SAMPLE │
│ ──────────────────── │
│ Sample Size: 1 rows │
│ │
│ 1 Rows │
│ (0.25s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ CROSS_PRODUCT │
│ ──────────────────── │
│ 10000000000 Rows ├──────────────┐
│ (0.68s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ TABLE_SCAN ││ TABLE_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ df ││ wtf │
│ ││ │
│ 100000 Rows ││ 100000 Rows │
│ (0.00s) ││ (0.00s) │
└───────────────────────────┘└───────────────────────────┘
Hi,
This should be fixed with https://github.com/duckdb/duckdb/pull/14969. I just tested the latest dev version with the example here as well.
@Tmonster Thanks for your work on this.
Don't you agree that the examples in https://github.com/duckdb/duckdb/issues/10988#issuecomment-2563689699 constitute unhealthy SQL, because two queries whose only diff is
FROM
df
[-CROSS-]JOIN
wtf
[-WHERE-]{+ON+} df.i = wtf.i
should return the same result? Is there precedent in DuckDB (or any other SQL implementation) where that's not true?
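Spelled out, the two queries that the diff relates are:

```sql
-- JOIN ... ON variant:
SELECT * FROM df JOIN wtf ON df.i = wtf.i USING SAMPLE 1;

-- CROSS JOIN ... WHERE variant:
SELECT * FROM df CROSS JOIN wtf WHERE df.i = wtf.i USING SAMPLE 1;
```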
I also don't understand the arguments in your PR description:
If the sample is applied after filters and group by's to match syntax to semantics, then it has the same effect as a limit. [...] If someone wants a sample of a result, they can just wrap their query in another SELECT * FROM {query} USING SAMPLE(1).
1. What do you mean by "it has the same effect as a limit"? Samples are random, limit clauses are not. To me, that should be the only difference between the two, and it therefore makes perfect sense for the USING SAMPLE to not only syntactically be placed next to the LIMIT clause but also semantically be executed next to it.
2. What is your argument that the LIMIT clause should have its current semantic precedence instead of the same one as the USING SAMPLE clause? Why couldn't one equally say "If someone wants a limit of a result, they can just wrap their query in another SELECT * FROM {query} LIMIT 1"?
3. If you still prefer the current semantic precedence of USING SAMPLE, why are you against moving it syntactically to after the ON/USING clauses but before the WHERE clause? (To be clear, I'm against that, because it's hard to remember given that it differs from the well-known and very similar LIMIT clause, and I still don't get the advantage of insisting on that particular placement, but I think it's better than the current rules.)
Keep in mind that USING SAMPLE is only supported by DuckDB, so when it comes to precedent as to where the sampling operator should be placed, there's not really anywhere to look. In general, table sample methods are usually used for table references, which can come from table scans or joins.
To answer 1 and 2: consider the effect of applying USING SAMPLE in the same place as a LIMIT. We would then execute the query on all of the data in the table only to return a random sample of it. If we put a limit operator in the same place, with enough data and parallel operators, it would achieve the same result.
So, in the spirit of getting samples from table references/scans/joins, the USING SAMPLE is applied after the table scans/joins, without having to write a specific subquery to say "I want a sample of the join of these tables".
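Under the current semantics, the two spellings below express the same intent; the point above is that the first avoids the subquery (a sketch using the df/wtf tables from earlier):

```sql
-- Sample of the join result, no subquery needed:
SELECT * FROM df JOIN wtf ON df.i = wtf.i USING SAMPLE 1;

-- Equivalent explicit spelling:
SELECT *
FROM (SELECT * FROM df JOIN wtf ON df.i = wtf.i)
USING SAMPLE 1;
```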
To answer 3.
I can agree that, syntactically and semantically, the USING SAMPLE keywords can be confusing. The usage of a table sample in a query should be between the FROM and WHERE clauses; for that we have the TABLESAMPLE keyword.
I hope this clears some things up.
@Tmonster Thanks for your responses. I hope I'm not stretching your patience too far.
when it comes to precedent as to where the sampling operator should be placed, there's not really anywhere to look
I didn't mean precedent of the USING SAMPLE clause specifically but of any query with the given minimal diff from my last response, no matter what other clauses, subqueries, operators, etc. are used in conjunction with it (EDIT: I only just learned that it's "there is precedent" not "there is precedence" so I corrected all my posts so far, and your quote too so I don't look like a jerk ;) )
1 & 2)
If we put a limit operator in the same place, with enough data and parallel operators, it would achieve the same result.
That isn't true: When I run
CREATE TABLE df AS (SELECT * FROM range(100_000_000));
SELECT
*
FROM df
LIMIT 10
with 8 threads, I get the numbers 0 to 9, and according to the docs this is guaranteed! Even if it weren't guaranteed, thread randomness is not statistical randomness, so using a LIMIT clause when you need random samples for statistical work is just not an option.
In the spirit of getting samples from table references/scans/joins, the USING SAMPLE is applied after the table scans/joins without having to write a specific subquery to say I want a sample of the join of these tables.
This sounds like you're thinking of USING SAMPLE as LIMIT-BUT-FOR-CONVENIENCE-APPLIED-IN-A-DIFFERENT-PLACE. I don't think that's useful. If somebody wants a limit applied in a different place, they can already do so using subqueries. No need to introduce a whole new operator for that, and make that operator break two obvious expectations to boot: (a) that INNER JOIN ... ON == CROSS JOIN ... WHERE and (b) that semantic placement and syntactic placement agree with each other and with the neighboring and similar LIMIT clause.
The usage of a table sample in a query should be between the FROM and WHERE clauses.
I was aware of the TABLESAMPLE clause, but I don't see how it's relevant to me, since I want USING SAMPLE to apply after the WHERE clause. If anything, I could say that you can already use TABLESAMPLE to apply a sample before the WHERE clause (so satisfying your "in the spirit of getting samples from table references/scans"), which means that we can keep USING SAMPLE to stick to LIMIT's precedence instead (which fortunately is already where it syntactically goes anyway).
In fact, TABLESAMPLE being equivalent to USING SAMPLE in all queries without JOINs makes having both not very useful. I believe there are more queries without a JOIN clause (which would make the current distinction between TABLESAMPLE and USING SAMPLE pointless) than there are queries without a WHERE clause.
Yes, given the diff you provided, of course this will return the same result, but the result is not always created in the same way. Without optimization, one query computes a cross join followed by a filter, and the other an inner join. In the presence of USING SAMPLE, the optimization to plan an inner join does not happen. You can then choose to sample the result of the cross join, or the result of the inner join.
Yes, your example does preserve order, but if you do FROM tbl1 JOIN tbl2 ON (tbl1.i = tbl2.i), order is not guaranteed to be preserved.
This sounds like you're thinking of USING SAMPLE as LIMIT-BUT-FOR-CONVENIENCE-APPLIED-IN-A-DIFFERENCE-PLACE
Not really, because it's not a limit, it's a statistically representative sample. This is important in the presence of skew. Since the subset needs to be statistically representative, you do need a whole new operator.
3 & rest)
I get it: semantically, the sampling keywords can be confusing because syntactically the USING SAMPLE clause comes after the WHERE clause. I disagree, however, with applying USING SAMPLE at the same semantic level as LIMIT. According to the docs,
Samples allow you to randomly extract a subset of a dataset. Samples are useful for exploring a dataset faster, as often you might not be interested in the exact answers to queries, but only in rough indications of what the data looks like and what is in the data. Samples allow you to get approximate answers to queries faster, as they reduce the amount of data that needs to pass through the query engine.
The purpose of sampling is to avoid passing complete table data through filters.
So to put USING SAMPLE semantically at the same place as LIMIT would mean passing complete table data through filters and only keeping some percentage or subset of it. To me, this workflow is not the purpose of sampling. Sampling is for getting a statistical subset of the table data, not a statistical subset of the table data after filters.
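The disagreement boils down to which population the sample represents; the two orderings generally produce differently sized and differently distributed results. A sketch of the contrast, assuming TABLESAMPLE accepts a row-count sample size and applies to a table function reference:

```sql
-- Sample before the filter (current USING SAMPLE semantics, also
-- expressible with TABLESAMPLE): a 10-row sample of range(100)
-- usually misses the single qualifying row i = 99, so this is
-- usually empty.
SELECT i FROM range(100) df(i) TABLESAMPLE 10 ROWS WHERE i > 98;

-- Sample after the filter (LIMIT-like semantics via a subquery):
-- drawn from the qualifying rows, so the row i = 99 is always returned.
SELECT i FROM (SELECT i FROM range(100) df(i) WHERE i > 98) USING SAMPLE 10;
```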
As for your comment regarding TABLESAMPLE vs USING SAMPLE, the documentation mentions the following
The TABLESAMPLE and USING SAMPLE clauses are identical in terms of syntax and effect, with one important difference: tablesamples sample directly from the table for which they are specified, whereas the sample clause samples after the entire from clause has been resolved. This is relevant when there are joins present in the query plan.
We can fix the comment about "identical in terms of syntax". In the end the effect is the same though.
Thanks, I'll put this issue to rest then. Having talked so much about it, I don't think I personally will mix up the precedence again ;)