substrait icon indicating copy to clipboard operation
substrait copied to clipboard

LIMIT clause not always working correctly

Open amoeba opened this issue 10 months ago • 1 comments

What happens?

Using LIMIT can result in very strange Substrait plans which also produce incorrect results.

To Reproduce

This simpler example works fine:

# Setup
>>> import duckdb
>>> con.sql("DROP TABLE crossfit;")
>>> con.sql("CREATE TABLE crossfit (exercise text, difficulty_level int);")
>>> con.sql("INSERT INTO crossfit VALUES ('Push Ups', 3), ('Pull Ups', 5) , ('Push Jerk'\
, 7), ('Bar Muscle Up', 10);")
>>> con.sql("select * from crossfit;")
┌───────────────┬──────────────────┐
│   exercise    │ difficulty_level │
│    varchar    │      int32       │
├───────────────┼──────────────────┤
│ Push Ups      │                3 │
│ Pull Ups      │                5 │
│ Push Jerk     │                7 │
│ Bar Muscle Up │               10 │
└───────────────┴──────────────────┘

# This is what will expect to get

>>> con.sql("select * from crossfit limit 1;")
┌──────────┬──────────────────┐
│ exercise │ difficulty_level │
│ varchar  │      int32       │
├──────────┼──────────────────┤
│ Push Ups │                3 │
└──────────┴──────────────────┘

# And we do get what we expect:

>>> print(con.sql(f"CALL from_substrait_json('{con.sql("CALL get_substrait_json('select \
* from crossfit limit 1')").fetchone()[0]}');"))
┌──────────┬──────────────────┐
│ exercise │ difficulty_level │
│ varchar  │      int32       │
├──────────┼──────────────────┤
│ Push Ups │                3 │
└──────────┴──────────────────┘

However another example doesn't work:

(data can be downloaded from https://files.brycemecum.com/penguins.parquet)

>>> import duckdb
>>> con = duckdb.connect()
>>> con.sql("LOAD substrait")
>>> con.sql("CREATE TABLE 'penguins' AS SELECT * FROM 'data/penguins.parquet';")
>>> plan_json = con.sql("CALL get_substrait_json('select * from penguins limit 1;');").fetchone()[0]
>>> print(con.sql(f"CALL from_substrait_json('{plan_json}');"))
┌─────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬─────────┬───────┐
│ species │  island   │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │   sex   │ year  │
│ varchar │  varchar  │     double     │    double     │       int32       │    int32    │ varchar │ int32 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼─────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male    │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female  │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female  │  2007 │
│ Adelie  │ Torgersen │           NULL │          NULL │              NULL │        NULL │ NULL    │  2007 │
│ Adelie  │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female  │  2007 │
│ Adelie  │ Torgersen │           39.3 │          20.6 │               190 │        3650 │ male    │  2007 │
│ Adelie  │ Torgersen │           38.9 │          17.8 │               181 │        3625 │ female  │  2007 │
│ Adelie  │ Torgersen │           39.2 │          19.6 │               195 │        4675 │ male    │  2007 │
│ Adelie  │ Torgersen │           34.1 │          18.1 │               193 │        3475 │ NULL    │  2007 │
│ Adelie  │ Torgersen │           42.0 │          20.2 │               190 │        4250 │ NULL    │  2007 │
│   ·     │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│   ·     │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│   ·     │   ·       │             ·  │            ·  │                ·  │          ·  │  ·      │    ·  │
│ Adelie  │ Dream     │           32.1 │          15.5 │               188 │        3050 │ female  │  2009 │
│ Adelie  │ Dream     │           40.7 │          17.0 │               190 │        3725 │ male    │  2009 │
│ Adelie  │ Dream     │           37.3 │          16.8 │               192 │        3000 │ female  │  2009 │
│ Adelie  │ Dream     │           39.0 │          18.7 │               185 │        3650 │ male    │  2009 │
│ Adelie  │ Dream     │           39.2 │          18.6 │               190 │        4250 │ male    │  2009 │
│ Adelie  │ Dream     │           36.6 │          18.4 │               184 │        3475 │ female  │  2009 │
│ Adelie  │ Dream     │           36.0 │          17.8 │               195 │        3450 │ female  │  2009 │
│ Adelie  │ Dream     │           37.8 │          18.1 │               193 │        3750 │ male    │  2009 │
│ Adelie  │ Dream     │           36.0 │          17.1 │               187 │        3700 │ female  │  2009 │
│ Adelie  │ Dream     │           41.5 │          18.5 │               201 │        4000 │ male    │  2009 │
├─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴─────────┴───────┤
│ 152 rows (20 shown)                                                                            8 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

It's clearly done some filtering but the limit isn't respected and the plan it produces has a very strange structure given the query:

Root
|-- Project
    |-- Project
        |-- Sort
            |-- Project
                |-- Join
                    |-- Left
                    |   |-- Read
                    |-- Right
                        |-- Fetch(offset=0, limit=10)
                            |-- Read

whereas the plan only needs to be this complicated:

Root
|-- Fetch(offset=0, limit=10)
    |-- Read

I couldn't figure out how to reproduce with development versions but I did check the box below.

OS:

macOS 15

Substrait-Extension Version:

latest version from install substrait from community

DuckDB Version:

1.2.0

DuckDB Client:

1.2.0

Have you tried this on the latest master branch?

  • [x] I have

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [x] I have

amoeba avatar Mar 03 '25 20:03 amoeba

It appears that the fetchone() in the second example is being run on the construction of the substrait plan which should normally return one row. I am curious where the sorts and joins are coming from. We do know that DuckDB inserts extra project statements to ensure data is accessible, this may be related.

EpsilonPrime avatar Mar 03 '25 21:03 EpsilonPrime