Performance issues due to offset/limit paradigm on PostgreSQL database
Issue description
This line slows down SQL queries on PostgreSQL 11 significantly, because it changes the execution plan.
Steps to reproduce the issue
- Create data
create table dummy_table_left (
id int primary key,
some_text text,
some_comment text
);
insert into dummy_table_left values
(1, 'two', 'three'),
(2, 'two2', 'three'),
(3, 'two3', 'three'),
(4, 'two4', 'three'),
(5, 'two5', 'three');
create table dummy_table_right (
id int primary key,
some_text text,
some_comment text
);
insert into dummy_table_right values
(1, 'two', 'three2'),
(2, 'two2', 'three3'),
(3, 'two3', 'three4'),
(4, 'two4', 'three5'),
(5, 'two5', 'three6');
- Compare execution plan of query
explain
select * from dummy_table_left dtl
join dummy_table_right dtr on dtl.some_text = dtr.some_text;
vs
explain
select * from dummy_table_left dtl
join dummy_table_right dtr on dtl.some_text = dtr.some_text
offset 2
limit 2;
- (or, alternatively, load more data into the tables and observe the significant increase in runtime by using explain analyze instead)
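As a quick sanity check of the test data itself (not of the planner behaviour, which is specific to the PostgreSQL optimizer), the two tables and the join can be reproduced in-memory with SQLite from Python's standard library. This is only a sketch to confirm that each row on the left matches exactly one row on the right:

```python
import sqlite3

# Stand-in for the Postgres setup above; SQLite is used only to verify the
# test data, it does NOT reproduce the Postgres planner behaviour.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dummy_table_left (id int primary key, some_text text, some_comment text);
    create table dummy_table_right (id int primary key, some_text text, some_comment text);
    insert into dummy_table_left values
        (1, 'two', 'three'), (2, 'two2', 'three'), (3, 'two3', 'three'),
        (4, 'two4', 'three'), (5, 'two5', 'three');
    insert into dummy_table_right values
        (1, 'two', 'three2'), (2, 'two2', 'three3'), (3, 'two3', 'three4'),
        (4, 'two4', 'three5'), (5, 'two5', 'three6');
""")

rows = conn.execute("""
    select * from dummy_table_left dtl
    join dummy_table_right dtr on dtl.some_text = dtr.some_text
""").fetchall()
print(len(rows))  # each left row joins to exactly one right row -> 5
```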
What's the expected result?
- something similar to
Subquery Scan on row_number_wrapper  (cost=29.12..284.54 rows=18 width=144)
  Filter: ((row_number_wrapper.some_label > 2) AND (row_number_wrapper.some_label <= 4))
  ->  WindowAgg  (cost=29.12..230.36 rows=3612 width=148)
        ->  Hash Join  (cost=29.12..176.18 rows=3612 width=140)
              Hash Cond: (dtl.some_text = dtr.some_text)
              ->  Seq Scan on dummy_table_left dtl  (cost=0.00..18.50 rows=850 width=68)
              ->  Hash  (cost=18.50..18.50 rows=850 width=68)
                    ->  Seq Scan on dummy_table_right dtr  (cost=0.00..18.50 rows=850 width=68)
which can be achieved with (note that row_number() is 1-based, so offset 2 corresponds to some_label > 2, not >= 2)
explain
select * from (
  select row_number() over (order by 1) as "some_label", *
  from dummy_table_left dtl
  join dummy_table_right dtr on dtl.some_text = dtr.some_text
) row_number_wrapper
where some_label > 2 and some_label <= 2 + 2;
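The OFFSET/LIMIT pair maps onto 1-based row_number() bounds as rn > offset and rn <= offset + limit, i.e. OFFSET 2 LIMIT 2 selects rows 3 and 4. A small sketch of that arithmetic and of the query wrapping (the helper names are hypothetical illustrations, not part of any library API):

```python
def row_number_bounds(offset: int, limit: int) -> tuple[int, int]:
    """Translate OFFSET/LIMIT into inclusive row_number() bounds.

    row_number() starts at 1, so OFFSET 2 LIMIT 2 keeps rows 3 and 4.
    """
    low = offset + 1       # first row to keep (inclusive)
    high = offset + limit  # last row to keep (inclusive)
    return low, high


def wrap_with_row_number(inner_sql: str, offset: int, limit: int) -> str:
    """Wrap a query so pagination filters on row_number() instead of
    appending OFFSET/LIMIT, mirroring the hand-written query above."""
    low, high = row_number_bounds(offset, limit)
    return (
        "select * from ("
        'select row_number() over (order by 1) as "some_label", * '
        f"from ({inner_sql}) q"
        f") row_number_wrapper where some_label >= {low} and some_label <= {high}"
    )


print(row_number_bounds(2, 2))  # (3, 4)
```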
What's the actual result?
Limit  (cost=6.05..12.10 rows=2 width=148)
  ->  WindowAgg  (cost=0.00..10930.81 rows=3612 width=148)
        ->  Nested Loop  (cost=0.00..10876.62 rows=3612 width=140)
              Join Filter: (dtl.some_text = dtr.some_text)
              ->  Seq Scan on dummy_table_left dtl  (cost=0.00..18.50 rows=850 width=68)
              ->  Materialize  (cost=0.00..22.75 rows=850 width=68)
                    ->  Seq Scan on dummy_table_right dtr  (cost=0.00..18.50 rows=850 width=68)
Additional details / screenshot
- A nested loop join is significantly slower than the hash join. Adding the limit and offset changes the actual execution plan, so this pagination technique cannot be used with constructed/joined queries (on Postgres at least).
- Pre-saving the result into a table would create an I/O bottleneck.
- In my use case, the proposed approach with row_number() takes 300 ms, compared to 30 s with the approach currently implemented in this module.
@alexdyck Thanks, I will take a closer look and try to implement the approach you described.
@alexdyck Those who have a performance issue with the default limit-offset implementation can now use row-number pagination.
Row-number pagination is available for the sqlmodel and sqlalchemy extensions.
Usage example:
@app.get("/users", response_model=Page[UserSchema])
def route(db: Session = Depends(get_db)):
    return paginate(db, select(User), query_type="row-number")
I am planning to release a new version in 1-2 weeks.
Just to add something to the topic of LIMIT-OFFSET on Postgres: the Postgres documentation describes what @alexdyck is doing wrong:
Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY
You don't enforce any ordering; this is the problem. Adding ORDER BY id should make the query almost instant, because the DB can use the primary-key index to pick the requested rows.
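The determinism point can be illustrated outside Postgres. A sketch with SQLite (the planner differs from Postgres, but the ORDER BY/LIMIT/OFFSET semantics shown here are standard SQL): without ORDER BY, the subset returned by LIMIT/OFFSET is unspecified; with ORDER BY id, page 2 is deterministically rows 3 and 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table dummy_table_left (id int primary key, some_text text);
    insert into dummy_table_left values
        (1, 'two'), (2, 'two2'), (3, 'two3'), (4, 'two4'), (5, 'two5');
""")

# With an explicit ORDER BY, OFFSET 2 LIMIT 2 always returns the same rows,
# regardless of which plan the database chooses.
page = conn.execute(
    "select id from dummy_table_left order by id limit 2 offset 2"
).fetchall()
print(page)  # [(3,), (4,)]
```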
@dexax Thanks for the explanation.
I am closing this issue, since we have found the root cause and it's not a library issue.