django-simple-history
django-simple-history copied to clipboard
as_of with Postgres generates a Sequential Scan with composite index enabled
Describe the bug
When using print(MyModel.history.as_of(as_of).only('id').explain(analyze=True)), it generates a Sequential Scan fromt his code here:
https://github.com/jazzband/django-simple-history/blob/273a4ef3455c729451aa227145cd7aeec5641dd6/simple_history/manager.py#L76-L82
To Reproduce Steps to reproduce the behavior:
- Run
print(MyModel.history.as_of(as_of).only('id').explain(analyze=True)) - Look for
Seq Scan onin the report.
In [13]: print(Submission.history.as_of(as_of).only('id').explain(analyze=True))
Sort (cost=33072.63..33264.17 rows=76618 width=16) (actual time=413.229..418.838 rows=140182 loops=1)
Sort Key: lms_historicalsubmission.history_date DESC, lms_historicalsubmission.history_id DESC
Sort Method: external merge Disk: 3576kB
-> Nested Loop (cost=25555.05..26856.84 rows=76618 width=16) (actual time=119.558..391.156 rows=140182 loops=1)
-> HashAggregate (cost=25554.76..25556.76 rows=200 width=4) (actual time=118.856..137.045 rows=140182 loops=1)
Group Key: u0.history_id
Batches: 5 Memory Usage: 10817kB Disk Usage: 1600kB
-> Unique (cost=22932.70..23698.88 rows=148470 width=16) (actual time=86.763..100.536 rows=140182 loops=1)
-> Sort (cost=22932.70..23315.79 rows=153236 width=16) (actual time=86.762..91.731 rows=145860 loops=1)
Sort Key: u0.id, u0.history_date DESC, u0.history_id DESC
Sort Method: external merge Disk: 3720kB
-> Seq Scan on lms_historicalsubmission u0 (cost=0.00..7113.45 rows=153236 width=16) (actual time=0.598..50.720 rows=145860 loops=1)
Filter: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
-> Index Scan using lms_historicalsubmission_pkey on lms_historicalsubmission (cost=0.29..7.74 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=140182)
Index Cond: (history_id = u0.history_id)
Filter: ((history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone) AND ((history_type)::text <> '-'::text))
Planning Time: 2.461 ms
Execution Time: 425.243 ms
Expected behavior Sequential scans are the worst type of operation since it means every single record is being checked. This should not be the case, especially considering the basic nature of the query.
If we use the sqlite version of latest_of_each which uses a Subquery, we end up hitting an index scan which is significantly better:
In [2]: print(Submission.history.as_of(as_of).only('id').explain(analyze=True))
Incremental Sort (cost=17.69..1321467.26 rows=76618 width=16) (actual time=0.549..290.055 rows=140182 loops=1)
Sort Key: lms_historicalsubmission.history_date DESC, lms_historicalsubmission.history_id DESC
Presorted Key: lms_historicalsubmission.history_date
Full-sort Groups: 4378 Sort Method: quicksort Average Memory: 26kB Peak Memory: 26kB
Pre-sorted Groups: 13 Sort Method: quicksort Average Memory: 25kB Peak Memory: 25kB
-> Index Scan Backward using lms_histori_history_8e6816_idx on lms_historicalsubmission (cost=0.42..1318028.83 rows=76618 width=16) (actual time=0.097..277.867 rows=140182 loops=1)
Index Cond: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
Filter: (((history_type)::text <> '-'::text) AND (SubPlan 1))
Rows Removed by Filter: 5678
SubPlan 1
-> Limit (cost=8.45..8.46 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
-> Sort (cost=8.45..8.46 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
Sort Key: u0.history_date DESC, u0.history_id DESC
Sort Method: quicksort Memory: 25kB
-> Index Scan using lms_historicalsubmission_id_1ed5a117 on lms_historicalsubmission u0 (cost=0.42..8.44 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
Index Cond: (id = lms_historicalsubmission.id)
Filter: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
Planning Time: 3.206 ms
Execution Time: 295.404 ms
Environment (please complete the following information):
- OS: Mac M1 air, 14.2.1
- Browser (if applicable): None
- Django Simple History Version: development / 3.0.0
- Django Version: 4.2.10
- Database Version: Postgres 15.5
Suggested fix
I'd suggest dropping the postgresql logic branch in the latest_of_each method in favor of the Subquery.
@jeking3 I believe you added this method in https://github.com/jazzband/django-simple-history/pull/866. I'd love to get your opinion on this.
I seem to be running into this as well, regardless of whether a composite index is being used. By adding a
with patch("django.db.connection.vendor", "something else"):
context manager around my .as_of() calls to avoid the Postgres-specific branch, I'm seeing up to 6x speed improvements on certain queries.
@ddabble any concern with removing that code block? I think this is an easy win.
Sure, as long as removing it doesn't severely affect performance in some cases (which I assumed was the reason it was added in the first place, but perhaps not..?), I'm all for both improving performance and reducing the amount of DB backend-specific code 🙂 Would be great having some tests analyzing/measuring the query performance, though, but I'm not familiar with any method of doing that easily 🤔
@jeking3 I believe you added this method in https://github.com/jazzband/django-simple-history/pull/866. I'd love to get your opinion on this.
Btw, it looks like the method was mainly added by @TyrantWave in #759, and that @jeking3 simply moved the code in #866