
as_of with Postgres generates a Sequential Scan with composite index enabled

Open tim-schilling opened this issue 1 year ago • 5 comments

Describe the bug When using print(MyModel.history.as_of(as_of).only('id').explain(analyze=True)), it generates a sequential scan from this code here: https://github.com/jazzband/django-simple-history/blob/273a4ef3455c729451aa227145cd7aeec5641dd6/simple_history/manager.py#L76-L82
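For orientation, as_of conceptually answers "what did each row look like at time T": for every tracked id, take the most recent historical record at or before T, then drop rows whose latest entry is a deletion ('-'). A minimal pure-Python sketch of that semantics (illustrative only; this is not the library's implementation, and the dict-based records are stand-ins for historical model rows):

```python
from datetime import datetime

def as_of(history, ts):
    """Return the latest record per id with history_date <= ts,
    excluding ids whose latest entry is a deletion ('-')."""
    latest = {}
    for rec in history:  # rec: dict with id, history_date, history_type
        if rec["history_date"] > ts:
            continue  # record is from after the as_of timestamp
        cur = latest.get(rec["id"])
        if cur is None or rec["history_date"] > cur["history_date"]:
            latest[rec["id"]] = rec
    return [r for r in latest.values() if r["history_type"] != "-"]

history = [
    {"id": 1, "history_date": datetime(2024, 1, 1), "history_type": "+"},
    {"id": 1, "history_date": datetime(2024, 2, 1), "history_type": "~"},
    {"id": 2, "history_date": datetime(2024, 1, 5), "history_type": "+"},
    {"id": 2, "history_date": datetime(2024, 3, 1), "history_type": "-"},
]

# As of Feb 15: id 1's latest record is the Feb 1 update; id 2's
# deletion hasn't happened yet, so both rows are in the snapshot.
result = as_of(history, datetime(2024, 2, 15))
```

The debate in this issue is purely about which SQL shape the database uses to compute this "latest per id" step, not about the semantics themselves.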

To Reproduce Steps to reproduce the behavior:

  1. Run print(MyModel.history.as_of(as_of).only('id').explain(analyze=True))
  2. Look for Seq Scan on in the report.
In [13]: print(Submission.history.as_of(as_of).only('id').explain(analyze=True))
Sort  (cost=33072.63..33264.17 rows=76618 width=16) (actual time=413.229..418.838 rows=140182 loops=1)
  Sort Key: lms_historicalsubmission.history_date DESC, lms_historicalsubmission.history_id DESC
  Sort Method: external merge  Disk: 3576kB
  ->  Nested Loop  (cost=25555.05..26856.84 rows=76618 width=16) (actual time=119.558..391.156 rows=140182 loops=1)
        ->  HashAggregate  (cost=25554.76..25556.76 rows=200 width=4) (actual time=118.856..137.045 rows=140182 loops=1)
              Group Key: u0.history_id
              Batches: 5  Memory Usage: 10817kB  Disk Usage: 1600kB
              ->  Unique  (cost=22932.70..23698.88 rows=148470 width=16) (actual time=86.763..100.536 rows=140182 loops=1)
                    ->  Sort  (cost=22932.70..23315.79 rows=153236 width=16) (actual time=86.762..91.731 rows=145860 loops=1)
                          Sort Key: u0.id, u0.history_date DESC, u0.history_id DESC
                          Sort Method: external merge  Disk: 3720kB
                          ->  Seq Scan on lms_historicalsubmission u0  (cost=0.00..7113.45 rows=153236 width=16) (actual time=0.598..50.720 rows=145860 loops=1)
                                Filter: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
        ->  Index Scan using lms_historicalsubmission_pkey on lms_historicalsubmission  (cost=0.29..7.74 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=140182)
              Index Cond: (history_id = u0.history_id)
              Filter: ((history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone) AND ((history_type)::text <> '-'::text))
Planning Time: 2.461 ms
Execution Time: 425.243 ms

Expected behavior A sequential scan is the worst-case access pattern, since it means every single record in the table is checked. This should not happen here, especially considering the basic nature of the query.

If we use the SQLite version of latest_of_each, which uses a Subquery, we end up with an index scan instead, which is significantly faster:

In [2]: print(Submission.history.as_of(as_of).only('id').explain(analyze=True))
Incremental Sort  (cost=17.69..1321467.26 rows=76618 width=16) (actual time=0.549..290.055 rows=140182 loops=1)
  Sort Key: lms_historicalsubmission.history_date DESC, lms_historicalsubmission.history_id DESC
  Presorted Key: lms_historicalsubmission.history_date
  Full-sort Groups: 4378  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
  Pre-sorted Groups: 13  Sort Method: quicksort  Average Memory: 25kB  Peak Memory: 25kB
  ->  Index Scan Backward using lms_histori_history_8e6816_idx on lms_historicalsubmission  (cost=0.42..1318028.83 rows=76618 width=16) (actual time=0.097..277.867 rows=140182 loops=1)
        Index Cond: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
        Filter: (((history_type)::text <> '-'::text) AND (SubPlan 1))
        Rows Removed by Filter: 5678
        SubPlan 1
          ->  Limit  (cost=8.45..8.46 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                ->  Sort  (cost=8.45..8.46 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                      Sort Key: u0.history_date DESC, u0.history_id DESC
                      Sort Method: quicksort  Memory: 25kB
                      ->  Index Scan using lms_historicalsubmission_id_1ed5a117 on lms_historicalsubmission u0  (cost=0.42..8.44 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=145860)
                            Index Cond: (id = lms_historicalsubmission.id)
                            Filter: (history_date <= '2024-02-16 19:40:57.09347+00'::timestamp with time zone)
Planning Time: 3.206 ms
Execution Time: 295.404 ms

Environment (please complete the following information):

  • OS: Mac M1 air, 14.2.1
  • Browser (if applicable): None
  • Django Simple History Version: development / 3.0.0
  • Django Version: 4.2.10
  • Database Version: Postgres 15.5

Suggested fix I'd suggest dropping the postgresql logic branch in the latest_of_each method in favor of the Subquery.

tim-schilling avatar Feb 19 '24 17:02 tim-schilling

@jeking3 I believe you added this method in https://github.com/jazzband/django-simple-history/pull/866. I'd love to get your opinion on this.

tim-schilling avatar Feb 19 '24 17:02 tim-schilling

I seem to be running into this as well, regardless of whether a composite index is being used. By adding a with patch("django.db.connection.vendor", "something else"): context manager around my .as_of() calls to avoid the Postgres-specific branch, I'm seeing up to 6x speed improvements on certain queries.
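The patching trick in the comment above can be illustrated with a stand-in object; in a real project the patch target is the string "django.db.connection.vendor", exactly as written in the comment. This is only a mechanism demo, not project code:

```python
# unittest.mock.patch temporarily replaces an attribute and restores it
# on exit. A SimpleNamespace stands in for django.db.connection here.
from types import SimpleNamespace
from unittest.mock import patch

connection = SimpleNamespace(vendor="postgresql")

with patch.object(connection, "vendor", "something else"):
    # Inside the block, any vendor check sees the patched value, so
    # simple_history would take its generic (Subquery) code path.
    assert connection.vendor == "something else"

# The original vendor string is restored when the block exits.
assert connection.vendor == "postgresql"
```

Note that patching a global connection attribute is thread-unsafe, so this is a diagnostic workaround rather than a production fix.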

dmartin avatar Apr 29 '24 23:04 dmartin

@ddabble any concern with removing that code block? I think this is an easy win.

tim-schilling avatar May 02 '24 19:05 tim-schilling

Sure, as long as removing it doesn't severely affect performance in some cases (which I assumed was the reason it was added in the first place, but perhaps not..?), I'm all for both improving performance and reducing the amount of DB backend-specific code 🙂 It would be great to have some tests analyzing/measuring the query performance, though, but I'm not familiar with any method of doing that easily 🤔

ddabble avatar May 03 '24 17:05 ddabble

@jeking3 I believe you added this method in https://github.com/jazzband/django-simple-history/pull/866. I'd love to get your opinion on this.

Btw, it looks like the method was mainly added by @TyrantWave in #759, and that @jeking3 simply moved the code in #866

ddabble avatar May 03 '24 17:05 ddabble