akka-persistence-jdbc
Prefer filtering over columns on the tag table
While running some experiments and studying the query plan for queryByTag
over a Postgres database with 153k events, I found some surprising results:
Intro: the Slick query for queryByTag
Slick produces the following SQL:
select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x2."ordering" > 151815)
and (x2."ordering" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
The query above would be for tag carts-0, reading from offset 151815, where the max ordering in the journal table is 153815. The fourth input on that query is the batch size (in this case 500).
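For reference, the plans below keep mentioning two indexes, event_journal_ordering_idx and event_tag_pkey. They come from the default schema; here is a simplified sketch (payload and serialization columns trimmed, matching the fuller DDL quoted later in this thread):

```sql
CREATE TABLE event_journal (
  ordering        BIGSERIAL,
  persistence_id  VARCHAR(255) NOT NULL,
  sequence_number BIGINT       NOT NULL,
  -- writer / timestamp / payload / serialization columns omitted for brevity
  PRIMARY KEY (persistence_id, sequence_number)
);
-- the index used for the ordering-based offset range
CREATE UNIQUE INDEX event_journal_ordering_idx ON event_journal (ordering);

CREATE TABLE event_tag (
  event_id BIGINT       NOT NULL REFERENCES event_journal (ordering) ON DELETE CASCADE,
  tag      VARCHAR(256) NOT NULL,
  PRIMARY KEY (event_id, tag)  -- backs the event_tag_pkey index seen in the plans
);
```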
Analysing how it performs
When testing different values for the offset and checking their query plan I noticed huge differences:
- First, I use a reasonable offset and limit:
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
...
(x3."tag" = 'carts-0')
and (
(x2."ordering" > 151815)
and (x2."ordering" <= 153815))
)
...
QUERY PLAN
-------------------------------------------------------------------------------------------------
Limit (cost=0.84..875.91 rows=45 width=178) (actual time=0.155..52.980 rows=402 loops=1)
-> Nested Loop (cost=0.84..875.91 rows=45 width=178) (actual time=0.140..49.324 rows=402 loops=1)
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..19.11 rows=220 width=178) (actual time=0.081..11.503 rows=2000 loops=1)
Index Cond: ((ordering > 151815) AND (ordering <= 153815))
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..3.89 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=2000)
Index Cond: ((event_id = x2.ordering) AND (tag = 'carts-0'::text))
Heap Fetches: 402
Planning Time: 0.285 ms
Execution Time: 54.818 ms
(9 rows)
It produces a consistent 55ms execution time.
- Then we try a big offset and equal limit. This processor is lagging behind...
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
...
(x3."tag" = 'carts-0')
and (
(x2."ordering" > 141815)
and (x2."ordering" <= 153815))
)
...
QUERY PLAN
-------------------------------------------------------------------------------------------------
Limit (cost=2936.03..2936.71 rows=270 width=178) (actual time=448.314..454.791 rows=500 loops=1)
-> Sort (cost=2936.03..2936.71 rows=270 width=178) (actual time=448.303..450.377 rows=500 loops=1)
Sort Key: x2.ordering
Sort Method: top-N heapsort Memory: 195kB
-> Hash Join (cost=101.96..2925.13 rows=270 width=178) (actual time=369.357..436.492 rows=2399 loops=1)
Hash Cond: (x3.event_id = x2.ordering)
-> Seq Scan on event_tag x3 (cost=0.00..2740.32 rows=31560 width=8) (actual time=0.012..156.151 rows=31132 loops=1)
Filter: ((tag)::text = 'carts-0'::text)
Rows Removed by Filter: 122683
-> Hash (cost=85.44..85.44 rows=1322 width=178) (actual time=133.680..133.693 rows=12000 loops=1)
Buckets: 16384 (originally 2048) Batches: 1 (originally 1) Memory Usage: 2671kB
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..85.44 rows=1322 width=178) (actual time=0.038..70.999 rows=12000 loops=1)
Index Cond: ((ordering > 141815) AND (ordering <= 153815))
Planning Time: 0.225 ms
Execution Time: 456.892 ms
(15 rows)
The execution time increases considerably, and the query plan looks completely different (including a sequential scan to filter by tag).
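One diagnostic that could confirm the sequential scan is the culprit (a sketch; this was not part of the original runs) is to disable seq scans for the session and re-run the same statement:

```sql
-- Hypothetical check: steer the planner away from sequential scans for this session only,
-- then compare plan and timing with the lagging-offset query above.
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT x2."ordering", x2."persistence_id", x2."sequence_number",
       x2."writer", x2."write_timestamp", x2."event_payload",
       x2."event_ser_id", x2."event_ser_manifest"
FROM "event_journal" x2, "event_tag" x3
WHERE x3."tag" = 'carts-0'
  AND x2."ordering" > 141815
  AND x2."ordering" <= 153815
  AND x2."ordering" = x3."event_id"
ORDER BY x2."ordering"
LIMIT 500;
RESET enable_seqscan;
```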
- Here's where things get weird. This is now a query for an (almost) new projection that has the full journal to catch up on:
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
...
(x3."tag" = 'carts-0')
and (
(x2."ordering" > 15)
and (x2."ordering" <= 153815))
)
...
QUERY PLAN
-------------------------------------------------------------------------------------------------
Limit (cost=1.15..219.65 rows=500 width=178) (actual time=0.160..17.538 rows=500 loops=1)
-> Merge Join (cost=1.15..13770.12 rows=31508 width=178) (actual time=0.150..12.829 rows=500 loops=1)
Merge Cond: (x2.ordering = x3.event_id)
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..9193.54 rows=153527 width=178) (actual time=0.019..2.816 rows=500 loops=1)
Index Cond: ((ordering > 15) AND (ordering <= 153815))
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..3834.17 rows=31511 width=8) (actual time=0.018..2.567 rows=511 loops=1)
Index Cond: (tag = 'carts-0'::text)
Heap Fetches: 0
Planning Time: 0.209 ms
Execution Time: 19.825 ms
(10 rows)
And the execution time goes back to values similar to (or even better than) case 1.
Proposed solution
The columns journal.ordering and event_tag.event_id contain the same values (event_tag.event_id has a foreign key to journal.ordering). Therefore, we can alter the filter condition so the index in event_tag is used both for the tag equality and for the offset range:
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 151815) <--- here's the change
and (x3."event_id" <= 153815)) <--- here's the change
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
When doing that, here's what happens:
Execution Time | 1st query (151815-153815) | 2nd query (141815-153815) | 3rd query (15-153815) |
---|---|---|---|
original slick | ~50ms | ~450ms | ~18ms |
modified | ~17.5ms | ~20.4ms | ~18.7ms |
Full details
Taking the baseline queryByTag queries, prefer using the tag table columns when filtering so the journal index is only used for ordering:
- A somewhat regular query:
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 151815)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.84..379.30 rows=45 width=178) (actual time=0.065..15.550 rows=402 loops=1)
-> Nested Loop (cost=0.84..379.30 rows=45 width=178) (actual time=0.050..11.572 rows=402 loops=1)
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..7.61 rows=45 width=8) (actual time=0.018..2.106 rows=402 loops=1)
Index Cond: ((event_id > 151815) AND (event_id <= 153815) AND (tag = 'carts-0'::text))
Heap Fetches: 0
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..8.26 rows=1 width=178) (actual time=0.007..0.007 rows=1 loops=402)
Index Cond: (ordering = x3.event_id)
Planning Time: 0.261 ms
Execution Time: 17.569 ms
(9 rows)
- Then we try a big offset and equal limit.
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 141815)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.84..2038.09 rows=271 width=178) (actual time=0.059..18.178 rows=500 loops=1)
-> Nested Loop (cost=0.84..2038.09 rows=271 width=178) (actual time=0.049..13.713 rows=500 loops=1)
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..43.53 rows=271 width=8) (actual time=0.021..2.439 rows=500 loops=1)
Index Cond: ((event_id > 141815) AND (event_id <= 153815) AND (tag = 'carts-0'::text))
Heap Fetches: 0
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..7.36 rows=1 width=178) (actual time=0.007..0.007 rows=1 loops=500)
Index Cond: (ordering = x3.event_id)
Planning Time: 0.192 ms
Execution Time: 20.472 ms
(9 rows)
- This is now a query for a completely new projection that has the full journal to catch up on:
explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 15)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.13..219.43 rows=500 width=178) (actual time=0.238..16.581 rows=500 loops=1)
-> Merge Join (cost=1.13..13757.42 rows=31508 width=178) (actual time=0.228..12.262 rows=500 loops=1)
Merge Cond: (x2.ordering = x3.event_id)
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..8426.16 rows=153544 width=178) (actual time=0.012..2.934 rows=515 loops=1)
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..4595.69 rows=31508 width=8) (actual time=0.023..2.304 rows=500 loops=1)
Index Cond: ((event_id > 15) AND (event_id <= 153815) AND (tag = 'carts-0'::text))
Heap Fetches: 0
Planning Time: 0.221 ms
Execution Time: 18.745 ms
(9 rows)
CONCLUSION: Using the tag table column (event_id) for filtering makes the query execution time much faster and more stable.
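For anyone wanting to dig further into why the modified filter is faster and more stable, adding buffer statistics to the explain would show how much data each plan touches (a sketch; BUFFERS output was not captured in these runs):

```sql
-- Same modified query, with per-node shared buffer hit/read counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT x2."ordering", x2."persistence_id", x2."sequence_number",
       x2."writer", x2."write_timestamp", x2."event_payload",
       x2."event_ser_id", x2."event_ser_manifest"
FROM "event_journal" x2, "event_tag" x3
WHERE x3."tag" = 'carts-0'
  AND x3."event_id" > 141815
  AND x3."event_id" <= 153815
  AND x2."ordering" = x3."event_id"
ORDER BY x2."ordering"
LIMIT 500;
```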
Filter and order over a single index
Taking the baseline queryByTag queries, prefer using the tag table columns both for filtering and for ordering, so the filter and the ORDER BY can be satisfied from the event_tag index:
- A somewhat regular query:
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 151815)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x3."event_id" limit 500;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.84..379.30 rows=45 width=186) (actual time=0.063..14.669 rows=402 loops=1)
-> Nested Loop (cost=0.84..379.30 rows=45 width=186) (actual time=0.051..10.953 rows=402 loops=1)
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..7.61 rows=45 width=8) (actual time=0.019..2.092 rows=402 loops=1)
Index Cond: ((event_id > 151815) AND (event_id <= 153815) AND (tag = 'carts-0'::text))
Heap Fetches: 0
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..8.26 rows=1 width=178) (actual time=0.007..0.007 rows=1 loops=402)
Index Cond: (ordering = x3.event_id)
Planning Time: 0.215 ms
Execution Time: 16.660 ms
(9 rows)
Of course, there are a million external conditions that may be affecting this quick-and-dirty study (running the DB in Docker on a local laptop, a small dataset of 150k events instead of millions,...), but what I find most relevant about the change is the fact that the execution time becomes stable (I'm not that concerned about whether or when it is faster).
BONUS
The order by x2."ordering" clause uses the index in the journal table instead of the index on the event_tag table. I've done some tests changing both the filter criteria (as detailed in the description) and the ordering criteria so it was order by event_tag.event_id. That new change in the ordering criteria made no impact and the execution time remained stable around 18ms (+/-2ms):
Execution Time | 1st query (151815-153815) | 2nd query (141815-153815) | 3rd query (15-153815) |
---|---|---|---|
original slick | ~50ms | ~450ms | ~18ms |
modify filter | ~17.5ms | ~20.4ms | ~18.7ms |
modify filter and ordering | ~16.6ms | ~21.2ms | ~20.1ms |
Details
- A somewhat regular query:
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 151815)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x3."event_id" limit 500;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.84..379.30 rows=45 width=186) (actual time=0.063..14.669 rows=402 loops=1)
-> Nested Loop (cost=0.84..379.30 rows=45 width=186) (actual time=0.051..10.953 rows=402 loops=1)
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..7.61 rows=45 width=8) (actual time=0.019..2.092 rows=402 loops=1)
Index Cond: ((event_id > 151815) AND (event_id <= 153815) AND (tag = 'carts-0'::text))
Heap Fetches: 0
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..8.26 rows=1 width=178) (actual time=0.007..0.007 rows=1 loops=402)
Index Cond: (ordering = x3.event_id)
Planning Time: 0.215 ms
Execution Time: 16.660 ms
(9 rows)
- Then we try a big offset and equal limit.
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 141815)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x3."event_id" limit 500;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.84..2038.09 rows=271 width=186) (actual time=0.067..18.823 rows=500 loops=1)
-> Nested Loop (cost=0.84..2038.09 rows=271 width=186) (actual time=0.055..14.065 rows=500 loops=1)
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..43.53 rows=271 width=8) (actual time=0.024..2.660 rows=500 loops=1)
Index Cond: ((event_id > 141815) AND (event_id <= 153815) AND (tag = 'carts-0'::text))
Heap Fetches: 0
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..7.36 rows=1 width=178) (actual time=0.008..0.008 rows=1 loops=500)
Index Cond: (ordering = x3.event_id)
Planning Time: 0.215 ms
Execution Time: 21.266 ms
(9 rows)
- This is now a query for a completely new projection that has the full journal to catch up on:
explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 15)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x3."event_id" limit 500;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.13..219.43 rows=500 width=186) (actual time=0.229..17.719 rows=500 loops=1)
-> Merge Join (cost=1.13..13757.42 rows=31508 width=186) (actual time=0.217..13.210 rows=500 loops=1)
Merge Cond: (x2.ordering = x3.event_id)
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..8426.16 rows=153544 width=178) (actual time=0.014..2.956 rows=515 loops=1)
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.42..4595.69 rows=31508 width=8) (actual time=0.025..2.509 rows=500 loops=1)
Index Cond: ((event_id > 15) AND (event_id <= 153815) AND (tag = 'carts-0'::text))
Heap Fetches: 0
Planning Time: 0.220 ms
Execution Time: 20.184 ms
(9 rows)
MENTAL NOTE: I suspect the reason for the observed behavior is that the primary key on event_tag is defined as PRIMARY KEY(event_id, tag) (that is, event_id first). I wonder if we would achieve similar improvements without changing the SQL query and only changing the order of the columns in the PK to PRIMARY KEY(tag, event_id).
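If someone wants to try that variant, it would presumably amount to something like the following hypothetical DDL (not tested in this thread):

```sql
-- Swap the column order of the primary key so the backing index leads with tag.
ALTER TABLE event_tag DROP CONSTRAINT event_tag_pkey;
ALTER TABLE event_tag ADD CONSTRAINT event_tag_pkey PRIMARY KEY (tag, event_id);
-- Refresh planner statistics afterwards.
VACUUM ANALYZE event_tag;
```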
We might avoid the sequential scan in the very slow (400-500ms) case:
-> Seq Scan on event_tag x3 (cost=0.00..2740.32 rows=31560 width=8) (actual time=0.012..156.151 rows=31132 loops=1)
Filter: ((tag)::text = 'carts-0'::text)
Rows Removed by Filter: 122683
Oh, I posted my comment before reading your last comment.
Yes, the order in the index matters. And indeed, we already have an index because of the primary key.
I would not make it a primary key though. Although it is supposed to be unique, we can trust that we will produce unique tags per event. We may just make an index on (tag, eventId).
We may just make an index on (tag, eventId).
Won't that be worse? The cardinality on tag is very low compared to the cardinality on event_id.
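For context, a quick way to compare those cardinalities on an existing database would be something like this sketch (not run in the thread):

```sql
-- Actual distinct counts...
SELECT count(DISTINCT tag)      AS distinct_tags,
       count(DISTINCT event_id) AS distinct_event_ids
FROM event_tag;

-- ...and what the planner believes (a negative n_distinct means a fraction of the row count).
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'event_tag';
```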
I tried it out. Dropping the PK and using the index on (tag, event_id) doesn't seem to help:
shopping-cart=# VACUUM ANALYZE;
VACUUM
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 141815)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.84..1846.87 rows=500 width=178) (actual time=1341.616..1377.089 rows=500 loops=1)
-> Merge Join (cost=0.84..8939.31 rows=2421 width=178) (actual time=1341.604..1372.965 rows=500 loops=1)
Merge Cond: (x2.ordering = x3.event_id)
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..8429.62 rows=153815 width=178) (actual time=0.016..722.778 rows=144321 loops=1)
-> Index Only Scan using tags_tag_ordering_index on event_tag x3 (cost=0.42..94.89 rows=2421 width=8) (actual time=0.057..2.471 rows=500 loops=1)
Index Cond: ((tag = 'carts-0'::text) AND (event_id > 141815) AND (event_id <= 153815))
Heap Fetches: 0
Planning Time: 0.379 ms
Execution Time: 1379.280 ms
(9 rows)
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x3."event_id" > 141815)
and (x3."event_id" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.84..1846.87 rows=500 width=178) (actual time=1341.304..1379.648 rows=500 loops=1)
-> Merge Join (cost=0.84..8939.31 rows=2421 width=178) (actual time=1341.292..1375.034 rows=500 loops=1)
Merge Cond: (x2.ordering = x3.event_id)
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..8429.62 rows=153815 width=178) (actual time=0.017..728.635 rows=144321 loops=1)
-> Index Only Scan using tags_tag_ordering_index on event_tag x3 (cost=0.42..94.89 rows=2421 width=8) (actual time=0.049..2.387 rows=500 loops=1)
Index Cond: ((tag = 'carts-0'::text) AND (event_id > 141815) AND (event_id <= 153815))
Heap Fetches: 0
Planning Time: 0.226 ms
Execution Time: 1382.124 ms
(9 rows)
shopping-cart=# explain analyse select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."writer",
x2."write_timestamp",
x2."event_payload",
x2."event_ser_id",
x2."event_ser_manifest"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = 'carts-0')
and (
(x2."ordering" > 141815)
and (x2."ordering" <= 153815))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.95..393.15 rows=500 width=178) (actual time=248.595..284.385 rows=500 loops=1)
-> Merge Join (cost=0.95..1877.26 rows=2392 width=178) (actual time=248.585..280.025 rows=500 loops=1)
Merge Cond: (x2.ordering = x3.event_id)
-> Index Scan using event_journal_ordering_idx on event_journal x2 (cost=0.42..717.04 rows=11821 width=178) (actual time=0.020..13.114 rows=2506 loops=1)
Index Cond: ((ordering > 141815) AND (ordering <= 153815))
-> Index Only Scan using tags_tag_ordering_index on event_tag x3 (cost=0.42..1029.06 rows=31122 width=8) (actual time=0.029..127.877 rows=29233 loops=1)
Index Cond: (tag = 'carts-0'::text)
Heap Fetches: 0
Planning Time: 0.197 ms
Execution Time: 286.567 ms
(10 rows)
shopping-cart=# \d event_table;
Did not find any relation named "event_table".
shopping-cart=# \d event_tag;
Table "public.event_tag"
Column | Type | Collation | Nullable | Default
----------+------------------------+-----------+----------+---------
event_id | bigint | | not null |
tag | character varying(256) | | not null |
Indexes:
"tags_tag_ordering_index" btree (tag, event_id)
Foreign-key constraints:
"fk_event_journal" FOREIGN KEY (event_id) REFERENCES event_journal(ordering) ON DELETE CASCADE
shopping-cart=# \d tags_tag_ordering_index ;
Index "public.tags_tag_ordering_index"
Column | Type | Key? | Definition
----------+------------------------+------+------------
tag | character varying(256) | yes | tag
event_id | bigint | yes | event_id
btree, for table "public.event_tag"
hmm
I did a backup/restore of the database and I can no longer reproduce these numbers:
Execution Time | 1st query (151815-153815) | 2nd query (141815-153815) | 3rd query (15-153815) |
---|---|---|---|
original slick | ~50ms | ~450ms | ~18ms |
modify filter | ~17.5ms | ~20.4ms | ~18.7ms |
Table 1: Execution times for the queries presented in the description of the PR, over a database created by inserting 150k events from the application via a bash+gcurl for loop.
Instead, I get:
Execution Time | 1st query (151815-153815) | 2nd query (141815-153815) | 3rd query (15-153815) |
---|---|---|---|
original slick | ~350ms | ~280ms | ~18ms |
modify filter | ~17ms | ~1300ms | ~19ms |
Table 2: Execution times for the queries presented in the description of the PR, over a database restored from a backup (the backup was generated from the database used to produce the data in Table 1 above).
I'd like someone else's input on the benchmark. I can help set up the database with a sample project and all.
I did a backup/restore of the database and I can no longer reproduce these numbers:
And yet, if I recreate the database organically again, I get the results from the description.
I'm guessing the issue is related to the order in which data is written. When creating the database organically, events, tags, projected data and offset data are all accessed and modified randomly. Instead, when restoring the database, data is restored in order, table by table (e.g., all events first, then tags, etc.), resulting in a different data layout on disk and hence different query plans.
I don't know...
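If the physical-layout hypothesis is right, it should be visible in the planner's correlation statistic for the join and ordering columns; a sketch of how one might check (not run in the thread):

```sql
-- correlation near 1.0 means the column's values are stored almost in physical order;
-- an organically written database and a restored backup could differ here.
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename IN ('event_journal', 'event_tag')
  AND attname  IN ('ordering', 'event_id', 'tag');
```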
I've slept on this one for a bit and I'd like to push it forward.
Maybe what we can do is hide the change behind a config setting (e.g. akka.persistence.jdbc.experimental.performance.queryByTag.prefer-tag-index = true) so users can try it out and report back with real-world metrics regarding the impact of this change on different datasets and machines.
Something like https://github.com/ignasi35/akka-persistence-jdbc/pull/1
I prefer not to add any flag; that will only increase the complexity. And since we have been discussing a review of that whole tagging / slicing design, I don't think it's worth it.
I think this can be pushed forward as is, then.
I don't see the same results, or I missed something.
I created the following (simplified) tables and populated with 3 million rows:
CREATE TABLE IF NOT EXISTS public.event_journal(
ordering BIGSERIAL,
persistence_id VARCHAR(255) NOT NULL,
sequence_number BIGINT NOT NULL,
event_payload VARCHAR(255) NOT NULL,
PRIMARY KEY(persistence_id, sequence_number)
);
CREATE UNIQUE INDEX event_journal_ordering_idx ON public.event_journal(ordering);
CREATE TABLE IF NOT EXISTS public.event_tag(
event_id BIGINT,
tag VARCHAR(256),
PRIMARY KEY(event_id, tag),
CONSTRAINT fk_event_journal
FOREIGN KEY(event_id)
REFERENCES event_journal(ordering)
ON DELETE CASCADE
);
INSERT INTO event_journal SELECT n, 'pid-' || n, 1, 'evt-1' FROM generate_series(1, 1000000) as n;
INSERT INTO event_journal SELECT 1000000 + n, 'pid-' || n, 2, 'evt-2' FROM generate_series(1, 1000000) as n;
INSERT INTO event_journal SELECT 2000000 + n, 'pid-' || n, 3, 'evt-3' FROM generate_series(1, 1000000) as n;
INSERT INTO event_tag SELECT n, mod(n, 256)::bit(8) FROM generate_series(1, 1000000) as n;
INSERT INTO event_tag SELECT 1000000 + n, mod(n, 256)::bit(8) FROM generate_series(1, 1000000) as n;
INSERT INTO event_tag SELECT 2000000 + n, mod(n, 256)::bit(8) FROM generate_series(1, 1000000) as n;
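One difference from the organically grown database that may be worth ruling out: after a bulk load like this, planner statistics and visibility maps can be stale, so it is probably worth refreshing them before timing the queries (the earlier benchmark did run VACUUM ANALYZE; a sketch):

```sql
-- Refresh planner statistics and the visibility map (the latter matters for index-only scans).
VACUUM ANALYZE event_journal;
VACUUM ANALYZE event_tag;
```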
The original query:
explain analyze
select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."event_payload"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = '00000110')
and (
(x2."ordering" > 2999000)
and (x2."ordering" <= 3000000))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
Limit (cost=45935.14..45935.38 rows=2 width=32) (actual time=88.859..90.649 rows=4 loops=1)
-> Gather Merge (cost=45935.14..45935.38 rows=2 width=32) (actual time=88.816..90.512 rows=4 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=44935.12..44935.13 rows=1 width=32) (actual time=84.768..84.821 rows=1 loops=3)
Sort Key: x2.ordering
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=0.43..44935.11 rows=1 width=32) (actual time=78.235..84.761 rows=1 loops=3)
-> Parallel Seq Scan on event_journal x2 (cost=0.00..43748.00 rows=272 width=32) (actual time=74.138..76.791 rows=333 loops=3)
Filter: ((ordering > 2999000) AND (ordering <= 3000000))
Rows Removed by Filter: 999667
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.43..4.36 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=1000)
Index Cond: ((event_id = x2.ordering) AND (tag = '00000110'::text))
Heap Fetches: 0
Planning Time: 0.207 ms
Execution Time: 90.734 ms
Changing the query to what I think is suggested in this PR:
explain analyze
select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."event_payload"
from "event_journal" x2, "event_tag" x3
where (
(x3."tag" = '00000110')
and (
(x3."event_id" > 2999000)
and (x3."event_id" <= 3000000))
)
and (x2."ordering" = x3."event_id")
order by x2."ordering" limit 500;
Limit (cost=41816.87..41816.88 rows=4 width=32) (actual time=12509.231..12510.244 rows=4 loops=1)
-> Sort (cost=41816.87..41816.88 rows=4 width=32) (actual time=12509.192..12510.143 rows=4 loops=1)
Sort Key: x2.ordering
Sort Method: quicksort Memory: 25kB
-> Gather (cost=1037.18..41816.83 rows=4 width=32) (actual time=12504.713..12510.019 rows=4 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Join (cost=37.18..40816.43 rows=2 width=32) (actual time=12502.565..12505.208 rows=1 loops=3)
Hash Cond: (x2.ordering = x3.event_id)
-> Parallel Seq Scan on event_journal x2 (cost=0.00..37498.00 rows=1250000 width=32) (actual time=0.033..6249.080 rows=1000000 loops=3)
-> Hash (cost=37.13..37.13 rows=4 width=8) (actual time=0.301..0.319 rows=4 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Only Scan using event_tag_pkey on event_tag x3 (cost=0.43..37.13 rows=4 width=8) (actual time=0.122..0.248 rows=4 loops=3)
Index Cond: ((event_id > 2999000) AND (event_id <= 3000000) AND (tag = '00000110'::text))
Heap Fetches: 0
Planning Time: 0.228 ms
Execution Time: 12510.329 ms
I find it strange that we don't use the primary key in the foreign key, so I tried that instead:
CREATE TABLE IF NOT EXISTS public.event_journal2(
ordering BIGSERIAL,
persistence_id VARCHAR(255) NOT NULL,
sequence_number BIGINT NOT NULL,
event_payload VARCHAR(255) NOT NULL,
PRIMARY KEY(persistence_id, sequence_number)
);
CREATE UNIQUE INDEX event_journal2_ordering_idx ON public.event_journal2(ordering);
CREATE TABLE IF NOT EXISTS public.event_tag2(
persistence_id VARCHAR(255) NOT NULL,
sequence_number BIGINT NOT NULL,
tag VARCHAR(256),
PRIMARY KEY(persistence_id, sequence_number, tag),
CONSTRAINT fk_event_journal2
FOREIGN KEY(persistence_id, sequence_number)
REFERENCES event_journal2(persistence_id, sequence_number)
ON DELETE CASCADE
);
INSERT INTO event_journal2 SELECT n, 'pid-' || n, 1, 'evt-1' FROM generate_series(1, 1000000) as n;
INSERT INTO event_journal2 SELECT 1000000 + n, 'pid-' || n, 2, 'evt-2' FROM generate_series(1, 1000000) as n;
INSERT INTO event_journal2 SELECT 2000000 + n, 'pid-' || n, 3, 'evt-3' FROM generate_series(1, 1000000) as n;
INSERT INTO event_tag2 SELECT 'pid-' || n, 1, mod(n, 256)::bit(8) FROM generate_series(1, 1000000) as n;
INSERT INTO event_tag2 SELECT 'pid-' || n, 2, mod(n, 256)::bit(8) FROM generate_series(1, 1000000) as n;
INSERT INTO event_tag2 SELECT 'pid-' || n, 3, mod(n, 256)::bit(8) FROM generate_series(1, 1000000) as n;
Query:
explain analyze
select x2."ordering",
x2."persistence_id",
x2."sequence_number",
x2."event_payload"
from "event_journal2" x2, "event_tag2" x3
where (
(x3."tag" = '00000110')
and (
(x2."ordering" > 2999000)
and (x2."ordering" <= 3000000))
)
and (x2."persistence_id" = x3."persistence_id")
and (x2."sequence_number" = x3."sequence_number")
order by x2."ordering" limit 500;
Limit (cost=0.86..4645.03 rows=4 width=32) (actual time=8.538..46.870 rows=4 loops=1)
-> Nested Loop (cost=0.86..4645.03 rows=4 width=32) (actual time=8.520..46.711 rows=4 loops=1)
-> Index Scan using event_journal2_ordering_idx on event_journal2 x2 (cost=0.43..45.45 rows=1051 width=32) (actual time=0.019..9.512 rows=1000 loops=1)
Index Cond: ((ordering > 2999000) AND (ordering <= 3000000))
-> Index Only Scan using event_tag2_pkey on event_tag2 x3 (cost=0.43..4.38 rows=1 width=18) (actual time=0.018..0.018 rows=0 loops=1000)
Index Cond: ((persistence_id = (x2.persistence_id)::text) AND (sequence_number = x2.sequence_number) AND (tag = '00000110'::text))
Heap Fetches: 0
Planning Time: 0.409 ms
Execution Time: 46.957 ms
I don't see the same results, or I missed something.
I created the following (simplified) tables and populated with 3 million rows:
There are two steps I followed when running these benchmarks:
- I used the exact schema from the version I was testing,
- I inserted real data (not just synthesized rows), simulating real-world usage as much as possible.
Then, if I wanted to test a schema change, I avoided using ALTER TABLE or adding an index a posteriori and then re-analyzing to refresh the index statistics. Instead, I started the whole experiment again from scratch. I even avoided using backup/restore tools, as I noticed they also changed the results of my benchmarks.
One reason I wanted to use realistic-looking load was not only so that data in the table would fill pages in a more realistic way (bigger payloads, longer persistence IDs,...), but also because rows of a given persistence ID would interleave with other rows a bit more unpredictably. I used https://github.com/ignasi35/dynamic-projection-experiments/blob/main/runner.sh to generate that traffic (on the app in that same repo).
To make things a bit more realistic, I also started a projection, causing some extra load on the DB. Because why not.
I don't see the same results, or I missed something.
One reason I stopped using backup/restore or alter-schema approaches was that I would easily lose the ability to reproduce the benchmark results. See my comments dated Jul 15, 2021 (https://github.com/akka/akka-persistence-jdbc/pull/560#issuecomment-880750845 and later).
@patriknw I've observed that in some of your query plans your DB spawns multiple workers, which makes me suspect we ran the benchmarks over different-sized databases. All I can remember is that I used https://github.com/ignasi35/dynamic-projection-experiments/blob/main/docker-compose.yml without further tuning, but I don't know how much power I had assigned to Docker Desktop at the time of the benchmarking.
Another difference is that you used 3M synthetic events (rows in the journal), while in my tests I always used datasets one order of magnitude smaller (150k-180k).
Closing this since we don't have the bandwidth to investigate further right now and we haven't seen any reported problems.