Akka.Persistence.PostgreSql icon indicating copy to clipboard operation
Akka.Persistence.PostgreSql copied to clipboard

Use array datatype for tags storage and searching

Open CumpsD opened this issue 5 years ago • 14 comments

Apparently tags is a semicolon-separated list with a LIKE %% query against it, which is not that performant when you have millions of events.

PostgreSQL can turn tags into an array type and efficiently index it with a GIN index.

This PR turns the tags column in an array type and changes how tags are inserted and searched for.

CumpsD avatar Mar 02 '20 18:03 CumpsD

Looks like we need to update the build system here

Aaronontheweb avatar Mar 02 '20 19:03 Aaronontheweb

Where is it defined? :p

CumpsD avatar Mar 02 '20 19:03 CumpsD

@CumpsD Sorry, can you resolve the merge conflict?

Arkatufus avatar Mar 05 '21 17:03 Arkatufus

@Arkatufus I'll try to get around to it as fast as I can, somewhere next week

CumpsD avatar Mar 05 '21 19:03 CumpsD

@CumpsD a couple notes on this:

  • If this is a breaking change we will need a migration script/application to handle, and we should probably have proper documentation for users around how to use in the upgrade.

  • Would you be able to help work such a change into Akka.Persistence.Linq2Db? The hope is that we are able to keep backwards compatibility for people who want to migrate from existing journals, but from a perf standpoint it's much faster so we are hoping people will adopt. I can assist ofc.

to11mtm avatar Mar 06 '21 17:03 to11mtm

I did a performance test using 7 tags and 2 million records using the latest (13.2) PostgreSql docker image, the result is not encouraging.

Current implementation Persist()-ing 2000000 took 4020253.7091 ms Querying took 5390.3155 ms Querying took 5039.1806 ms Querying took 5053.6488 ms Querying took 5045.1138 ms Querying took 4984.1766 ms Querying took 5216.0742 ms Querying took 5095.1884 ms Querying took 4991.0002 ms Querying took 5712.957 ms Querying took 4915.7664 ms Query average time: 5144.34215 ms, 194388.31454863475 msg/sec Median time: 5049.3813 ms

==============================

Tag-array Persist()-ing 2000000 took 4046318.0437 ms Querying took 5587.1745 ms Querying took 6423.1314 ms Querying took 5028.1429 ms Querying took 5127.729 ms Querying took 5073.5635 ms Querying took 5310.665 ms Querying took 5388.9509 ms Querying took 5120.5908 ms Querying took 5128.799 ms Querying took 8580.9275 ms Query average time: 5676.967449999999 ms, 176150.38465651235 msg/sec Median time: 5219.732 ms

Arkatufus avatar Mar 08 '21 12:03 Arkatufus

If it helps to get an idea about scale, we have 200 million events :D

CumpsD avatar Mar 08 '21 12:03 CumpsD

There are a few outliers, but if you look at the median of both tests, you see that there are no significant improvement in performance between the 2 implementation.

Arkatufus avatar Mar 08 '21 12:03 Arkatufus

@CumpsD feel free to clone #81 and run a 200 million event test on it, my educated guess is that there wont be any significant improvement over the query.

Arkatufus avatar Mar 12 '21 23:03 Arkatufus

@Arkatufus just wanted to let you know I have not lost sight of this, will do when I find some time in the coming week(s)

CumpsD avatar Mar 16 '21 17:03 CumpsD

Interesting - is this still on the table or should we be looking at Linq2Db instead?

Aaronontheweb avatar Jun 16 '22 14:06 Aaronontheweb

Interesting - is this still on the table or should we be looking at Linq2Db instead?

There's advantages and drawbacks to Array datatype.

Obviously, it simplifies parts of the query pipeline, as you're able to write everything into one row and not interleave with tables.

The drawback is (AFAIK) Indexes on arrays are GIN (Generalized inverted) indexes. These can have different performance characteristics than BTrees on inserts and updates. BUT IIRC they can be more size efficient than BTree.

(FWIW, this would be easy to add as an option to Persistence.Linq2Db, Yet another tagmode flag 😅 .)

Side note regarding query performance: It may be prudent to flush Postgres' (and other DBs) internal caches for an operation like this, at least for some scenarios. If rows are recently inserted and pages fit into memory, unless there are a -lot- of tags I wouldn't expect to see much difference in perf.

to11mtm avatar Jun 16 '22 15:06 to11mtm

Sounds like we're better off sticking to Linq2Db :p

Aaronontheweb avatar Jun 16 '22 15:06 Aaronontheweb

I never got around to testing it because I'm not using it anymore. But purely postgress wise, this made more sense :)

CumpsD avatar Jun 23 '22 13:06 CumpsD