akka-persistence-jdbc
Scalability issue - DAO does a seq scan for eventsByTag
Hello, we are building our product on Lagom, and one day we suddenly saw high CPU usage.
We are using Lagom version 3.10.
We found that it is caused by the eventsByTag query on the journal table, which does a sequential scan; once the table holds more than a million (10 lakh) events, it brings the system down.
Is this the desired behavior, or am I missing something basic here?
Is there any specific reason why we are using LIKE in the tag query?
See #168 and #172. In order to support multiple tags, a poor design choice was made to store all tags in a single column and to use a LIKE query to check whether an event has the tag.
In practice the queries often perform okay, since there should be a unique index on the ordering column.
I am investigating the performance impact of storing tags in a separate table. Otherwise, a solution might be to allow only a single tag (see #172).
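For reference, the query the DAO currently issues has roughly this shape (a sketch; table and column names assumed to follow the default schema):

-- Approximate shape of the generated eventsByTag query. The leading
-- wildcard in the LIKE pattern defeats any btree index on "tags", so
-- Postgres either walks the ordering index and filters every row, or
-- falls back to a sequential scan when the range is wide.
SELECT "ordering", "deleted", "persistence_id", "sequence_number", "message", "tags"
FROM journal
WHERE "tags" LIKE '%mytag%'
  AND "ordering" > 0 AND "ordering" <= 1000
ORDER BY "ordering"
LIMIT 100;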
@WellingR in our case we are using Lagom, and the events of multiple PersistentEntities go to the same journal table.
Since the query combines an ordering > ? range with a LIKE condition on the tags column, Postgres needs to do a sequential scan every time, which kills the system even for a small number of events.
I am a little skeptical about the queries often performing okay.
Also, can you confirm that Lagom only uses the single-tag feature? Then we could contribute a Lagom-optimized flag which changes the LIKE to =.
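Something like this is what I have in mind (a sketch only, assuming the default schema and at most one tag per event; the index is my own suggestion, not part of the plugin):

-- With exactly one tag per event, the LIKE can become a plain equality,
-- which a btree index supports. A composite index on (tags, ordering)
-- serves the filter, the range, and the ORDER BY in one pass.
CREATE INDEX journal_tag_ordering_idx ON journal ("tags", "ordering");

SELECT "ordering", "deleted", "persistence_id", "sequence_number", "message", "tags"
FROM journal
WHERE "tags" = 'mytag'
  AND "ordering" > 0 AND "ordering" <= 1000
ORDER BY "ordering"
LIMIT 100;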
@WellingR also I think that, to support multiple tags in the same table, we could try making the tags column a JSONB list and then creating a GIN index on top of it. This might scale better with Postgres 9.4+.
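Roughly like this (an untested sketch; the cast and names are assumptions):

-- Convert the comma-separated tags column into a jsonb array and index
-- it with GIN, so tag lookups become index-backed containment checks.
ALTER TABLE journal
  ALTER COLUMN "tags" TYPE jsonb
  USING array_to_json(string_to_array("tags", ','))::jsonb;

CREATE INDEX journal_tags_gin_idx ON journal USING GIN ("tags");

-- The tag predicate then changes from LIKE to containment:
SELECT * FROM journal WHERE "tags" @> '["mytag"]'::jsonb;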
Completely agree. Fortunately, you can override the journal's schema and queries as needed by providing a custom DAO. Some time ago (maybe a year) I faced the same problems (query performance and the tag prefix sharing problem). The solution was quite simple:
- alternative schema (changed the tags column type from varchar to jsonb to keep an array, added a GIN index)
- alternative DAO (changed queries, a little coding, a little configuring)
- simple data migration (the old journal was left unchanged; a new one was made, see the sketch below)
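The migration step was along these lines (a sketch; names are assumptions, and the primary key and the sequence behind ordering have to be recreated on the new table):

-- Build a new journal with jsonb tags from the old one, which stays
-- untouched as a fallback.
CREATE TABLE journal_v2 AS
SELECT "ordering", "deleted", "persistence_id", "sequence_number", "message",
       array_to_json(string_to_array("tags", ','))::jsonb AS "tags"
FROM journal;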
I am working on an implementation where tags are stored in a separate table; however, I had some trouble getting all tests green again, and I have not yet investigated whether this significantly hurts insertion performance.
@two10 Thanks for the suggestion to use a JSONB list with a GIN index. I should see whether something like this is possible in the other database types that we support.
I hope that we can properly resolve this issue once and for all in akka-persistence-jdbc 4.0.0 (including multiple tag support), but it might take a while before it is ready.
Until that point, a fix for #172 would be welcome as a workaround.
I can confirm that Lagom only adds a single tag per event (at most).
I've worked around this in PostgreSQL by modifying the schema slightly, adding an index. This relies on the fact that you can create indexes on expressions in Postgres. So, we add a GIN index on the tags, split into a text array:
-- GIN expression index over the comma-separated tags column
CREATE INDEX split_journal_tags_index
  ON journal
  USING GIN (string_to_array("tags", ','));
Then, we can override the eventsByTag method in the standard DAO so that the query uses the array containment operator over the same string_to_array expression (Postgres only uses an expression index when the query repeats the indexed expression):
import akka.NotUsed
import akka.persistence.PersistentRepr
import akka.serialization.Serialization
import akka.stream.Materializer
import akka.stream.scaladsl.Source
import slick.jdbc.JdbcBackend.Database
import slick.jdbc.{GetResult, JdbcProfile}

import scala.concurrent.ExecutionContext
import scala.util.Try

// akka-persistence-jdbc internals (package paths as of the 3.x series)
import akka.persistence.jdbc.config.ReadJournalConfig
import akka.persistence.jdbc.journal.dao.JournalRow
import akka.persistence.jdbc.query.dao.ByteArrayReadJournalDao

class TagArrayReadJournalDao(
    db: Database,
    profile: JdbcProfile,
    readJournalConfig: ReadJournalConfig,
    serialization: Serialization
)(implicit ec: ExecutionContext, mat: Materializer)
    extends ByteArrayReadJournalDao(db, profile, readJournalConfig, serialization) {

  import profile.api._
  import readJournalConfig.journalTableConfiguration.columnNames._

  // Journal table name from the configuration (prepend the schema here if you use one)
  private val theTableName = readJournalConfig.journalTableConfiguration.tableName

  // Maps a plain-SQL result row onto JournalRow; the field order must
  // match the column order of the SELECT below
  private implicit val journalRowGetResult: GetResult[JournalRow] =
    GetResult(r =>
      JournalRow(r.nextLong(), r.nextBoolean(), r.nextString(), r.nextLong(), r.nextBytes(), r.nextStringOption())
    )

  override def eventsByTag(
      tag: String,
      offset: Long,
      maxOffset: Long,
      max: Long
  ): Source[Try[(PersistentRepr, Set[String], Long)], NotUsed] =
    Source
      .fromPublisher {
        db.stream(
          sql"""
            SELECT "#$ordering", "#$deleted", "#$persistenceId", "#$sequenceNumber", "#$message", "#$tags"
            FROM #$theTableName
            WHERE string_to_array("#$tags", ',') @> array[$tag]::text[]
              AND "#$ordering" > $offset
              AND "#$ordering" <= $maxOffset
            ORDER BY "#$ordering"
            LIMIT $max
          """.as[JournalRow]
        )
      }
      .via(serializer.deserializeFlow)
}
I'm not sure this is the best way to do this and I haven't tested it much, but my database load seems to have dropped to about a quarter of what it was.
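In case anyone wants to try this: a custom read journal DAO like the one above is normally wired in via the dao setting of the jdbc-read-journal configuration section in application.conf (pointing at the fully qualified class name), so no fork of the plugin is needed.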