akka-persistence-jdbc

Scalability issue - DAO does a seq scan for eventsByTag

Open • two10 opened this issue 6 years ago • 8 comments

Hello, we are building our product on Lagom, and one day we suddenly saw high CPU usage.

We are using Lagom version 3.10.

We found that it is caused by the eventsByTag query on the journal table, which does a sequential scan. If the table holds more than 10 lakh (one million) events, it cripples the system.

Is this the desired behavior, or am I missing something basic here?

two10 avatar Jun 29 '18 15:06 two10

Is there any specific reason why we are using LIKE in the tag query?

two10 avatar Jul 03 '18 07:07 two10

See #168 and #172. In order to support multiple tags, a poor design choice was made to store all tags in a single column and to use a LIKE query to check whether an event has the tag.
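
For illustration, the read query matches tags with a wildcard pattern roughly like the sketch below (not the exact SQL the plugin emits; table and column names are the defaults, and 'myTag' and the bounds are placeholders). A leading-wildcard LIKE cannot use a btree index on the tags column, whereas the single-tag restriction from #172 would reduce it to a plain, indexable equality:

-- Sketch of the multi-tag case: tags live comma-separated in one
-- varchar column, so matching needs a leading wildcard.
SELECT * FROM journal
WHERE tags LIKE '%myTag%'
  AND ordering > 0
  AND ordering <= 1000
ORDER BY ordering
LIMIT 100;

-- Single-tag variant discussed in #172: WHERE tags = 'myTag'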

In practice the queries often perform okay, since there should be a unique index on the ordering column.

I am investigating the performance impact of storing tags in a separate table. Otherwise a solution might be to allow a single tag only (see #172).

WellingR avatar Jul 03 '18 19:07 WellingR

@WellingR in our case we are using Lagom, and events from multiple PersistentEntities go to the same journal table.

Now, since the query combines ordering > something with the LIKE condition on tags, Postgres needs to do a seq scan every time, which kills the system even for a small number of events.

I am a little skeptical about "the queries often perform okay".

Also, can you confirm that Lagom only uses the single-tag feature? Then we could contribute a Lagom-optimized flag that changes LIKE to =.

two10 avatar Jul 04 '18 03:07 two10

@WellingR also I think that, to support multiple tags in the same table, we could try making the tags column a JSONB list and then create a GIN index on top of it. This might scale better with Postgres 9.4+.

two10 avatar Jul 04 '18 03:07 two10

Completely agree. Fortunately, you can override the journal's schema and queries as needed by providing a custom DAO. Some time ago (maybe a year) I faced the same problems (query performance and the tag-prefix sharing problem). The solution was quite simple:

  • alternative schema (changed the tags column type from varchar to jsonb to hold an array, added a GIN index; see the sketch after this list)
  • alternative DAO (changed queries, a little coding, a little configuring)
  • simple data migration (left the old journal unchanged, made a new one)
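
A rough sketch of that alternative schema, assuming PostgreSQL 9.4+ and the default journal table and column names (the custom DAO and the migration itself are omitted; names here are illustrative):

-- Store the tags as a jsonb array instead of a comma-separated varchar.
ALTER TABLE journal
  ALTER COLUMN tags TYPE jsonb
  USING array_to_json(string_to_array(tags, ','))::jsonb;

-- A GIN index lets Postgres answer jsonb containment checks from the index.
CREATE INDEX journal_tags_gin_idx ON journal USING GIN (tags);

-- The custom DAO can then filter with the containment operator:
--   WHERE tags @> '["myTag"]'::jsonb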

robot-govorun avatar Jul 04 '18 05:07 robot-govorun

I am working on an implementation where tags are stored in a separate table; however, I had some trouble getting all tests green again, and I have not yet investigated whether this significantly hurts insertion performance.
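
For illustration, a separate-table layout along these lines is the kind of approach being investigated (a sketch with hypothetical names, not the plugin's actual schema):

-- One row per (event, tag) pair, so a tag lookup becomes an indexed
-- join against the journal instead of a LIKE scan over it. Relies on
-- the unique index on journal.ordering mentioned above.
CREATE TABLE event_tag (
  event_id BIGINT       NOT NULL REFERENCES journal (ordering),
  tag      VARCHAR(255) NOT NULL,
  PRIMARY KEY (event_id, tag)
);

CREATE INDEX event_tag_tag_idx ON event_tag (tag);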

@two10 Thanks for the suggestion to use a JSONB list with GIN index. I should see if something like this is possible in the other database types that we support.

I hope that we can properly resolve this issue once and for all in akka-persistence-jdbc 4.0.0 (including multiple tag support) but it might take a while before it is ready.

Until that point, a fix for #172 would be welcome as a workaround.

WellingR avatar Jul 04 '18 06:07 WellingR

I can confirm that Lagom only adds a single tag per event (at most).

TimMoore avatar Jul 06 '18 08:07 TimMoore

I've worked around this in PostgreSQL by modifying the schema slightly: adding an index. This relies on the fact that Postgres can create indexes on expressions. So we add a GIN index on the tags, split into a text array:

CREATE INDEX split_journal_tags_index
  ON journal
  USING GIN (string_to_array("tags", ','));

Then we can override the eventsByTag method in the standard DAO so the query uses the array containment operator (@>), which this GIN index can serve.

// Imports assume the akka-persistence-jdbc 3.x package layout;
// adjust the paths for your version.
import scala.concurrent.ExecutionContext
import scala.util.Try

import akka.NotUsed
import akka.persistence.PersistentRepr
import akka.persistence.jdbc.config.ReadJournalConfig
import akka.persistence.jdbc.journal.dao.JournalTables.JournalRow
import akka.persistence.jdbc.query.dao.ByteArrayReadJournalDao
import akka.serialization.Serialization
import akka.stream.Materializer
import akka.stream.scaladsl.Source
import slick.jdbc.JdbcBackend.Database
import slick.jdbc.{ GetResult, JdbcProfile }

class TagArrayReadJournalDao(
  db: Database,
  profile: JdbcProfile,
  readJournalConfig: ReadJournalConfig,
  serialization: Serialization
)(implicit
  ec: ExecutionContext,
  mat: Materializer
) extends ByteArrayReadJournalDao(db, profile, readJournalConfig, serialization) {
  import profile.api._
  import readJournalConfig.journalTableConfiguration.columnNames._

  // Optionally schema-qualified journal table name for the plain-SQL query.
  private val theTableName = {
    val cfg = readJournalConfig.journalTableConfiguration
    cfg.schemaName.fold("")(_ + ".") + s""""${cfg.tableName}""""
  }

  // Read a plain-SQL result row into the plugin's JournalRow,
  // in the column order of the SELECT below.
  private implicit val getJournalRow: GetResult[JournalRow] =
    GetResult(r => JournalRow(r.<<, r.<<, r.<<, r.<<, r.<<, r.<<))

  override def eventsByTag(
    tag: String,
    offset: Long,
    maxOffset: Long,
    max: Long
  ): Source[Try[(PersistentRepr, Set[String], Long)], NotUsed] = {
    Source.fromPublisher {
      db.stream(
        sql"""
          SELECT "#$ordering", "#$deleted", "#$persistenceId", "#$sequenceNumber", "#$message", "#$tags"
          FROM #$theTableName
          WHERE string_to_array("#$tags", ',') @> array[$tag]::text[]
          AND "#$ordering" > $offset
          AND "#$ordering" <= $maxOffset
          ORDER BY "#$ordering"
          LIMIT $max
        """.as[JournalRow]
      )
    }.via(serializer.deserializeFlow)
  }
}

I'm not sure this is the best way to do this and I haven't tested it much, but my database load seems to have dropped to about a quarter of what it was.
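
For anyone verifying this workaround: EXPLAIN should show the GIN index being used rather than a Seq Scan, as long as the WHERE expression matches the indexed expression exactly ('myTag' and the bounds below are placeholders):

EXPLAIN
SELECT "ordering"
FROM journal
WHERE string_to_array("tags", ',') @> array['myTag']::text[]
  AND "ordering" > 0
  AND "ordering" <= 1000
ORDER BY "ordering"
LIMIT 100;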

TJSomething avatar Jan 19 '19 08:01 TJSomething