akka-projection
akka-projection copied to clipboard
Fan out single query stream to multiple projections
Akka persistence query implementations typically poll the underlying datastore for updates. If using sharded tags, and many projections, you get a product of your shards and projections, and very soon the load generated by this polling can become non trivial, in some cases executing 100s of queries per second on the database when the system is idle.
A solution to this that we've implemented in Cloudstate is to share projection query streams between multiple projections. So for example, let's say you have a user entity, and a user by email projection, and a user by username projection. We have two projections there, each with their own offset tracking. But we only execute one query on the user eventsByQuery tag, and then in Akka streams, we broadcast the events to both. Since both have their own offset tracking, we need to handle that. When starting the projection, we query the offset for each, and take the lowest, and start the query from that point. Then we dropWhile
in each handler until we've reached the offset it was up to. We also define a buffer, which ensures if a given projection lags a little, it doesn't slow the other projections down until the buffer is full.
By taking this approach, we can safely and confidently add as many projections as we want, without worrying about the idle load we're putting on the database of all those polling queries. From my reading of the API (I could be wrong) this approach would require a new API in Akka projections to do the fan out.
An alternative might be to have a single projection handler that fans out the event handling to update multiple representations of the state, and they would all share the same offset tracking. But this tight coupling between projections can be problematic. Separate offset tracking means we can independently control them, for example we can reset the offset of one of them to rebuild it without having to rebuild all of the projections. We can also have multiple query instances defined, trivially move the projections between query instances, for example if we find one handler is expensive to run and tends to lag behind the others and slow them down, we can simply move it to its own query runner, without needing to update the offset tracking or anything. We also use multiple different query runners for different projection reliability characteristics, some projections update the local database, so we group all of them together, while others update resources in different datacentres, if there's a partition between the datacentres, running them in a separate runner ensures that that partition doesn't affect the projections in the local database.
I think this is an interesting optimization. It's not planned for Akka Projection MVP, but we can always discuss priorities.
As James has described on a Cloudstatemachine issue, this is something we would like to have on the long run, but should not be urgent.