A snapshot of an aggregate may be corrupted due to eventually consistent history
Currently, it is possible that a corrupted aggregate snapshot will be written to storage. The problem happens when an eventually consistent storage is used (e.g. Datastore).
Let's look at the problem using a task creation example. A normal scenario is the following: `TaskCreated` and `TaskAssigned` events are already stored in the aggregate history. Then the aggregate is loaded and the events are played, a `StartTask` command is dispatched, and a `TaskStarted` event is emitted and applied to the aggregate. When the aggregate is stored, the snapshot trigger is reached and a snapshot is stored (see the picture below).
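For illustration, here is a minimal sketch of the happy path in plain Java. The `TaskAggregate` class, the event records, and the in-memory history are simplified stand-ins invented for this example, not the framework's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the framework's event and aggregate types.
record TaskCreated(String taskId) {}
record TaskAssigned(String taskId, String assignee) {}
record TaskStarted(String taskId) {}

class TaskAggregate {
    String id;
    String assignee;
    boolean started;

    // Event applier: each played event mutates the aggregate state.
    void apply(Object event) {
        if (event instanceof TaskCreated e) {
            id = e.taskId();
        } else if (event instanceof TaskAssigned e) {
            assignee = e.assignee();
        } else if (event instanceof TaskStarted e) {
            started = true;
        }
    }
}

class HappyPath {
    public static void main(String[] args) {
        // TaskCreated and TaskAssigned are already in the history.
        List<Object> history = new ArrayList<>(List.of(
                new TaskCreated("task-1"),
                new TaskAssigned("task-1", "alice")));

        // Load: the full history is read back and played.
        TaskAggregate task = new TaskAggregate();
        history.forEach(task::apply);

        // StartTask is dispatched; TaskStarted is emitted and applied.
        TaskStarted started = new TaskStarted("task-1");
        task.apply(started);
        history.add(started);

        // The snapshot trigger fires; the persisted state is complete.
        System.out.println("assignee=" + task.assignee + ", started=" + task.started);
        // Prints: assignee=alice, started=true
    }
}
```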
But after an event was stored in an aggregate history, it may be unavailable during the next history read operations. A problem scenario: `TaskCreated` and `TaskAssigned` events are already stored in the aggregate history. Then the aggregate is loaded and only the `TaskCreated` event is played; the `TaskAssigned` event is not returned by the backward history read due to eventual consistency. A `StartTask` command is dispatched and a `TaskStarted` event is emitted and applied to the aggregate. When the aggregate is stored, the snapshot trigger is reached and a snapshot is stored. But because the `TaskAssigned` event was not available, and hence not played before `TaskStarted` was applied, the task snapshot has a missing assignee (see the picture below).
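Continuing the sketch above, a history read that drops `TaskAssigned` yields exactly the corrupted state described here:

```java
import java.util.List;

class ProblemPath {
    public static void main(String[] args) {
        // Both events are stored, but the eventually consistent read
        // returns only TaskCreated; TaskAssigned is not returned yet.
        List<Object> partialRead = List.of(new TaskCreated("task-1"));

        TaskAggregate task = new TaskAggregate();
        partialRead.forEach(task::apply);

        // StartTask is dispatched; TaskStarted is emitted and applied.
        task.apply(new TaskStarted("task-1"));

        // The snapshot trigger fires and this state gets persisted:
        System.out.println("assignee=" + task.assignee + ", started=" + task.started);
        // Prints: assignee=null, started=true -- a corrupted snapshot.
    }
}
```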
In other words, if the problem happens, the following is true: the number of played events (excluding a snapshot) does not equal the event count after the last snapshot (`AggregateStorage.readEventCountAfterLastSnapshot(...)`).
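A sketch of the check this invariant suggests; `readEventCountAfterLastSnapshot(...)` is the storage method named above, while the stub interface and the surrounding names are illustrative assumptions, not the framework's actual types:

```java
// Hypothetical guard to run before writing a snapshot.
interface AggregateStorage<I> {
    int readEventCountAfterLastSnapshot(I id);
}

class SnapshotGuard {
    // If fewer events were played than the storage reports after the
    // last snapshot, the history read was incomplete, and a snapshot
    // written now would persist a corrupted state.
    static <I> boolean safeToSnapshot(AggregateStorage<I> storage,
                                      I id,
                                      int playedEventCount) {
        return playedEventCount == storage.readEventCountAfterLastSnapshot(id);
    }
}
```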
Also, when using Datastore, it is possible that an event from the middle of an aggregate history won't be available, so the fix should take this into account.
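One possible way to account for such mid-history gaps (a sketch, assuming each stored event carries a monotonically increasing sequence number, which this issue does not confirm) is to verify that the events returned by a read form a contiguous run:

```java
import java.util.List;

class GapCheck {
    // Returns true if the sequence numbers of the events read from
    // the history are contiguous, i.e. no event in the middle of the
    // history is missing from the read.
    static boolean isContiguous(List<Integer> sequenceNumbers) {
        for (int i = 1; i < sequenceNumbers.size(); i++) {
            if (sequenceNumbers.get(i) != sequenceNumbers.get(i - 1) + 1) {
                return false; // a gap: a middle event was not returned
            }
        }
        return true;
    }
}
```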
The framework should either not store a wrong snapshot, as in the example above, or somehow deal with the eventual consistency of an aggregate history.
Hopefully this is going to be addressed by #1259. @armiol is adjusting the way snapshots are made.
After some discussion with @armiol, we decided to postpone the fix. We cannot fix it under 1.x without a significant performance penalty. The problem does not manifest often, and the delay with the fix is not going to impact the current users of the framework.