core-java

A snapshot of an aggregate may be corrupted due to eventually consistent history

Open dmytro-grankin opened this issue 6 years ago • 2 comments

Currently, a corrupted aggregate snapshot can be written to storage.

The problem occurs when an eventually consistent storage is used (e.g. Datastore).

Consider the problem using a task creation example. In the normal scenario, the TaskCreated and TaskAssigned events are already stored in the aggregate history. The aggregate is loaded and both events are played. Then the StartTask command is dispatched, and the TaskStarted event is emitted and applied to the aggregate. When the aggregate is stored, the snapshot trigger is reached and a snapshot is stored (see the picture below).

image

However, after an event has been stored in an aggregate history, it may be unavailable to subsequent history reads. In the problem scenario, the TaskCreated and TaskAssigned events are already stored in the aggregate history. The aggregate is loaded, but only the TaskCreated event is played; the TaskAssigned event is not returned when the history is read backward, due to eventual consistency. The StartTask command is dispatched, and the TaskStarted event is emitted and applied to the aggregate. When the aggregate is stored, the snapshot trigger is reached and a snapshot is stored. Because the TaskAssigned event was not available, and hence was not played before TaskStarted was applied, the task snapshot has a missing assignee (see the picture below).
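The two scenarios can be reproduced with a small, self-contained model. This is only an illustration, not the framework's code: `TaskSnapshotDemo`, `TaskState`, `loadAndStart` and the assignee value `"alice"` are hypothetical names, and the "history" is just the list of event names that a read happened to return.

```java
import java.util.List;

/** Hypothetical, simplified model of the task aggregate from the example. */
final class TaskSnapshotDemo {

    /** Minimal task state: only the fields needed for the illustration. */
    static final class TaskState {
        boolean created;
        String assignee;   // stays null if TaskAssigned is never played
        boolean started;

        void apply(String event) {
            switch (event) {
                case "TaskCreated":
                    created = true;
                    break;
                case "TaskAssigned":
                    assignee = "alice"; // made-up assignee for the sketch
                    break;
                case "TaskStarted":
                    started = true;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown event: " + event);
            }
        }
    }

    /**
     * Plays the history exactly as the read returned it, then applies
     * the newly emitted TaskStarted event. The resulting state is what
     * would be written as the snapshot.
     */
    static TaskState loadAndStart(List<String> historyAsRead) {
        TaskState state = new TaskState();
        historyAsRead.forEach(state::apply);
        state.apply("TaskStarted");
        return state;
    }

    public static void main(String[] args) {
        // Normal scenario: both stored events are returned by the read.
        TaskState normal = loadAndStart(List.of("TaskCreated", "TaskAssigned"));
        // Problem scenario: TaskAssigned is stored but not yet visible.
        TaskState stale = loadAndStart(List.of("TaskCreated"));

        System.out.println("normal assignee: " + normal.assignee);
        System.out.println("stale assignee:  " + stale.assignee);
    }
}
```

In the second call the state is started but has no assignee, which is the corrupted snapshot described above.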

image

In other words, when the problem happens, the number of played events (excluding the snapshot) does not equal the event count after the last snapshot (AggregateStorage.readEventCountAfterLastSnapshot(...)).
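One way to use this invariant is to skip the snapshot when the two numbers disagree. The sketch below is a guess at the shape of such a check, not the framework's implementation: `SnapshotGuard` and its methods are hypothetical, and it assumes that both counts are taken at load time (before newly emitted events are applied) and that the stored event count itself is read consistently.

```java
/** Hypothetical guard deciding whether a snapshot is safe to store. */
final class SnapshotGuard {

    /**
     * Returns true if the number of events played on load matches the
     * event count recorded after the last snapshot.
     */
    static boolean historyLooksComplete(int playedEvents,
                                        int storedEventCountAfterLastSnapshot) {
        return playedEvents == storedEventCountAfterLastSnapshot;
    }

    /**
     * Stores a snapshot only when the trigger is reached AND the loaded
     * history appears complete. Otherwise the snapshot is postponed.
     */
    static boolean shouldStoreSnapshot(boolean snapshotTriggerReached,
                                       int playedEvents,
                                       int storedEventCountAfterLastSnapshot) {
        return snapshotTriggerReached
                && historyLooksComplete(playedEvents,
                                        storedEventCountAfterLastSnapshot);
    }
}
```

For the problem scenario above, one event was played while two were stored, so `shouldStoreSnapshot(true, 1, 2)` returns `false` and the incomplete state is not persisted as a snapshot.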

Also, when using Datastore, an event from the middle of an aggregate history may be unavailable, so the fix should take this into account.

The framework should either not store a wrong snapshot, as in the example above, or otherwise deal with the eventual consistency of an aggregate history.
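To illustrate the missing-from-the-middle case, here is a minimal sketch of a gap check. It rests on an assumption that is not stated in this issue: that each event carries a version number which increases by exactly one per event, starting right after the snapshot's version. `HistoryGapCheck` and `isContiguous` are hypothetical names.

```java
import java.util.List;

/** Hypothetical check for holes in a loaded history (assumes +1 versioning). */
final class HistoryGapCheck {

    /**
     * Returns true if the event versions, given in ascending order, start
     * right after the snapshot version and have no holes between them.
     */
    static boolean isContiguous(long snapshotVersion, List<Long> eventVersions) {
        long expected = snapshotVersion + 1;
        for (long version : eventVersions) {
            if (version != expected) {
                return false; // an event is missing before this one
            }
            expected++;
        }
        return true;
    }
}
```

With a snapshot at version 0, `isContiguous(0, List.of(1L, 3L))` returns `false` because version 2 is missing. Note that this check cannot detect a missing event at the end of the history, since `isContiguous(0, List.of(1L, 2L))` is `true` even when a version 3 exists in storage. It would have to be combined with a count comparison to cover that case.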

dmytro-grankin, Sep 04 '18 14:09

Hopefully this is going to be addressed by #1259. @armiol is adjusting the way snapshots are made.

alexander-yevsyukov, Apr 10 '20 19:04

After some discussion with @armiol, we decided to postpone the fix. We cannot fix it under 1.x without a significant performance penalty. The problem does not manifest often, so the delay is not going to impact the current users of the framework.

alexander-yevsyukov, Sep 02 '20 12:09