sequent
sequent copied to clipboard
Need to use symmetric encryption of event data for GDPR to make an aggregate forgettable
I intend to have a go at implementing this feature, so I'm opening this issue to start a discussion.
To meet the requirements of the GDPR, a user needs to be able to be forgotten permanently. However, it is ideally best never to alter the event store so that the data remains immutable as far as possible. I think a good approach to this would be to optionally use a per-aggregate key to encrypt some fields before they are stored in the event_json
field of the event_records
table. The same key can then decrypt the on the way out again and to forget the data, the key can be deleted rather than the events.
I'm not sure of the best place to put the code, but I think encryption should happen just the event is written, and decryption should happen transparently when it's read back into an aggregate later. Keys could be generated automatically when first needed and stored in a new table, in a row that includes the encryption key and the aggregate id.
I think the table for the keys would make most sense in the sequent_schema
, perhaps called gdpr_keys
. Or perhaps it could just be an optional field on the stream_records
table?
I imagine the code interface looking something like this:
class User < Sequent::AggregateRoot
forgettable
def initialize()
# ...
end
end
With key deletion happening via a pre-defined event:
class DeleteUser < Sequent::Event
forgets User
# ...
end
Keen to hear other people's thoughts and suggestions!
Hi @mattgibson,
My thoughts:
We comply to GDPR as follows: our application is tenant based. Whenever a tenant is deleted we delete all aggregates belonging to that tenant from the event store after 30 days. We keep track to which tenant an aggregate (and its events) belongs. But ideally yes, the event store should remain as immutable as possible. In this case however we found this the most pragmatic and simple, though brute force, approach: We do not have to think about what is actually in the GDPR or not, we just delete everything. Since we are interested in what tenants actually did during their subscription we do backup all command_types and event_types the tenant did in a separate table before deleting. This is completely anonymous since it is only uuid's with event names. Also in our terms we state that we only delete after 30 days, this is mainly to provide an undo option. Because people actually do delete their account by accident, (even with big red DANGER-DANGER-are-you-really-sure-type-the-name-of-what-you-want-to-delete buttons).
I agree that changing the event store is not ideal, so encryption and throwing away the key could also be provided as option if you really do not what to mess with the contents of the event store.
This could be implemented by providing another implementation of the Sequent::Core::SerializesEvent
. A separate table would have my preference: stream_encryption_keys
or something (I would not put GDPR in the name).
Some things to keep in mind when implementing:
- You need to exclude the "deleted aggregates" when replaying events to build up projections. So the
stream_encryption_keys
should probably have two columns: one for the keys, and one for if the stream is/was actually encrypted or not:encryption_key
andencrypted_stream
. This way we can skip the streams of everything whereencrypted_stream
istrue
andencryption_key
isnull
. - Also there will be an impact on performance if you constantly need to decrypt events from aggregates when replaying. This of course only occurs when you rebuild a projection or change state of an aggregate
class DeleteUser < Sequent::Event
forgets User
# ...
end
This would create a bi-directional relation between User
and DeleteUser
event. So I would rather have something like:
class User < Sequent::AggregateRoot
forgettable
on UserDeleted do
forget_me # or something with a better name
end
end
While writing this I am thinking that another option would be to only encrypt on deletion. This way there is no performance penalty when replaying events:
class User < Sequent::AggregateRoot
forgettable
on UserDeleted do
forget_me # or something with a better name
end
end
When forget_me
is invoked the entire stream will be encrypted with a one-time key and a record in a table called forgotten_streams
is written`.
@erikrozendaal any thoughts?
About:
You need to exclude the "deleted aggregates" when replaying events to build up projections.
This is not always the case, unfortunately. In many domains and use-cases, one wishes to employ someting that is referred to as "Tombstoning".
A simple use-case would be the following.
Given an a published blog post by Bob the Blogger
And a comment "I did this too!" by Anne on that blogpost
And a reply to this "I did this too!" comment with "You did not, because I saw you yesterday" by Carol
When Anne decides to be forgotten
Then the comment "I did this too!" is replaced by a TombstoneComment, with content "\[deleted\]"
So that moderators, authors or other authorized people can delete comments without removing the context.
This mechanism also works when you replace "When Anne decides to be forgotten" with "When a moderator flags the content" or "When Anne deletes this comment".
A recent actual use-case: When I built a shared expenses app, we could not just delete User A after a GDPR-request, because that would invalidate all the expenses and financial transactions user A had with other, still active users. So we replaced all the attributes of objects that made up PII, with "deleted" or "anonymous" or other domain terms.
This "Tombstone" mechanism needs no additional work; it already is possible with generic event-sourced setups.
So I would think both the "exclude deleted aggregates" and the tombstoning should be possible with an implementation.
So if I understand you correctly you replace events in the event store with so called Tombstone events? Or do you just add such an event to the end of the stream? Because in that case the Events still contain the data that should be deleted?
Of course you can not delete events that will break your app. So in the case of the shared expenses app you could also separate out the financial consequences from the actual expenses. This way the app will keep working even if you ~~delete~~ forget the expenses. And depending on how you model your Expense you don't even have to forget those as long as it is not traceable anymore to the deleted user. This way you still comply to GDPR iirc.
So if I understand you correctly you replace events in the event store with so called Tombstone events? Or do you just add such an event to the end of the stream? Because in that case the Events still contain the data that should be deleted?
A Tobstoned event will be added to the event store. This merely results in the projection displaying a Tombstone instead of the original object/data.
This does, indeed, not solve a problem with the data living in the even store. For which encryption is a solution.
I merely brought this up here, because hardwiring something like "Undecryptable Events Are Ignored" might not work for all use-cases, as outlined above. Instead, on (replaying) an event that cannot be decrypted, it should i) still contain enough information in order to ii) be handled by the application with e.g. a tombstone, an ignore, delete or whatever a use-case needs.
Ah thanks for clarifying. Perhaps we could/should add a section in the documentation on this topic on what possibilities you have in order to comply to GDPR:
- Encryption
- Tombstoning
- Event deletion
Per possibility we could describe what the up- and downside and the consequences and how you could achieve this using Sequent.
Would that be useful @berkes @mattgibson ?
Brute force deletion
Does this mean Brute force deletion of events? Or data in projections?
Brute force deletion
Does this mean Brute force deletion of events? Or data in projections?
I have update the comments, I meant indeed Event Deletion.
I think a docs section would be a great idea. Especially with example code. It's clearly not one-size-fits-all.