sequent icon indicating copy to clipboard operation
sequent copied to clipboard

Need to use symmetric encryption of event data for GDPR to make an aggregate forgettable

Open mattgibson opened this issue 5 years ago • 8 comments

I intend to have a go at implementing this feature, so I'm opening this issue to start a discussion.

To meet the requirements of the GDPR, a user needs to be able to be forgotten permanently. However, it is ideally best never to alter the event store so that the data remains immutable as far as possible. I think a good approach to this would be to optionally use a per-aggregate key to encrypt some fields before they are stored in the event_json field of the event_records table. The same key can then decrypt the on the way out again and to forget the data, the key can be deleted rather than the events.

I'm not sure of the best place to put the code, but I think encryption should happen just the event is written, and decryption should happen transparently when it's read back into an aggregate later. Keys could be generated automatically when first needed and stored in a new table, in a row that includes the encryption key and the aggregate id.

I think the table for the keys would make most sense in the sequent_schema, perhaps called gdpr_keys. Or perhaps it could just be an optional field on the stream_records table?

I imagine the code interface looking something like this:

class User < Sequent::AggregateRoot
  forgettable

  def initialize()
    # ...
  end 
end

With key deletion happening via a pre-defined event:

class DeleteUser < Sequent::Event
  forgets User

  # ...
end

Keen to hear other people's thoughts and suggestions!

mattgibson avatar Jan 20 '20 23:01 mattgibson

Hi @mattgibson,

My thoughts:

We comply to GDPR as follows: our application is tenant based. Whenever a tenant is deleted we delete all aggregates belonging to that tenant from the event store after 30 days. We keep track to which tenant an aggregate (and its events) belongs. But ideally yes, the event store should remain as immutable as possible. In this case however we found this the most pragmatic and simple, though brute force, approach: We do not have to think about what is actually in the GDPR or not, we just delete everything. Since we are interested in what tenants actually did during their subscription we do backup all command_types and event_types the tenant did in a separate table before deleting. This is completely anonymous since it is only uuid's with event names. Also in our terms we state that we only delete after 30 days, this is mainly to provide an undo option. Because people actually do delete their account by accident, (even with big red DANGER-DANGER-are-you-really-sure-type-the-name-of-what-you-want-to-delete buttons).

I agree that changing the event store is not ideal, so encryption and throwing away the key could also be provided as option if you really do not what to mess with the contents of the event store. This could be implemented by providing another implementation of the Sequent::Core::SerializesEvent. A separate table would have my preference: stream_encryption_keys or something (I would not put GDPR in the name).

Some things to keep in mind when implementing:

  • You need to exclude the "deleted aggregates" when replaying events to build up projections. So the stream_encryption_keys should probably have two columns: one for the keys, and one for if the stream is/was actually encrypted or not: encryption_key and encrypted_stream. This way we can skip the streams of everything where encrypted_stream is true and encryption_key is null.
  • Also there will be an impact on performance if you constantly need to decrypt events from aggregates when replaying. This of course only occurs when you rebuild a projection or change state of an aggregate
class DeleteUser < Sequent::Event
  forgets User

  # ...
end

This would create a bi-directional relation between User and DeleteUser event. So I would rather have something like:

class User < Sequent::AggregateRoot
  forgettable

  on UserDeleted do
    forget_me # or something with a better name
  end

end

While writing this I am thinking that another option would be to only encrypt on deletion. This way there is no performance penalty when replaying events:

class User < Sequent::AggregateRoot
  forgettable

  on UserDeleted do
    forget_me # or something with a better name
  end

end

When forget_me is invoked the entire stream will be encrypted with a one-time key and a record in a table called forgotten_streams is written`.

@erikrozendaal any thoughts?

lvonk avatar Jan 21 '20 07:01 lvonk

About:

You need to exclude the "deleted aggregates" when replaying events to build up projections.

This is not always the case, unfortunately. In many domains and use-cases, one wishes to employ someting that is referred to as "Tombstoning".

A simple use-case would be the following.

Given an a published blog post by Bob the Blogger
And a comment "I did this too!" by Anne on that blogpost
And a reply to this "I did this too!" comment with "You did not, because I saw you yesterday" by Carol
When Anne decides to be forgotten
Then the comment "I did this too!" is replaced by a TombstoneComment, with content "\[deleted\]"
So that moderators, authors or other authorized people can delete comments without removing the context.

This mechanism also works when you replace "When Anne decides to be forgotten" with "When a moderator flags the content" or "When Anne deletes this comment".

A recent actual use-case: When I built a shared expenses app, we could not just delete User A after a GDPR-request, because that would invalidate all the expenses and financial transactions user A had with other, still active users. So we replaced all the attributes of objects that made up PII, with "deleted" or "anonymous" or other domain terms.

This "Tombstone" mechanism needs no additional work; it already is possible with generic event-sourced setups.

So I would think both the "exclude deleted aggregates" and the tombstoning should be possible with an implementation.

berkes avatar Oct 02 '20 08:10 berkes

So if I understand you correctly you replace events in the event store with so called Tombstone events? Or do you just add such an event to the end of the stream? Because in that case the Events still contain the data that should be deleted?

Of course you can not delete events that will break your app. So in the case of the shared expenses app you could also separate out the financial consequences from the actual expenses. This way the app will keep working even if you ~~delete~~ forget the expenses. And depending on how you model your Expense you don't even have to forget those as long as it is not traceable anymore to the deleted user. This way you still comply to GDPR iirc.

lvonk avatar Oct 02 '20 09:10 lvonk

So if I understand you correctly you replace events in the event store with so called Tombstone events? Or do you just add such an event to the end of the stream? Because in that case the Events still contain the data that should be deleted?

A Tobstoned event will be added to the event store. This merely results in the projection displaying a Tombstone instead of the original object/data.

This does, indeed, not solve a problem with the data living in the even store. For which encryption is a solution.

I merely brought this up here, because hardwiring something like "Undecryptable Events Are Ignored" might not work for all use-cases, as outlined above. Instead, on (replaying) an event that cannot be decrypted, it should i) still contain enough information in order to ii) be handled by the application with e.g. a tombstone, an ignore, delete or whatever a use-case needs.

berkes avatar Oct 02 '20 09:10 berkes

Ah thanks for clarifying. Perhaps we could/should add a section in the documentation on this topic on what possibilities you have in order to comply to GDPR:

  • Encryption
  • Tombstoning
  • Event deletion

Per possibility we could describe what the up- and downside and the consequences and how you could achieve this using Sequent.

Would that be useful @berkes @mattgibson ?

lvonk avatar Oct 02 '20 09:10 lvonk

Brute force deletion

Does this mean Brute force deletion of events? Or data in projections?

berkes avatar Oct 05 '20 07:10 berkes

Brute force deletion

Does this mean Brute force deletion of events? Or data in projections?

I have update the comments, I meant indeed Event Deletion.

lvonk avatar Oct 05 '20 19:10 lvonk

I think a docs section would be a great idea. Especially with example code. It's clearly not one-size-fits-all.

mattgibson avatar Nov 06 '20 23:11 mattgibson