
Archiving streams

Open alexeyzimarev opened this issue 3 years ago • 15 comments

To keep the event store database size bounded, it would be useful to bake in an archiving function.

The archive storage could be anything, so it needs an abstraction. The functionality would be limited to a stream archive function in a separate class, which depends on IEventStore and the archive interface.

Proposed functions:

  • ArchiveStream would take a given stream, read all its events, serialize them to an array of objects, and push everything as a single object to the archive
  • RestoreStream would do the opposite, given that we know the types of all the archived events (see the sketch below)
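
A minimal sketch of how those two functions and the abstraction could hang together. All names and signatures here are hypothetical stand-ins, not the actual Eventuous API:

```csharp
using System.Threading;
using System.Threading.Tasks;

// All names and signatures below are illustrative, not the actual Eventuous API.
public record StoredEvent(string EventType, string EventId, long Version, byte[] Payload);

// Simplified stand-in for the real IEventStore dependency.
public interface IEventStore {
    Task<StoredEvent[]> ReadStream(string streamName, CancellationToken ct);
}

// The proposed storage-agnostic archive abstraction: one object per stream.
public interface IEventArchive {
    Task StoreStream(string streamName, StoredEvent[] events, CancellationToken ct);
    Task<StoredEvent[]> LoadStream(string streamName, CancellationToken ct);
}

public class StreamArchiver {
    readonly IEventStore   _store;
    readonly IEventArchive _archive;

    public StreamArchiver(IEventStore store, IEventArchive archive)
        => (_store, _archive) = (store, archive);

    // ArchiveStream: read all events from the stream and push them to the
    // archive as a single object. The envelope keeps the event type per event,
    // so RestoreStream can deserialize each one back to its CLR type.
    public async Task ArchiveStream(string streamName, CancellationToken ct) {
        var events = await _store.ReadStream(streamName, ct);
        await _archive.StoreStream(streamName, events, ct);
        // Whether to delete/truncate the source stream is left to the caller.
    }
}
```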

Consider:

  • Store event type for each event in the archive envelope
  • Should the archived stream be deleted or truncated after writing an activation event? It can be left to the user to decide, or it can be a built-in feature.

alexeyzimarev avatar Jan 24 '22 13:01 alexeyzimarev

@alexeyzimarev - this sounds very interesting. I wondered if the following case would be handled by this.

We were running EventStore 21.6, and one of our projections started writing out every single state change as an event (the projection was partitioned).

So the stream created for a specific partition was:

$projections-projectionStreamManager-StoreProduct-027c3bace2d1188bf5a2c3c5f193c49f-result

And this contained the state every time it was changed

Our EventStore is flooded with this (gigabytes of space, unfortunately).

We were going to write a tool to try and reclaim this space, but maybe this gives us an alternative.

Btw, upgrading to 21.10 fixed the issue.

StevenBlair123 avatar Jan 24 '22 14:01 StevenBlair123

The archive is more for closing business entities that aren't going to change anymore. Like when an order is fulfilled, you might not need to keep it anymore. Restore in this case would bring the stream back so the original entity can be mutated with new events; read models can also re-project the stream and make it available for queries.

alexeyzimarev avatar Jan 24 '22 16:01 alexeyzimarev

Ok, the design proposal is:

  • Archiving could be done by moving events to the archive store
  • Events can be copied all the time (with Elastic connector, for example), or it is an explicit action
    • Implicit archive: stream max-age is set to one year, and old events get scavenged. When needed, old events can be retrieved from the archive
    • Explicit archive: when the stream reaches a certain size, events get archived and the stream is truncated
  • When reading a stream to handle a command, we try loading events from the primary store. If the stream is missing or incomplete (the first event number is greater than zero), the missing events are fetched from the archive (see the sketch below)
  • Snapshot is an event
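
A sketch of that fallback read, reusing the hypothetical StoredEvent, IEventStore, and IEventArchive shapes from the earlier sketch. The version check is the "first event number greater than zero" rule from the list above:

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hot store first; archive only for the missing head of the stream.
public class TieredEventReader {
    readonly IEventStore   _hot;
    readonly IEventArchive _archive;

    public TieredEventReader(IEventStore hot, IEventArchive archive)
        => (_hot, _archive) = (hot, archive);

    public async Task<StoredEvent[]> ReadStream(string streamName, CancellationToken ct) {
        var hotEvents = await _hot.ReadStream(streamName, ct);

        // Complete stream in the hot store: nothing else to load.
        if (hotEvents.Length > 0 && hotEvents[0].Version == 0) return hotEvents;

        // Stream missing or truncated: fetch the head from the archive and
        // keep only archived events below the first version found in the hot store.
        var archived = await _archive.LoadStream(streamName, ct);
        var firstHot = hotEvents.Length > 0 ? hotEvents[0].Version : long.MaxValue;
        return archived.Where(e => e.Version < firstHot).Concat(hotEvents).ToArray();
    }
}
```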

alexeyzimarev avatar Apr 08 '22 11:04 alexeyzimarev

Regarding handling commands on archived streams, I foresee a problem with projected read models. Given a stream is archived and a read model is rehydrated afterwards, it will not receive the archived events. When a command is handled (based on archived events), the new event(s) would be the first events of that aggregate that said projection would receive.

My suggestion is to not promote such a feature by providing it in the library. People can build it themselves should they need to. It could be helpful to have a specific exception thrown by command-handling if/when a command is handled for an archived stream (I imagine the stream would only contain a "stream archived" event at that time, previous events being truncated/scavenged).

A compromise could be to, by configuration, allow the command-handler to "un-archive" a stream automatically, writing old events back to the primary store as part of the command-handling transaction.
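
A rough sketch of the exception suggested above; StreamArchived and StreamArchivedException are made-up names for illustration:

```csharp
using System;

// Hypothetical marker event left behind when a stream gets archived.
public record StreamArchived(DateTimeOffset ArchivedAt);

// Hypothetical exception the command handler could throw when it finds only
// the marker event, the earlier events having been truncated/scavenged.
public class StreamArchivedException : Exception {
    public StreamArchivedException(string streamName)
        : base($"Stream '{streamName}' is archived and cannot handle commands") { }
}
```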

JulianMay avatar Apr 08 '22 14:04 JulianMay

Yes, we were discussing this as well :) Replays will be highly problematic.

However, a colleague made this point: you introduced a new feature, so you can hardly expect that all your 15 years of data will be present in this new feature. Say you want to show the number of bookings cancelled over time. I wouldn't expect to see the total number of cancellations for 10 years back. In reality, I only want to know this for the last couple of months. The archiving strategy needs to be tuned with those requirements in mind.

Plus, we don't want to have the StreamArchived event really. The idea is to have a composite event store, which will load the stream from the operational store and check the first event version. If the version is higher than zero, it will attempt to get the remaining set of events from the archive store. So, for executing commands it will be fully transparent. You'd need to be aware of this and be prepared for it in projections accordingly. The archiving process can be done mechanically. For example (that's what we plan), the Elastic connector replicates everything to the archive store. Not only do you get all the analytics in Kibana, you also get an archive with tiered storage, and it's very cheap. "Archiving" as such happens by setting the max age of the stream in ESDB. So, when you accidentally hit a need to execute a transaction on a very old stream, you get all the events from both stores.
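
For reference, the max-age part could look like this with the EventStore gRPC client; the connection string, stream name, and one-year TTL are made-up values:

```csharp
using System;
using EventStore.Client;

var client = new EventStoreClient(
    EventStoreClientSettings.Create("esdb://localhost:2113?tls=false"));

// Events older than a year become eligible for scavenging. The operational
// store only keeps the recent tail; the archive store keeps everything.
await client.SetStreamMetadataAsync(
    "Booking-8a6f",                                      // made-up stream name
    StreamState.Any,
    new StreamMetadata(maxAge: TimeSpan.FromDays(365)));
```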

alexeyzimarev avatar Apr 08 '22 14:04 alexeyzimarev

Unarchiving streams would be undesirable, as it will fuck up the versioning and create additional concerns for projections and for replication of the events to the archive store (it must be fully idempotent, and we must use the original event ids).
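
For what it's worth, the usual way to get that idempotency is to key the archive documents on the original event id, so a replayed connector overwrites instead of duplicating. A rough sketch with the NEST Elasticsearch client; the index name and document shape are made up:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Nest;

// Hypothetical archive document; the original event id is the document key.
public record ArchiveDoc(string EventId, string Stream, long Version, string EventType, string Json);

public static class ArchiveWriter {
    // Using the original event id as the Elasticsearch document id makes the
    // write idempotent: re-running the connector overwrites the same document.
    public static Task IndexEvent(IElasticClient client, ArchiveDoc doc, CancellationToken ct) =>
        client.IndexAsync(doc, i => i.Index("event-archive").Id(doc.EventId), ct);
}
```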

alexeyzimarev avatar Apr 08 '22 14:04 alexeyzimarev

That's why I want dev calls 😅. We were discussing those issues just two hours ago...

alexeyzimarev avatar Apr 08 '22 14:04 alexeyzimarev

Yea, having a discussion asynchronously like this is never optimal 😌 You make good points (no surprise), I'm just concerned about the "you'd need to be aware of this and be prepared for it in projections accordingly" part. At least this caveat should ideally be prominently mentioned in the documentation around archiving. I would be curious to see examples of how to prepare for it, like would your projections need access to the archive as well? To me it seems they would have to, unless you can simply ignore the new event to an otherwise archived stream.

The scenario I'm circling around is "users invoking commands for an archived aggregate" - maybe I'm overthinking it... The user would probably have a view from a projected read model from before the stream was archived, and would assume that view to be updated with the changes they made - which would work. The day after, though, that (order, booking, whatever) would be gone, because the projection was replayed having only the change from the day before, and would therefore not be viewable (or at least look very wrong).

If we allow users to handle commands on archived streams, shouldn't we expect they could make several, over days or weeks, even though the stream has been archived for years?

To me it seems like one of those situations where it's better to say "you can't, but you can easily make a copy of the old one and make your changes to that" - so not un-archiving, more like superseding.

I'm always open for a call if you'd like to chat 🤙

JulianMay avatar Apr 08 '22 15:04 JulianMay

The user would probably have a view from a projected read model from before the stream was archived, and would assume that view to be updated with the changes they made - which would work.

Correct, that's the idea

The day after, though, that (order, booking, whatever) would be gone, because the projection was replayed having only the change from the day before, and would therefore not be viewable (or at least look very wrong).

If we talk about the same aggregate, we have the following:

  • There's a read model built before from archived events, so it's correct
  • We get a new command
  • We read the archived stream to get the aggregate state
  • The new event gets appended to the stream in the hot store
  • This new event gets projected to all the read models
  • It will also stay in the hot store until it gets truncated based on the stream TTL or size

There's no issue here.

We do, still, have an issue with re-projecting everything. But then again, just think about it: was replaying everything ever a good idea? Say we have a projection "pending arrivals". It always looks into the future, so replaying all the history would just create a lot of useless ops on the reporting database, as we add a record to "pending arrivals" when the booking is paid and remove it after the guest has checked in. Projecting the whole history here would mean that we do a lot of appends and deletions for no reason at all, as none of those historical arrivals is relevant; they happened long in the past. If we just project from what's in the hot store, we don't project the whole history, but just the relevant part of it.

What we discussed is that data in the hot store could give enough events to build new read models that give relevant information about what recently happened. It depends on the use case, of course. It makes me think that the starting point of a new projection needs some flexibility, as right now it's from the beginning of time.
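
For context, the ESDB client already allows a catch-up subscription to start from a given position instead of the beginning; the checkpoint below is a made-up value:

```csharp
using System.Threading.Tasks;
using EventStore.Client;

var client = new EventStoreClient(
    EventStoreClientSettings.Create("esdb://localhost:2113?tls=false"));

// A made-up checkpoint; a new projection could start here rather than
// replaying from the beginning of time.
var checkpoint = new Position(1_000_000UL, 1_000_000UL);

await client.SubscribeToAllAsync(
    FromAll.After(checkpoint),
    async (subscription, resolvedEvent, ct) => {
        // project resolvedEvent to the read model here
        await Task.CompletedTask;
    });
```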

On the other hand, if all the events in the archive store are produced by a connector, they look 100% identical to the original events, including the commit position, etc. So, I don't think it's impossible to have a subscription there - a catch-up subscription without the real-time functionality.

alexeyzimarev avatar Apr 09 '22 07:04 alexeyzimarev

As I said, "I'm probably overthinking it" 😌 What trips me up is: if you allow one command on an archived aggregate, why not several - with maybe months in between, and read model rehydrations in the meantime.

That temporal coupling seems unnerving - but I probably just don't have a good enough idea of a relevant use case to think it through. The ability to archive streams is definitely a good and important feature, and how it fits into the respective application should be modelled.

JulianMay avatar Apr 09 '22 09:04 JulianMay

The use case is simple. We have 15 years of data collected in SQL, and now we are migrating to an event-sourced system. About 1.6 billion streams will be initiated by the ImportedFromLegacy event. Our users, however, only make mutations for the streams that are just a couple of weeks old (that's a usual behaviour almost everywhere), maybe two months. We also have a cap for aggregate mutability, which is set to two years from the date the aggregate is created.

So, we decided not to move all the events through ESDB, as they would just be sitting there doing nothing, occupying space and making the database too large. We will project the legacy data to the read model with a one-time importer, at the same time as we produce those events. But we want to produce the events directly to the archive. That gives us the first scenario - if the stream is not found, let's look in the archive. Then, a mutation of an archived aggregate would follow the normal flow:

  • rehydrate from the archive
  • apply new events
  • new events go to the regular store
  • new events get copied to the archive using the connector
  • new events get projected to all the read models as normal
  • eventually, those events will get scavenged, and disappear from the main store

The original issue was about something else, which is still in the plans: a deliberate action to archive the stream. But it's very similar, as the first step would be to support reading events from the archive seamlessly. The writing part for now would be connector-based (Elastic). Other types of archive store can be added later, and then the explicit archive will need to be implemented (cloud buckets, etc).

alexeyzimarev avatar Apr 09 '22 10:04 alexeyzimarev

"rehydrate from the archive" solves my concern, as you could do this if you needed to rehydrate while having "unfinished business" to do on an otherwise archived stream (the commands that triggered me). Thanks for elaborating 👍

JulianMay avatar Apr 09 '22 11:04 JulianMay

Right, I might not have explained the idea well enough.

I already built a store for Elastic, which can read and write (not sure about the writing part though).

The plan is now to split readers and writers, then make an archiving store using a composition of two readers and a single writer (see the sketch below).
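
A sketch of that composition, again with made-up interface shapes (and reusing the hypothetical StoredEvent record from earlier): two readers behind a single reading path, with all writes going to the hot store:

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Illustrative reader/writer split; not the actual Eventuous interfaces.
public interface IEventReader {
    Task<StoredEvent[]> ReadStream(string streamName, CancellationToken ct);
}

public interface IEventWriter {
    Task AppendEvents(string streamName, StoredEvent[] events, CancellationToken ct);
}

public class ArchivingEventStore : IEventReader, IEventWriter {
    readonly IEventReader _hotReader;
    readonly IEventReader _archiveReader;
    readonly IEventWriter _hotWriter;

    public ArchivingEventStore(IEventReader hot, IEventReader archive, IEventWriter writer)
        => (_hotReader, _archiveReader, _hotWriter) = (hot, archive, writer);

    // Writes go to the hot store only; the connector replicates to the archive.
    public Task AppendEvents(string streamName, StoredEvent[] events, CancellationToken ct)
        => _hotWriter.AppendEvents(streamName, events, ct);

    // Reads combine both readers, pulling the stream head from the archive
    // when the hot store copy is missing or truncated.
    public async Task<StoredEvent[]> ReadStream(string streamName, CancellationToken ct) {
        var hot = await _hotReader.ReadStream(streamName, ct);
        if (hot.Length > 0 && hot[0].Version == 0) return hot;
        var archived = await _archiveReader.ReadStream(streamName, ct);
        var firstHot = hot.Length > 0 ? hot[0].Version : long.MaxValue;
        return archived.Where(e => e.Version < firstHot).Concat(hot).ToArray();
    }
}
```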

alexeyzimarev avatar Apr 09 '22 11:04 alexeyzimarev

Merged #80; it has the aggregate store with archive fallback and a sample implementation with Elastic.

alexeyzimarev avatar Apr 13 '22 07:04 alexeyzimarev

Releasing it as 0.7.0

alexeyzimarev avatar Apr 13 '22 07:04 alexeyzimarev