
A way to clean up old runs / artifacts from datastore

tuulos opened this issue 4 years ago • 6 comments

There isn't an easy way to clean up old entries from the datastore. S3 lifecycle policies work for metadata but not for content-addressed artifacts, which may be shared by multiple runs, new and old.

One option is to introduce a new command, garbage-collect, which cleans up old entries based on knowledge of how the datastore is structured internally.

Discussion https://gitter.im/metaflow_org/community?at=609db88ac60a604255b9f11e

tuulos · May 14 '21

I think we definitely need a story around this. Based on the thread and other conversations we've had, I think there are a few concerns to address:

  • overall size of data saved (and basically cleaning up old stuff)
  • compliance regulation (data can only be kept for a certain amount of time)

There are then various ways of accessing the data and/or indicating that it is stale that can be considered:

  • "production" date of the data. This obviously includes the initial write (when the first flow writes it) but could also include any later "writes" to it (ie: another flow writing the same data); we do not currently capture that second piece of information easily, but it is something we should consider if we want a policy based on "time of write"
  • read date of the data. Data can be accessed by flows (and by the client) without being modified. We currently do not capture read dates anywhere.
  • tags could also be used to mark artifacts and allow them to be culled based on the tag value. This could extend to dates as well (tags could simply capture dates); a rough sketch of this idea follows below.
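
As an illustration of the tag-based option, here is a sketch that records an expiry date as a run-level tag and later lists expired runs. It assumes the client's tag-mutation API (Run.add_tag), an "expires:" tag convention invented for this example, and that created_at is a datetime; none of this is an existing retention feature.

```python
# Illustrative sketch only: record an expiry date as a run tag so a separate cleanup
# job can find expired runs. The "expires:" convention is made up for this example.
from datetime import datetime, timedelta, timezone

from metaflow import Flow

RETENTION = timedelta(days=90)

def tag_expiry(flow_name):
    """Attach an 'expires:YYYY-MM-DD' tag to runs that don't have one yet."""
    for run in Flow(flow_name):
        if not any(t.startswith("expires:") for t in run.tags):
            # Assumes run.created_at is a datetime (true for recent client versions).
            run.add_tag("expires:" + (run.created_at + RETENTION).date().isoformat())

def expired_runs(flow_name):
    """Yield pathspecs of runs whose expiry tag is in the past."""
    today = datetime.now(timezone.utc).date().isoformat()
    for run in Flow(flow_name):
        for t in run.tags:
            if t.startswith("expires:") and t.split(":", 1)[1] < today:
                yield run.pathspec

if __name__ == "__main__":
    tag_expiry("MyFlow")
    for pathspec in expired_runs("MyFlow"):
        print("candidate for deletion:", pathspec)
```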

There are some thorny issues related to concurrent delete and access (eg: data is present in the datastore, a flow checks that it exists, sees it and therefore doesn't re-save the artifact, and then the data is deleted), but this is definitely something we should work on.

romain-intel · May 14 '21

Yep, the topic probably deserves a design memo of its own but here's a quick strawman proposal to seed the discussion:

  1. new command metaflow garbage-collect --flow MyFlow
  2. the command traverses the datastore (S3), reference-counting artifacts based on task metadata (data.json); see the sketch after this list
  3. a checkpoint file is created with the info from (2). Next time the command is run, only tasks newer than the checkpoint need to be analyzed and the checkpoint updated.
  4. delete runs and artifacts older than the limit
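
To make the strawman more concrete, here is a minimal sketch of steps (2) and (4). It assumes a content-addressed layout under <prefix>/data/ and per-task data.json files that map artifact names to content hashes; the real datastore layout and metadata format may differ, so all bucket names, paths and key formats below are placeholders, and the checkpointing of step (3) is left out.

```python
# Illustrative sketch only. Assumes artifacts live under <PREFIX>/data/<hash> and that
# each task's data.json maps artifact names to content hashes -- the real Metaflow
# layout may differ, so treat every path and key format here as a placeholder.
import json
from datetime import datetime, timedelta, timezone

import boto3

BUCKET = "my-metaflow-bucket"   # placeholder
PREFIX = "metaflow/MyFlow"      # placeholder
RETENTION = timedelta(days=90)

s3 = boto3.client("s3")

def list_keys(prefix):
    """Yield (key, last_modified) for every object under a prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["LastModified"]

def referenced_hashes():
    """Step 2: collect hashes referenced by task metadata newer than the retention limit."""
    cutoff = datetime.now(timezone.utc) - RETENTION
    live = set()
    for key, modified in list_keys(f"{PREFIX}/"):
        if key.endswith("data.json") and modified >= cutoff:
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            # Assumed structure: {"artifact_name": "content_hash", ...}
            live.update(json.loads(body).values())
    return live

def collect_garbage(dry_run=True):
    """Step 4: delete content-addressed objects no live task references.

    A full implementation would also delete the old runs' metadata and would need
    to coordinate with in-flight runs (the concurrency concern raised above).
    """
    live = referenced_hashes()
    for key, _ in list_keys(f"{PREFIX}/data/"):
        if key.rsplit("/", 1)[-1] not in live:
            print(("would delete" if dry_run else "deleting"), key)
            if not dry_run:
                s3.delete_object(Bucket=BUCKET, Key=key)

if __name__ == "__main__":
    collect_garbage(dry_run=True)
```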

tuulos · May 18 '21

Ideally the user could set a TTL rather than managing a job to GC...

talebzeghmi · May 19 '21

@talebzeghmi, do you mean a TTL on each run and having things auto-clean up after a while? It could be possible to run the GC job after each flow run; in other words, take @tuulos's idea and run it automatically at the end or beginning of each run of a flow. This would not clean out things that have not been run in a while (ie: stale flows), though; for that, you would need something to check, and S3 unfortunately does not make it easy to "touch" files to update their TTL.
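
For what it's worth, a minimal sketch of what "run GC at the end of each flow" could look like; garbage_collect and my_gc_module are hypothetical placeholders (e.g. something like the reference-counting sketch above), not an existing Metaflow feature.

```python
# Illustrative sketch only: call a hypothetical garbage_collect() helper from the
# flow's end step so cleanup happens automatically after every run. Concurrent
# runs/GC would still need coordination.
from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.result = 42  # the flow's real work goes here
        self.next(self.end)

    @step
    def end(self):
        from my_gc_module import garbage_collect  # placeholder module
        garbage_collect(dry_run=False)

if __name__ == "__main__":
    MyFlow()
```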

@tuulos, I like the idea of maintaining a flow-level file. It's actually something that could possibly be done every time (as opposed to just when running the garbage-collect command). In both cases we would have to solve the issue of concurrent runs/gc.

I wonder though at what granularity the TTL should be set. Your proposal seems to be hinting at a run-level granularity (ie: everything in a run becomes stale at once) but there may be some value in a more graduated approach where runs don't expire all at once and, possibly, some intermediate data, useful for debugging and finding out what happened, expires first but the run sticks around with some of the more "permanent" data. This is obviously a little bit more complicated to implement and requires the user to think about the permanence of their data (which hadn't been a concern till now).

In this scenario however, the TTL would be attached at the artifact level and a run would "disappear" once all its artifacts disappear (as opposed to the other way around where artifacts are unreferenced when a run becomes stale). This would actually have to be a bit more subtle than that because artifacts could be "permanent" in some runs and "ephemeral" in others. So maybe a better way would be that artifacts become "detached" from the run and a run goes away when it has no more attached artifacts.
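
Purely to make the attach/detach semantics concrete, here is a rough sketch of the bookkeeping this variant implies; none of these names exist in Metaflow.

```python
# Illustrative data-model sketch only, not part of Metaflow: each run attaches a
# content-addressed artifact with its own TTL; the object is deletable only once
# every attachment has expired or been detached, and a run "goes away" when none
# of its artifacts remain attached.
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class Attachment:
    run_pathspec: str      # e.g. "MyFlow/1234"
    expires_at: datetime   # per-run TTL for this artifact

@dataclass
class ArtifactRecord:
    sha: str                                          # content-addressed key
    attachments: list = field(default_factory=list)   # list of Attachment

    def attach(self, run_pathspec, ttl):
        self.attachments.append(
            Attachment(run_pathspec, datetime.now(timezone.utc) + ttl)
        )

    def detach_expired(self):
        now = datetime.now(timezone.utc)
        self.attachments = [a for a in self.attachments if a.expires_at > now]

    @property
    def deletable(self):
        return not self.attachments
```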

romain-intel · May 19 '21

So is the proposal that metaflow garbage-collect --flow MyFlow would delete all previous runs for that flow? I'm not sure where the TTL is specified here.

abdulsalama · May 24 '21

Hi folks, wondering what the status of this is?

Context:

  • we need to delete data to comply with GDPR deletion requests
  • so far we've tried doing this by setting a lifecycle policy to expire objects in the Metaflow S3 bucket, but this seems to cause step functions to start failing after some time (possibly the code is also stored in S3 and gets expired?)

Is anyone else handling data deletion or lifecycle policies in a different way that we could use until this feature is prioritised? cc @tuulos

lsimoneau · May 31 '22

Hey all, I'm also curious about the status of this request. To leverage Metaflow, we are also in a situation where our S3 bucket must have a lifecycle policy so that we can comply with GDPR deletion requests. Unfortunately, doing this seems to result in a lot of friction, where the only solution is to rerun the existing flow from scratch.

In almost all situations, the expired artifact seems to be the FlowSpec code itself, and with a distinct lack of resume behavior in SFN, we are often forced to waste days' worth of developer time.

JustAnotherTDo · Oct 11 '22

We also have a centralised Airflow deployment which stores data in a single S3 bucket.

Some flows generate a pretty massive amount of data compared to others, and it'd be nice to have a way of GC'ing flow data older than some threshold.

For instance, one flow has 42TB of S3 objects while other flows may have only KBs of data; it's difficult to expire this in a targeted way right now.
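
Until there is a built-in mechanism, one stopgap sketch is to enumerate runs of a single heavy flow older than a threshold with the Metaflow client and feed their pathspecs into whatever deletion process you trust; actual deletion still has to account for content-addressed artifacts shared across runs, as discussed above.

```python
# Illustrative sketch only: list candidate runs of one flow older than a threshold.
# Deleting their artifacts safely still requires reference counting, since
# content-addressed objects can be shared across runs.
from datetime import datetime, timedelta, timezone

from metaflow import Flow

def old_runs(flow_name, older_than=timedelta(days=180)):
    cutoff = datetime.now(timezone.utc) - older_than
    for run in Flow(flow_name):  # iterates newest first
        created = run.created_at
        if created.tzinfo is None:   # normalize in case created_at is naive
            created = created.replace(tzinfo=timezone.utc)
        if created < cutoff:
            yield run.pathspec

if __name__ == "__main__":
    for pathspec in old_runs("MyHeavyFlow"):
        print("expiry candidate:", pathspec)
```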

Limess · Feb 08 '23