
Proposal: Data Retention Functionality

Status: Open · opened by KevinMellott91 · 1 comment

Background

Some organizations require data retention policies to be enforced to meet legal and/or regulatory requirements. This proposal outlines a few ways this functionality could be implemented within Marquez.

Retention Scope

Data retention will be an opt-in feature, disabled by default. This ensures that existing deployments are unaffected, while providing the functionality to those who need it.

When targeting data for retention, the scope will be focused on dataset/job versions and job runs. Furthermore, the Marquez configurations will allow retention policies to be set individually for each type of entity (dataset version, job version, job run) to allow for flexible enforcement.

This will ensure that core entities (namespaces, sources, datasets, jobs) are not impacted and will not magically disappear from the system. Lineage events are also out of scope, so that we don't lose the ability to replay those events over time.

Retention Behavior

Retention rules can be configured to purge items that are older than a given timeframe AND/OR more than X versions behind the current version. Other systems have implemented retention using one of these conditions or the other; however, I think supporting both is required for this to work in many real-world scenarios.

Here is how this could be configured within a Marquez deployment.

```yaml
# Rules can be combined under a section to ensure both conditions are met prior to data being purged.
retention:
  enabled: true # Enables retention enforcement for the entities configured below. Omit sections when retention is not desired.
  schedule: "0 3 * * *" # When deployed into Kubernetes, enforce retention policies on this schedule.
  datasetVersions:
    recentItemsToKeep: 10 # Purges any dataset version records older than the 10 most recent.
  jobVersions:
    daysToKeep: 365 # Purges any job version records older than 1 year.
  jobRuns: # Purges job run records that are both older than the 10 most recent AND older than 1 year.
    recentItemsToKeep: 10
    daysToKeep: 365
```
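As a sketch of how the combined rule could be evaluated (hypothetical class and method names, not actual Marquez code), a version record is purged only when every condition that is configured holds:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: evaluates the combined retention rule described above.
// A record is purged only when it is outside the N most recent versions
// AND older than the cutoff date (for whichever conditions are configured).
public final class RetentionPolicy {
  private final Integer recentItemsToKeep; // null = condition not configured
  private final Integer daysToKeep;        // null = condition not configured

  public RetentionPolicy(Integer recentItemsToKeep, Integer daysToKeep) {
    this.recentItemsToKeep = recentItemsToKeep;
    this.daysToKeep = daysToKeep;
  }

  /** Returns the creation timestamps of version records eligible for purging. */
  public List<Instant> selectForPurge(List<Instant> createdAts, Instant now) {
    if (recentItemsToKeep == null && daysToKeep == null) {
      return new ArrayList<>(); // retention not configured; keep everything
    }
    List<Instant> newestFirst = new ArrayList<>(createdAts);
    newestFirst.sort(Comparator.reverseOrder());
    List<Instant> purge = new ArrayList<>();
    for (int i = 0; i < newestFirst.size(); i++) {
      Instant t = newestFirst.get(i);
      boolean outsideRecent =
          recentItemsToKeep == null || i >= recentItemsToKeep;
      boolean tooOld = daysToKeep == null
          || t.isBefore(now.minus(Duration.ofDays(daysToKeep)));
      if (outsideRecent && tooOld) {
        purge.add(t); // every configured condition holds
      }
    }
    return purge;
  }
}
```

With `recentItemsToKeep: 10` and `daysToKeep: 365` set together (as in the `jobRuns` section), an old record inside the 10 most recent still survives, matching the AND semantics above.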

Retention Enforcement

Retention enforcement can be applied on a configurable schedule, using a Java container-based batch job. We can provide a script to execute this job via Docker to support local development, and also deploy it into Kubernetes as a CronJob as part of the Helm chart. The CronJob will execute on the schedule defined in the YAML configuration and purge the applicable records based on the retention policy rules.

As records are identified and purged, applicable information can be logged and incorporated into the metrics framework.
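The CronJob piece could look roughly like this (a sketch only; the image, args, and resource names are placeholders, not the actual Marquez Helm chart):

```yaml
# Hypothetical Kubernetes CronJob rendered by the Helm chart.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: marquez-retention
spec:
  schedule: "0 3 * * *"       # mirrors retention.schedule in the config above
  concurrencyPolicy: Forbid   # never overlap two purge runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: retention
              image: marquezproject/marquez:latest  # placeholder image
              args: ["retention", "--config", "/etc/marquez/config.yml"]  # placeholder entrypoint
```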

KevinMellott91 avatar Apr 28 '21 20:04 KevinMellott91

Hi Kevin, thank you for the thoughtful proposal. This makes a lot of sense. Here are some amendments I would make.

  • We need to separate, in the configuration, the retention policy definition (recentItemsToKeep/daysToKeep) from the implementation configuration (in this case, a scheduled job running on an interval). These are distinct concerns, and different strategies can be applied to enforce the policy.

On the functional specification:

  • We need to spec out the retention policy rules in more detail: in your example, recentItemsToKeep has slightly different semantics depending on whether it's used with daysToKeep. In datasetVersions it acts as a maximum ("purges any dataset version records older than the 10 most recent"), while in jobRuns it acts as a minimum ("purges job run records that are both older than the 10 most recent AND older than 1 year").
  • It would be useful to define requirements for when we should not keep some data after a certain time (and why), as well as whether there are rules for when we must keep the data for at least a given amount of time.
  • I would tie runs and versions together more closely to keep the model consistent: a datasetVersion is pointed to by a run, as is a jobVersion. How do we deal with a run pointing to a deleted job or dataset version? Do we prune related objects? Should we keep the runs that point to the datasetVersions we keep?

On the implementation:

  • Once we have cleared up that first part, there will also be a need to discuss the performance aspects.
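To make the semantic difference in the second point concrete, here is a hedged illustration (hypothetical helper methods, index 0 = newest version; not actual Marquez code) of the two readings of recentItemsToKeep:

```java
import java.time.Duration;
import java.time.Instant;

// Contrasts the two readings of recentItemsToKeep noted in the review.
public final class RetentionSemantics {
  // Alone: a maximum — everything beyond the N newest is purged, regardless of age.
  static boolean purgeAsMax(int index, int recentItemsToKeep) {
    return index >= recentItemsToKeep;
  }

  // Combined with daysToKeep: a minimum — the N newest are always kept,
  // and older entries survive until they also pass the age cutoff.
  static boolean purgeAsMin(int index, int recentItemsToKeep,
                            Instant createdAt, int daysToKeep, Instant now) {
    return index >= recentItemsToKeep
        && createdAt.isBefore(now.minus(Duration.ofDays(daysToKeep)));
  }
}
```

Spelling this out in the spec would remove the ambiguity between the datasetVersions and jobRuns examples.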

This is a great proposal and I'm looking forward to making progress on it. Thanks!

julienledem avatar Apr 29 '21 17:04 julienledem