Data Retention Policy Support
We need to add data retention support to Core/Enterprise, similar to what previous versions of InfluxDB provided: automatically deleting data once it reaches a certain age.
Note: I am calling for days and not hours below, but this needs to be validated. We did have hours in previous versions, and I am trying to gather some data to validate this need. Deleting data while compaction is also running on Enterprise could cause conflicts, so I want to see if this can be minimized by limiting deletion of expired data to once per day.
MVP General Requirements (Database Level Retention)
- At a database level, a customer must be able to set a retention period (number of days)
- The policy must have options for a number of days (delete data older than) or "never" - to allow unlimited retention
- The user must be able to change a retention policy that they previously set; however, it is not expected that any data would be recovered if they changed, for example, from 7 days to 21 days. Any data over 7 days (in this example) that has already been deleted does not have to be recoverable.
- When data is older than the retention policy, the data should begin to be deleted; however, deletion is not expected to be immediate, and data may be deleted in batches
- Determining the data age must be based on the timestamp of each record, and not "when" the data arrived; meaning - if the retention days is set at 30, and a late arriving record comes in with a timestamp that is already 29 days old, that record, despite being just added, will become eligible for deletion about a day later
- When a user is configuring a retention period and is setting it lower than its current setting, the user should be warned that it could result in existing data being deleted, and they should confirm this is what they want to do.
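To make the timestamp-based rule and the confirmation rule above concrete, here is a minimal Python sketch. It is illustrative only: the function names are hypothetical and not part of InfluxDB, and `None` is an assumed way to represent the "never" option.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_expired(record_time: datetime, retention_days: Optional[int], now: datetime) -> bool:
    """Return True if a record is eligible for deletion.

    Age is measured from the record's own timestamp, not its arrival time.
    retention_days=None models the "never" option (assumed representation).
    """
    if retention_days is None:
        return False
    return record_time < now - timedelta(days=retention_days)


def lowering_requires_confirmation(current_days: Optional[int], new_days: Optional[int]) -> bool:
    """Return True if the change shortens retention and could delete existing data."""
    if new_days is None:
        return False  # switching to "never" cannot expire anything
    if current_days is None:
        return True  # going from "never" to any finite period shortens retention
    return new_days < current_days


now = datetime(2025, 3, 31, tzinfo=timezone.utc)
late_arrival = now - timedelta(days=29)  # written today, timestamped 29 days ago

print(is_expired(late_arrival, 30, now))             # not yet expired, about one day left
print(is_expired(now - timedelta(days=31), 30, now)) # expired
print(lowering_requires_confirmation(21, 7))         # shortening should prompt the user
```

Deletion itself can still run in batches; the check only decides eligibility.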
Post-MVP General Requirements (Table Level Retention)
- At a table level, a customer must be able to set a retention period
- The policy must have options for a number of days (delete data older than) or "never" - to allow unlimited retention
- The policy must not supersede the max policy of the database; meaning - if the database is set to 90 days, then the table retention cannot be set higher than 90 days (but it can be set lower, or to any value if the database is set to "never").
- The user must be able to change a retention policy that they previously set; however, it is not expected that any data would be recovered if they changed, for example, from 7 days to 21 days. Any data over 7 days (in this example) that has already been deleted does not have to be recoverable.
- When data is older than the retention policy, the data should begin to be deleted; however, deletion is not expected to be immediate, and data may be deleted in batches
- Determining the data age must be based on the timestamp of each record, and not "when" the data arrived; meaning - if the retention days is set at 30, and a late arriving record comes in with a timestamp that is already 29 days old, that record, despite being just added, will become eligible for deletion about a day later
- When a user is configuring a retention period and is setting it lower than its current setting, the user should be warned that it could result in existing data being deleted, and they should confirm this is what they want to do
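The rule that a table's retention cannot exceed the database's could be checked along these lines. This is a sketch under assumptions: the function name is hypothetical, `None` stands for "never", and treating a "never" table under a finite database period as invalid is my reading of the requirement, not something stated outright.

```python
from typing import Optional


def validate_table_retention(table_days: Optional[int], database_days: Optional[int]) -> None:
    """Raise ValueError if the table period exceeds the database period.

    None models "never" (assumed representation).
    """
    if database_days is None:
        return  # database keeps data forever, so any table value is allowed
    if table_days is None or table_days > database_days:
        raise ValueError(
            f"table retention must not exceed the database retention of {database_days} days"
        )


validate_table_retention(30, 90)     # fine: lower than the database
validate_table_retention(365, None)  # fine: database is set to "never"
try:
    validate_table_retention(120, 90)
except ValueError as err:
    print(err)
```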
(other non-MVP Requirements)
- Users must be able to set the database retention as part of the create database function (as additional parameters), vs. a separate step
- Users have an API that allows them to set the retention periods (for MVP, CLI is satisfactory)
Global RP: Core & Enterprise; Table RP: Enterprise
Just a few notes:
- I think we need to refer to this as “retention period” (like we do in v2) instead of “retention policy”. Retention policy has some baggage associated with it since, in v1, it was part of the data model and fully-qualified measurement name (db.rp.measurement).
- I think there are two levels of retention enforcement we should/need to consider and implement:
  - Query level: queries should filter out any data outside of the queried table/database’s retention period, even though the data may still exist in the object store.
  - Storage level: this is where expired data actually gets evicted from the object store. AFAIK, we can’t evict individual lines in a Parquet file (yet), so to truly evict expired data, we have to remove the entire Parquet file (query level retention enforcement will keep expired data out of query results). This means that all data in the Parquet file needs to be expired before we can actually remove it. As data is compacted, the time range covered by each Parquet file grows, so it will actually take longer for Parquet files to fully expire. I think this is fine, we just need to clearly set the expectation for users that evictions (on the storage level) aren’t immediate.
- In all previous InfluxDB versions, the minimum retention period is 1h. I think we should still allow this in InfluxDB 3. Does retention period flexibility complicate the implementation or do we need to force sane increments (1h, 12h, 1d, 90d, etc.)? If it doesn’t complicate the implementation, I think giving users freedom with a minimum of 1h is the way to go. This will mostly affect query-level retention enforcement. Storage level retention will be enforced much less often, so letting users define fairly custom retention periods won’t affect storage level retention as much.
- I think this is implied in the requirements above, but I’d like to explicitly call it out. All databases/tables created to this point have been created without a retention period. The MVP needs to run a catalog migration to add an infinite retention period to all existing databases. We also need to add the ability to update a database/table via the API and CLI:
  - PATCH /api/v3/configure/database
  - PATCH /api/v3/configure/table
  - influxdb3 update database
  - influxdb3 update table
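On the question of flexible retention periods with a 1h minimum, parsing and validating values like 1h, 12h, 1d, or 90d is cheap. The Python sketch below is illustrative only: the function name and the restriction to the `h` and `d` units are assumptions, not the actual InfluxDB 3 parser.

```python
import re
from datetime import timedelta

MIN_RETENTION = timedelta(hours=1)       # minimum used by previous InfluxDB versions
_UNITS = {"h": "hours", "d": "days"}     # assumed unit set for this sketch
_PATTERN = re.compile(r"^(\d+)([hd])$")


def parse_retention_period(text: str) -> timedelta:
    """Parse strings such as '1h' or '90d' and enforce the 1h minimum."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized retention period: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    period = timedelta(**{_UNITS[unit]: amount})
    if period < MIN_RETENTION:
        raise ValueError("retention period must be at least 1h")
    return period


for value in ("1h", "12h", "1d", "90d"):
    print(value, parse_retention_period(value))
```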
Another thought. We probably wouldn’t need this for the MVP, but we may also consider write-level retention enforcement: any points with timestamps outside of the target database/table’s retention period would get rejected on write.
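To compare the enforcement levels discussed in these comments (query, storage, and write), here is a rough Python sketch. Everything in it is hypothetical, including `ParquetFileMeta` and the function names; it only illustrates the idea that a file can be removed once its newest row has expired, which is why compacted files with wider time ranges take longer to evict.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class ParquetFileMeta:
    """Hypothetical per-file metadata: the time range covered by the file."""
    path: str
    min_time: datetime
    max_time: datetime


def query_visible(row_times: List[datetime], now: datetime, retention: timedelta) -> List[datetime]:
    """Query level: hide rows older than the cutoff even if still stored."""
    cutoff = now - retention
    return [t for t in row_times if t >= cutoff]


def removable_files(files: List[ParquetFileMeta], now: datetime, retention: timedelta) -> List[ParquetFileMeta]:
    """Storage level: a file is removable only when its newest row has expired."""
    cutoff = now - retention
    return [f for f in files if f.max_time < cutoff]


def accept_on_write(point_time: datetime, now: datetime, retention: timedelta) -> bool:
    """Write level: reject points that are already outside the retention period."""
    return point_time >= now - retention


now = datetime(2025, 3, 31, tzinfo=timezone.utc)
week = timedelta(days=7)
days_ago = lambda n: now - timedelta(days=n)

files = [
    ParquetFileMeta("small.parquet", days_ago(10), days_ago(8)),      # fully expired
    ParquetFileMeta("compacted.parquet", days_ago(10), days_ago(2)),  # partly expired
]
print([f.path for f in removable_files(files, now, week)])
print(len(query_visible([days_ago(10), days_ago(2)], now, week)))
print(accept_on_write(days_ago(9), now, week))
```

With a 7-day period, only the fully expired file is removable; the compacted file stays in the object store, and query-level filtering keeps its expired rows out of results.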
We should not spend too much time discussing product complexity caused by changes in the technical implementation; we should pay more attention to whether users need this capability. InfluxDB 1.x and 2.x, as well as ClickHouse and other products, all expire data automatically. This is a very useful feature that reduces users' archiving work, resource consumption, and so on. Hopefully this capability will be on the agenda soon; there is also the 5-library limit, which should not be something we abandon with influxdb3. We are users of InfluxDB 1.x, but InfluxDB 3 has abandoned a lot of the good design in InfluxDB 1.x. I find that very unfortunate, and I hope the InfluxDB community will pay attention to these issues!
This feature is a priority. Right now we have it on the agenda for the 3.1 release, which will come in mid to late May.
Will the retention period be able to be defined and modified post-creation of a database, e.g., so that a retention period can be added to a database created now with 3.0?
@lnjustin yes.
This work is now complete in Core. Enterprise has a few things left, tracked in https://github.com/influxdata/influxdb_pro/issues/895
@mgattozzi the link is broken :(