
Time Series Retention Policy

Open • raskle opened this issue 7 years ago • 7 comments

Description

Pilosa allows users to configure a Time Quantum to bucket time series SetBits. These buckets are internally managed per Frame in Views, with new Views created as time progresses.

Currently these Views are never removed. A rolling data retention window would be useful, similar to an ELK stack with a 30-day retention policy. Since we already have a built-in aggregation system with the Time Quanta, we can support more granular data retention policies.

Currently supported Time Quanta:

  • Y - Year
  • M - Month
  • D - Day
  • H - Hour

For example, an incoming SetBit with a time stamp of 2016010215 would create the following Views:

  • 2016
  • 201601
  • 20160102
  • 2016010215
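For illustration, here is a minimal Go sketch of that expansion (it assumes nothing about Pilosa's internals; `viewsForTime` and its layout table are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// viewsForTime expands a timestamp into one view name per quantum
// character in q (a subset of "YMDH"), matching the example above.
func viewsForTime(t time.Time, q string) []string {
	layouts := map[rune]string{
		'Y': "2006",       // year
		'M': "200601",     // year + month
		'D': "20060102",   // year + month + day
		'H': "2006010215", // year + month + day + hour
	}
	var views []string
	for _, c := range q {
		if layout, ok := layouts[c]; ok {
			views = append(views, t.Format(layout))
		}
	}
	return views
}

func main() {
	ts := time.Date(2016, time.January, 2, 15, 0, 0, 0, time.UTC)
	fmt.Println(viewsForTime(ts, "YMDH"))
	// Output: [2016 201601 20160102 2016010215]
}
```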

Looking back, Graphite's Carbon database had a similar tiered time series aggregation policy. For example, you could store metrics quantized to every 15 seconds for 15 days, and you could combine these into complex overlapping aggregation policies for any time quantum. While offering unlimited flexibility, that system was likely too complex, missing the simplified interface that users really needed.

I would propose that we expand Pilosa's time quantum configuration with a retention period for each quantum.

If you specified YMDH with a retention of 1 month, we would maintain views for:

  • Years - to ∞
  • Months - 1
  • Days - Max days for current month
  • Hours - All hour quanta for current month

We could then layer on a 48-hour rule:

  • Years - to ∞
  • Months - 1
  • Days - Max days for current month
  • Hours - Last 48 hours

Retention Policy

It could be specified in the following structure:

{
  "timeQuantum": "YMDH",
  "retention": {
    "year": 10,
    "month": 1,
    "day": 7,
    "hour": 48
  }
}
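As a sketch, that payload could decode into a Go options struct along these lines (the field and type names are illustrative, not Pilosa's actual API):

```go
// Hypothetical decoding of the proposed payload.
type RetentionOptions struct {
	TimeQuantum string    `json:"timeQuantum"`
	Retention   Retention `json:"retention"`
}

// Counts are in units of the named quantum; a zero value means the
// quantum was not explicitly stated in the policy.
type Retention struct {
	Year  int `json:"year,omitempty"`
	Month int `json:"month,omitempty"`
	Day   int `json:"day,omitempty"`
	Hour  int `json:"hour,omitempty"`
}
```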

The default assumption is to store a quantum to ∞ unless explicitly stated. Lower time quanta are only kept up until the highest retention rule unless explicitly stated.
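Those two rules could be expressed roughly as follows, building on the hypothetical Retention struct above. This is a sketch only; it uses rolling windows, whereas the discussion below considers calendar-boundary deletion:

```go
import (
	"strings"
	"time"
)

const quantaOrder = "YMDH" // coarsest to finest

// expired reports whether a view at quantum q whose bucket starts at
// viewTime should be deleted. An explicit per-quantum count wins;
// otherwise a quantum finer than the coarsest explicit rule inherits
// its window, and anything coarser is kept to ∞.
func expired(viewTime, now time.Time, q rune, r Retention) bool {
	counts := map[rune]int{'Y': r.Year, 'M': r.Month, 'D': r.Day, 'H': r.Hour}
	cutoff := func(quantum rune, n int) time.Time {
		switch quantum {
		case 'Y':
			return now.AddDate(-n, 0, 0)
		case 'M':
			return now.AddDate(0, -n, 0)
		case 'D':
			return now.AddDate(0, 0, -n)
		default: // 'H'
			return now.Add(time.Duration(-n) * time.Hour)
		}
	}
	if n := counts[q]; n > 0 {
		return viewTime.Before(cutoff(q, n)) // explicit rule wins
	}
	for _, hq := range quantaOrder {
		if n := counts[hq]; n > 0 {
			if strings.IndexRune(quantaOrder, q) > strings.IndexRune(quantaOrder, hq) {
				return viewTime.Before(cutoff(hq, n)) // inherit the window
			}
			return false // coarser than the coarsest explicit rule: keep to ∞
		}
	}
	return false // no explicit rules at all: keep everything
}
```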

raskle • Aug 24 '17 03:08

I'm a little confused by this proposal. Let's use your example for discussion:

{ "timeQuantum": "YMDH", "retention": { "year": 10, "month": 1, "day": 7, "hour": 48 } }

When I first read this, I thought perhaps this means that Pilosa would retain up to:

  • 10 year views
  • 1 month view
  • 7 day views
  • 48 hour views

If that's the case, I believe there may be a couple of issues with that strategy. First, does this mean on the first of every month, Pilosa would delete the previous month and leave itself with only 1 month representing today? Or does the retention mean "1 month" in addition to this month? Second, we would need to verify what the query builder is including when you perform a query that covers a range that no longer has the expected granularity. For example, if today is 2017-08-24 and I query:

Range(start="2016-01-01T00:00", end="2017-08-31T00:00")

my query is going to be made up of three views:

  • 2016
  • 2017-07 (assuming 1-month retention means last month and this month)
  • 2017-08

Obviously this wouldn't be very meaningful. Was your intention that the user would understand that they can't perform this query and still get accurate data with the retention policy they have? Would we instead raise an error for any query that is not supported by the current retention policy? Thinking about this, with that retention policy, the only queries that would make sense would be whole year(s) queries, whole day(s) queries, or whole hour(s) queries. No combination of those would be useful. Maybe that's fine, but it's not clear to me that that's your intention.
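For concreteness, here is a simplified sketch of the kind of greedy decomposition a query planner might do, covering [start, end) with the coarsest whole views that fit. This is illustrative, not Pilosa's actual query builder; the point is that under the retention policy above, most of the month and day views it would select no longer exist:

```go
import "time"

// viewsByTimeRange covers [start, end) with the coarsest whole views
// that fit, assuming start and end fall on hour boundaries.
func viewsByTimeRange(start, end time.Time) []string {
	var views []string
	for t := start; t.Before(end); {
		switch {
		case t.Month() == time.January && t.Day() == 1 && t.Hour() == 0 &&
			!t.AddDate(1, 0, 0).After(end): // a whole year fits
			views = append(views, t.Format("2006"))
			t = t.AddDate(1, 0, 0)
		case t.Day() == 1 && t.Hour() == 0 &&
			!t.AddDate(0, 1, 0).After(end): // a whole month fits
			views = append(views, t.Format("200601"))
			t = t.AddDate(0, 1, 0)
		case t.Hour() == 0 && !t.AddDate(0, 0, 1).After(end): // a whole day fits
			views = append(views, t.Format("20060102"))
			t = t.AddDate(0, 0, 1)
		default: // fall back to hour views
			views = append(views, t.Format("2006010215"))
			t = t.Add(time.Hour)
		}
	}
	return views
}
```

For the range above this yields 2016, then 201701 through 201707, then day views for August; against the retained views, the planner would find 2016 but none of the January through June 2017 months, which is the gap in question.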

Now, that was all based on my assumption of how the proposed retention policy would work. When I got to this statement:

Lower time quanta are only kept up until the highest retention rule unless explicitly stated.

I wondered if perhaps I'm not understanding the proposal. Either way, maybe you can provide some other examples for how the user would configure the retention policy and what queries that would support. I kind of see the benefit of storing, for example, 10 year views while only storing 48 hour views, but I wonder if the benefit of that is worth the complexity/confusion over having a retention policy that just stores "the last 48 hours".

travisturner • Aug 24 '17 14:08

@travisturner my example retention policy { "timeQuantum": "YMDH", "retention": { "year": 10, "month": 1, "day": 7, "hour": 48 } } was perhaps a poor choice. I included it to show the full scope of possibilities for each time quantum.

Instead I would imagine a more common retention policy, like the last month: { "timeQuantum": "MDH", "retention": { "month": 1 } }

In this example, an incoming SetBit with a time stamp of 2016020215 would create the following Views:

  • 201602
  • 20160202
  • 2016020215

To support range queries over a rolling 30-day window, I envisioned we would maintain views for the current and past month. Once the current time progresses to 2016030100, all views associated with 2016 month 01 would be deleted. This enforces the implicit rule that lower time quanta are only kept up until the highest retention rule unless explicitly stated, since all Day and Hour Views that fall under month 01 are deleted along with it.
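A sketch of that rollover cleanup, assuming view names follow the YYYYMM[DD[HH]] convention from the examples and that year views (if any) are untouched by this rule; `deleteView` is a hypothetical hook, not an existing Pilosa function:

```go
import "time"

// pruneMonthly keeps the current calendar month plus retainMonths
// previous months, and deletes every month, day, and hour view that
// falls under an older month.
func pruneMonthly(views []string, now time.Time, retainMonths int, deleteView func(string)) {
	cutoff := time.Date(now.Year(), now.Month(), 1, 0, 0, 0, 0, time.UTC).
		AddDate(0, -retainMonths, 0)
	for _, name := range views {
		if len(name) < 6 {
			continue // a 4-digit year view: not covered by this rule
		}
		monthStart, err := time.Parse("200601", name[:6])
		if err != nil {
			continue // not a time view; leave it alone
		}
		if monthStart.Before(cutoff) {
			deleteView(name) // drops month, day, and hour views alike
		}
	}
}
```

With now = 2016030100 and retainMonths = 1, the cutoff is 201602, so 201601 and any views under it would be deleted, while the 201602, 20160202, and 2016020215 views from the example survive.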

More complicated retention policies with explicit granularity for lower time quanta could save memory and disk space. This could be useful for some applications, with the understanding that certain range queries are no longer possible once the Views are truncated. There is a tradeoff between flexibility and complexity; my goal is to simplify setup while supporting the most likely use cases.

raskle • Aug 24 '17 19:08

That makes sense. Should we restrict the retention value to have only one attribute? (i.e. return an error if the user defines a retention with more than one value, like { "timeQuantum": "YMDH", "retention": { "year": 5, "month": 1 } })

Now that I think about it, it still seems odd not to have the retention value be in a common unit if that's how we are expecting to use it. Why the need for different keys in the retention map? Why not just "retention": "30d" or something like that?
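If the design did settle on a single duration-style value, parsing it is straightforward; here is a sketch assuming days are written as Nd (Go's time.ParseDuration has no day unit, so that suffix is handled by hand):

```go
import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRetention converts a value like "30d" or "48h" into a
// time.Duration. The "d" suffix is handled explicitly because Go's
// time.ParseDuration only understands units up to hours.
func parseRetention(s string) (time.Duration, error) {
	if strings.HasSuffix(s, "d") {
		days, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil {
			return 0, fmt.Errorf("invalid retention %q: %w", s, err)
		}
		return time.Duration(days) * 24 * time.Hour, nil
	}
	return time.ParseDuration(s) // "48h", "90m", etc.
}
```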

travisturner • Aug 24 '17 19:08

I like your suggestion of a single retention key. This makes the configuration simpler and eliminates the problem of overlapping policies.

raskle • Aug 24 '17 20:08

So, what is the status of this issue now?

young118 • Jan 13 '20 09:01

@young118 We haven't put much work into time fields at all lately. We definitely have some improvements that we want to make to time fields, including things like data expiration, but realistically that work won't begin until mid-2020.

travisturner • Jan 13 '20 15:01

@travisturner OK, thanks anyway.

young118 • Jan 20 '20 02:01