[Feature Request] stateDuration node should increment and emit "duration" on interval

Open AdamSLevy opened this issue 7 years ago • 10 comments

Summary

The stateDuration node allows you to create a new field representing how long a given state has been true. The current implementation depends entirely on incoming data: the node emits a new point only when it receives one. This means the node cannot accurately track durations at the resolution of the given unit when points arrive less often than once per unit.

Example

Here is a stub of a TICKscript that alerts when a door has been open for more than 3 seconds:

var data = stream|from()...

var state = data
    |eval(lambda: "door_closed" == FALSE)
        .as('state')
        .keep()
        .quiet()

var stateDuration = state
    |stateDuration(lambda: "state")
        .unit(1s)
        .as('duration')

stateDuration
    |alert()
        .topic(topic)
        .info(lambda: "duration" > 3)

Data

The following data is written once per minute while the value is unchanged; changes are written immediately. At 16:12:52 door_closed changed to false, and it stayed false until 16:12:57, a span of 5 seconds.

time                          door_closed
----                          -----------
2018-01-13T16:14:42.398-09:00 true
2018-01-13T16:13:42.398-09:00 true
2018-01-13T16:12:57.398-09:00 true
2018-01-13T16:12:52.398-09:00 false
2018-01-13T16:12:42.398-09:00 true
2018-01-13T16:11:42.398-09:00 true

Problem behavior

Clearly the door was open for longer than 3 seconds, but kapacitor remains oblivious because the stateDuration node only emits when it receives data, and the next data value it received reset the stateDuration because the door was closed.
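
The behavior can be reproduced with a small simulation. The sketch below is Python, not TICKscript or Kapacitor code. The function name `data_driven_durations` is made up for illustration, and the sketch assumes the node reports 0 on the first point where the state is true and -1 whenever the state is false. It replays the six points above and computes a duration only when a point arrives.

```python
from datetime import datetime

# Points from the issue, oldest first: (timestamp, door_closed).
POINTS = [
    ("2018-01-13T16:11:42.398-09:00", True),
    ("2018-01-13T16:12:42.398-09:00", True),
    ("2018-01-13T16:12:52.398-09:00", False),
    ("2018-01-13T16:12:57.398-09:00", True),
    ("2018-01-13T16:13:42.398-09:00", True),
    ("2018-01-13T16:14:42.398-09:00", True),
]


def data_driven_durations(points, unit_seconds=1.0):
    """Return one duration per incoming point (assumed semantics, see lead-in)."""
    start = None  # time at which the state last became true
    durations = []
    for ts, door_closed in points:
        now = datetime.fromisoformat(ts)
        if door_closed:
            # State is false: reset and report -1 (assumption).
            start = None
            durations.append(-1.0)
        else:
            # State is true: measure from the first point that was true.
            if start is None:
                start = now
            durations.append((now - start).total_seconds() / unit_seconds)
    return durations


durations = data_driven_durations(POINTS)
alerts = [d for d in durations if d > 3]  # mirrors .info(lambda: "duration" > 3)
print(durations)
print(alerts)
```

Under these assumptions the only duration computed while the door is open is 0, so the `"duration" > 3` condition never matches.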

Expected/Desired behavior

If the state that a stateDuration node is processing becomes true, it should emit a metric on the interval specified by the .unit() attribute so that any downstream processing has an up to date duration field to work with.
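
A rough sketch of the requested behavior, again as a Python simulation rather than Kapacitor code. The function `interval_durations` and its timer logic are hypothetical: while the state is true, it emits an extra duration on every multiple of the unit until the next point arrives. The simulation stops ticking at the last point, whereas a real timer would keep running.

```python
from datetime import datetime, timedelta

# Points from the issue, oldest first: (timestamp, door_closed).
POINTS = [
    ("2018-01-13T16:11:42.398-09:00", True),
    ("2018-01-13T16:12:42.398-09:00", True),
    ("2018-01-13T16:12:52.398-09:00", False),
    ("2018-01-13T16:12:57.398-09:00", True),
    ("2018-01-13T16:13:42.398-09:00", True),
    ("2018-01-13T16:14:42.398-09:00", True),
]


def interval_durations(points, unit=timedelta(seconds=1)):
    """Emit (time, duration in units) on each point and on each unit tick while the state is true."""
    parsed = [(datetime.fromisoformat(ts), closed) for ts, closed in points]
    emitted = []
    start = None  # time at which the state last became true
    for i, (now, door_closed) in enumerate(parsed):
        if door_closed:
            start = None
            emitted.append((now, -1.0))
            continue
        if start is None:
            start = now
        emitted.append((now, (now - start) / unit))
        # Hypothetical timer: tick on multiples of the unit since `start`
        # until the next point arrives and takes over.
        next_time = parsed[i + 1][0] if i + 1 < len(parsed) else now
        k = int((now - start) / unit) + 1
        tick = start + k * unit
        while tick < next_time:
            emitted.append((tick, (tick - start) / unit))
            tick += unit
    return emitted


emitted = interval_durations(POINTS)
alerts = [d for _, d in emitted if d > 3]  # mirrors .info(lambda: "duration" > 3)
print([d for _, d in emitted])
print(alerts)
```

With the same data, the simulated timer emits durations 1 through 4 between the two real points, so the 3-second threshold is crossed before the door closes.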

Possible workarounds

Of course, one could adopt a rule of never trying to detect a state change at a resolution below the write frequency of the underlying points. But in my current data collection setup it is not convenient to adjust the write frequency for just this one data point, and it's certainly not convenient to maintain a unique write frequency for each point whose state changes I want to detect.

Alternatively, I might be able to join the stream with a batch query that fetches the latest value at the interval of the state change resolution I want to detect. This could be written as a template that allows the resolution duration to be specified. The downside is that I suspect it will also suffer from limited accuracy. For example, if I have a stateDuration at a resolution of 1 minute and the state changes just after the regular batch update, then in the worst case I may have to wait nearly an entire additional minute before a duration above the threshold is emitted, assuming no other data comes in. This also strikes me as unnecessarily inefficient when the stateDuration node already has everything it needs.

Please let me know if you have any questions. Thank you.

AdamSLevy avatar Jan 14 '18 02:01 AdamSLevy

@AdamSLevy I think this may be a feature request rather than a bug report.

Essentially, you'd like to emit the current duration that something has been in a certain state on a regular interval. My concern with doing this is that it assumes that data never arrives late. For example, suppose that I have a bit of network connectivity issues for ~10s and the data doesn't arrive for 10s. If Kapacitor has an internal clock and it emits that data every interval, I'll have a bunch of false positives. Do you imagine that this would be an issue? Or am I off base?
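
The concern can be illustrated with a minimal Python sketch. It is not Kapacitor code: `wall_clock_durations` is a hypothetical helper, times are arrival times in seconds, and the 2-second and 12-second arrivals are made-up numbers chosen to mimic a roughly 10-second network delay.

```python
def wall_clock_durations(open_arrival, close_arrival, unit=1.0):
    """Durations a wall-clock timer would emit between two arrivals (seconds)."""
    durations = []
    tick = open_arrival + unit
    while tick < close_arrival:
        durations.append((tick - open_arrival) / unit)
        tick += unit
    return durations


# The state was really true for 2 s, but in the second case the point
# that ends it arrives 10 s late.
on_time = wall_clock_durations(0.0, 2.0)
delayed = wall_clock_durations(0.0, 12.0)

false_positives = [d for d in delayed if d > 3]
print(on_time)
print(false_positives)
```

In the delayed case the timer keeps counting until the late point arrives, so a threshold of 3 is crossed even though the real state lasted only 2 seconds.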

desa avatar Apr 03 '18 16:04 desa

Thank you so much for your response. Glad to see this is getting some attention.

> I think this may be a feature request rather than a bug report.

You're right. Having read the documentation this isn't a bug in the sense that the behavior differs from what is documented. That being said, it certainly feels like a bug when no alerts arrive. The documentation requires a somewhat thorough and careful reading to understand how this node is supposed to work. In the meantime, it might be worth reworking the docs to make this behavior more obvious.

> Essentially, you'd like to emit the current duration that something has been in a certain state on a regular interval.

Exactly right. I think this might look similar to the stats node in the sense that it would have its own clock, but it might start and stop emitting based on the state. The point of the stateDuration node (from my perspective, at least) is to trigger events when a state is true for a period of time. The current implementation makes it impossible to detect states that occur for durations less than the emit period of the underlying data.

> My concern with doing this is that it assumes that data never arrives late.

Great point, but I don't see this as an issue for my use case. The issue specifically concerns infrequently emitted data. Taking longer to learn about data loss or connectivity issues comes with the territory for such data. I can still set a deadman on that data, but I have to set the time threshold higher than I would for frequently emitted data. This is an inherent trade-off of collecting data this way. The whole point here is to allow alerting on states with durations shorter than the emit period.

In my case I have other data coming from the same source that is changing at a higher frequency. I can set a deadman alert on the entire measurement to let me know of connectivity issues sooner than I would if I was just watching the infrequently emitted data.

> If Kapacitor has an internal clock and it emits that data every interval, I'll have a bunch of false positives.

One would need to use the 'alert only on state changes' option to avoid this. A subsequent deadman alert would let the user know that the previous alerts may be unreliable. You would probably want this implemented as a new opt-in option on the stateDuration node, leaving the node's existing behavior unchanged so that existing TICKscripts remain backwards compatible.

Motivation

Just to paint the whole picture of why someone would even want to collect data this way, consider a scenario where you are collecting process data from a remote site with a limited bandwidth and monthly data limits. You sample data at a high frequency and store the full resolution locally, but you have to upload only rollups and value changes to meet the bandwidth requirements. For multi-state (string) or boolean values this means only ever sending on change and possibly once every 1 or 2 minutes. You could run all of the alerts locally but changing software on this remote site is inconvenient and is always at least somewhat risky.

Also consider that the emit frequency of data is not always easy to control separately for individual fields. Additionally these data collection decisions are fully separated from the writing of TICKscripts. So the user has to think ahead and ask, "What will the highest resolution alert on this data be?" In my experience that isn't the typical thought process. Generally I ask, "What's the highest resolution I need to be able to see this data at?" and an acceptable resolution for data visualization can be much lower than the resolution that you might want to alert at, especially for non-quantitative values like Booleans or strings.

It is quite common to realize something new that you want to be alerted on after setting up the data collection initially. The ideal is to just be able to throw together a new TICKscript for that alert. But if you are trying to detect something at a resolution below the emit period then you would need to revisit your data collection to increase the emit frequency. And as I just described that's just not always practical.

Implementation Considerations

If the stateDuration node emits on its own clock, then it is worth asking what, if any, other data should it be emitting with the duration field. Normally when you get an alert you have some relevant fields that were emitted together allowing you to have a more descriptive alerting message that can include field values. The stats node, in contrast, drops all data since it operates on its own clock.

I think it would be most useful if the stateDuration node retains the latest value of any fields that pass through it and emits the latest values of all fields it has seen whenever it emits a new duration. I suspect that might violate a paradigm that kapacitor follows, and in my mind it leads to a larger question about how TICKscript is limited because it cannot really hold any state outside of the current fields passing through a node. There are some limited exceptions to this like the alert node maintaining the alert level or the sigma function in lambdas. But there are a number of creative ways a user might want to process data without regard for the emit frequency or time alignment of data. I know it is possible to work around much of this by properly aligning and joining data.
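
As a minimal sketch of the "retain the latest fields" idea, in Python rather than Kapacitor's Go: the class `LatestFields`, its method names, and the example field names are all hypothetical. It merges the fields of every point it sees and attaches a copy of them to each timer-driven duration.

```python
class LatestFields:
    """Remembers the most recent value of every field that has passed through."""

    def __init__(self):
        self.latest = {}

    def observe(self, fields):
        # Newer values overwrite older ones; untouched fields are kept.
        self.latest.update(fields)

    def emit(self, duration):
        # Copy so the emitted point does not alias the internal cache.
        point = dict(self.latest)
        point["duration"] = duration
        return point


cache = LatestFields()
cache.observe({"door_closed": False, "temperature": 20.0})  # made-up fields
cache.observe({"temperature": 21.5})
point = cache.emit(4.0)
print(point)
```

A timer-driven emit would then carry the last known field values, which keeps alert messages descriptive at the cost of holding per-node state.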

Another way to address this issue is to add a duration option to the alert node itself, so that you could write:

var data = stream|from() ...

data
    |alert()
        .crit(lambda: "value" <= 100.0)
        // critDuration is the option proposed here
        .critDuration(10s)

This should still work the same way that I have described how I would like the stateDuration node to work so I recognize this just shifts the problem around. But it might make more sense to put it here with the alert node anyway. That's more of a style/design direction choice.

I'm happy to discuss any of this further. Thanks again for your time and attention on this.

AdamSLevy avatar Apr 03 '18 20:04 AdamSLevy

@AdamSLevy Thanks for your thorough response. We'll definitely keep this in mind. Would you have any interest in developing the feature yourself? I'd be happy to help guide you through the implementation.

desa avatar Apr 04 '18 15:04 desa

That would be cool. I am strong in Go but not fully up to speed on the kapacitor code base so guidance is more than welcome. I also want to make sure that we are clear on the design before I start banging away on some code.

Feel free to email me directly if that is a better channel for discussing this. Otherwise, continuing in the comments here is OK.

adam at aslevy dot com

AdamSLevy avatar Apr 04 '18 16:04 AdamSLevy

@AdamSLevy sorry for the long delay. I'm happy to talk about feature development here. I think it might be easier.

Ideally this would be implemented as a new node. I'm open to names. My guess is that it will be implemented similarly to https://github.com/influxdata/kapacitor/blob/master/state_tracking.go

Feel free to ask any relevant questions here. I'll do my best to respond promptly.

desa avatar Apr 10 '18 14:04 desa

Great. Thanks. I'll take a closer look at this and at the stats node since that has its own clock that it emits on. I don't expect to have time to write code until next week though FYI.

AdamSLevy avatar Apr 10 '18 19:04 AdamSLevy

I am afraid that new projects have taken precedence over working on a solution to this issue. However, I am getting back into kapacitor, and I see that there is now a Barrier node that emits points on an interval. Would this be able to kick a stateDuration node into emitting a duration?

I'll have to test and report back.

AdamSLevy avatar Aug 07 '18 20:08 AdamSLevy

No, it doesn't look like the barrier node will solve this.

AdamSLevy avatar Aug 08 '18 00:08 AdamSLevy

Any updates on this feature request?

aliakseiz avatar Nov 11 '20 07:11 aliakseiz

@psteinbachs -- let's review/triage further.

timhallinflux avatar Nov 12 '20 04:11 timhallinflux