Refine pipeline predictions to use only recent data
- [ ] This issue is blocking
- [ ] This issue is causing unreasonable pain
Currently, the Kusto query that powers the pipeline predictions takes all available data in our Kusto database to compute the mean and standard deviation. This is not completely ideal (but has proven accurate in testing); we'd like to use only recent data to be able to adapt to changes in pipelines.
What we would like to do is take, say, the last ~30 days of data per pipeline. There isn't a completely idiomatic way to do this in Kusto.
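As a rough illustration, a time-limited version of the aggregation could look like the sketch below. The table and column names (`PipelineRuns`, `PipelineId`, `StartTime`, `DurationMinutes`) are hypothetical, since the actual schema isn't shown here:

```kusto
// Hypothetical table/column names, for illustration only.
// Restrict the aggregation to the last ~30 days instead of all available data.
PipelineRuns
| where StartTime > ago(30d)                    // only recent runs
| summarize MeanDuration   = avg(DurationMinutes),
            StdDevDuration = stdev(DurationMinutes)
        by PipelineId                           // one estimate per pipeline
```

The catch is that a fixed `ago(30d)` cutoff can leave very few (or zero) runs for pipelines that run infrequently, which is the tradeoff discussed further down.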
> This is not completely ideal (but has proven accurate in testing),
I think the only reason this has held is that there haven't been many changes to the pipelines. The reason we need to limit the data is that if there's a drastic change in a pipeline, we will probably only pick up the duration change after a very long time, since we're always taking all the data.
I think this is probably the most actionable thing we can do to reduce Kusto usage from Queue Insights right now, so I'll take a look at this.
This is a tricky query to improve without making some substantial changes to Queue Insights:
- We want to look at the latest 30 runs of a given pipeline and adjust our predictions based on that (see the sketch after this list).
- We don't know how far back in time we have to go to get those 30 runs, especially for pipelines that don't run often.
- If we are too aggressive in picking a cutoff time for the data we look at, the accuracy of the predictions goes down.
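One way to express "the latest 30 runs per pipeline, however old they are" in KQL is the `partition` operator. As before, the table and column names are hypothetical and this is only a sketch of the idea:

```kusto
// Hypothetical table/column names, for illustration only.
// Take the 30 most recent runs of each pipeline, no matter how far back
// they go, then compute the statistics over just those runs.
PipelineRuns
| partition hint.strategy=native by PipelineId
  (
      top 30 by StartTime desc
  )
| summarize MeanDuration   = avg(DurationMinutes),
            StdDevDuration = stdev(DurationMinutes)
        by PipelineId
```

This avoids guessing a cutoff time, but the query still has to look at the full history of the table, which is presumably part of why it's hard to do this cheaply from Kusto alone.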
A solution we could pursue is to write the data we get from Kusto to an Azure Storage table, and use the info in the table instead of querying Kusto directly.
Given the recent improvements in Kusto usage, it doesn't seem like it's worth making these changes at this time, so I'm moving this back to the backlog for now.