
Checking feature queries as part of Validate API recommendation

Open amitgalitz opened this issue 2 years ago • 6 comments

Is your feature request related to a problem? Please describe. One of the non-blocking goals of the Validate API is to determine an optimal detector interval length recommendation by checking data sparsity with all configurations applied. In the original PR (#384), feature queries were taken into account by checking whether all fields referenced by the feature queries exist in a single document during a given interval. However, the logic behind this check was incorrect: the feature fields don't all have to exist in the same document; they can be spread across separate documents within the same interval. A follow-up PR (#412) removed this check from the interval recommendation, so we are no longer over-validating, but we also no longer take the feature queries into account when recommending an interval.

Describe the solution you'd like Possible solutions:

  1. To find an optimal interval and further identify whether a feature query is the root cause of sparse data, we can try multiple different intervals for each feature query the detector has.
    1. This means we might find a different interval suggestion for each feature query, and we can then recommend the longest interval across all feature queries.
    2. We then potentially need to decide how to handle the case where some feature queries lead to a different interval recommendation while others don't have enough data no matter the interval (change the response type so we can both provide an interval recommendation and flag the features with issues, or simply flag the features for which no interval was found).
  2. Add a sub-aggregation that looks for the feature fields inside each interval bucket of the date histogram aggregation that is currently implemented (see the sketch after this list).
    1. This will need further perf testing, as it would mean a sub-aggregation within up to 1440 different interval buckets, and this call already occurs multiple times for different intervals.
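
A rough sketch of option 2's query shape, not the plugin's actual implementation (the index, timestamp field, and feature field names are placeholders): each interval bucket gets an `exists` sub-aggregation per feature field, and a bucket counts as dense only if every feature field appears somewhere in it, even across different documents.

```json
# Hypothetical Dev Tools sketch for option 2: count, per interval bucket,
# how many documents carry each feature field. A bucket is "dense" when every
# feature's doc_count is > 0, even if the fields live in separate documents.
GET server-metrics/_search
{
  "size": 0,
  "query": { "range": { "timestamp": { "gte": "now-40d" } } },
  "aggs": {
    "by_interval": {
      "date_histogram": { "field": "timestamp", "fixed_interval": "10m" },
      "aggs": {
        "has_cpu": { "filter": { "exists": { "field": "cpu_usage" } } },
        "has_mem": { "filter": { "exists": { "field": "memory_usage" } } }
      }
    }
  }
}
```

For option 1, the same request would be re-issued with different `fixed_interval` values per feature query, keeping the longest interval that yields enough dense buckets.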

amitgalitz avatar Mar 15 '22 17:03 amitgalitz

This looks really interesting! From an end-user perspective, does this improve AD accuracy, performance, or something else?

elfisher avatar Apr 21 '22 12:04 elfisher

@elfisher From the user's perspective it means the Validation API is more accurate. In OpenSearch 1.3 we added a new Validation API that runs on the last step of creating an anomaly detector and validates whether the given configuration will likely produce a detector that successfully initializes and completes model training. Users can also call the Validation API directly through the backend: https://opensearch.org/docs/latest/monitoring-plugins/ad/api/#validate-detector.
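
For reference, a minimal call against the documented endpoint might look like the following sketch (detector name, index, and field names are made up for illustration):

```json
# Validate a draft detector configuration before creating it. The trailing
# "model" validation type asks for model-level, non-blocking feedback such as
# interval recommendations, per the linked docs.
POST _plugins/_anomaly_detection/detectors/_validate/model
{
  "name": "cpu-detector",
  "description": "Example detector config",
  "time_field": "timestamp",
  "indices": ["server-metrics"],
  "feature_attributes": [
    {
      "feature_name": "avg_cpu",
      "feature_enabled": true,
      "aggregation_query": {
        "avg_cpu": { "avg": { "field": "cpu_usage" } }
      }
    }
  ],
  "detection_interval": {
    "period": { "interval": 10, "unit": "Minutes" }
  }
}
```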


This enhancement will give the Validation API the ability to tell the user whether a specific feature field is causing sparse data, with a call-out that they should probably change that feature field or expect potentially longer initialization times, or no initialization at all. Currently the Validation API doesn't fully take the specific feature fields into account, only the other configurations.

Basically, before creating a detector, users will be even better informed about whether their configuration has any issues.

amitgalitz avatar Apr 21 '22 17:04 amitgalitz

Thanks for the clarification @amitgalitz! If you don't mind, can you create an issue in the doc repo to track this update for 2.1? Since this is an improvement to the API, we should make sure it gets documented.

elfisher avatar May 19 '22 13:05 elfisher

@amitgalitz should this be re-labeled as 2.3?

ohltyler avatar Aug 10 '22 18:08 ohltyler

> @amitgalitz should this be re-labeled as 2.3?

Good point, I'll actually remove the version label right now and discuss with Sean on priority.

amitgalitz avatar Aug 10 '22 18:08 amitgalitz

>> @amitgalitz should this be re-labeled as 2.3?
>
> Good point, I'll actually remove the version label right now and discuss with Sean on priority.

Sounds good - I'll set it as 2.3 tentatively.

ohltyler avatar Aug 10 '22 19:08 ohltyler

Sent the PR to fix the issue: https://github.com/opensearch-project/anomaly-detection/pull/1258

My solution is a little different:

  • When suggesting an interval, I use cold start queries if features exist (it is possible that users haven't defined features yet when invoking the suggest-interval API). If any one feature is missing, I regard the whole sample (which might include multiple features) as missing.
  • When checking feature sparsity, I now use the feature aggregation instead of an exists query (see the sketch after this list). People can write complicated queries in a feature, and the current logic can't reliably extract the feature field name. Also, as you mentioned in the issue, not all fields necessarily exist in the same document, and users may define runtime fields, which is even more complex.
  • If a feature aggregation returns enough non-empty values for cold start, I assume it won't cause sparsity.
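
A rough illustration of that second point, not the exact query from #1258 (index and field names are placeholders): running the feature's own aggregation inside each date-histogram bucket sidesteps field-name extraction, works for runtime fields, and lets intervals with no data surface as null values.

```json
# Hypothetical sketch: evaluate the feature aggregation per interval bucket.
# Buckets where "avg_cpu".value comes back null carry no data for the feature;
# counting the non-null buckets shows whether cold start has enough samples.
GET server-metrics/_search
{
  "size": 0,
  "query": { "range": { "timestamp": { "gte": "now-40d", "lte": "now" } } },
  "aggs": {
    "by_interval": {
      "date_histogram": { "field": "timestamp", "fixed_interval": "10m" },
      "aggs": {
        "avg_cpu": { "avg": { "field": "cpu_usage" } }
      }
    }
  }
}
```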

kaituo avatar Jul 08 '24 18:07 kaituo

Closing the issue for now. @amitgalitz feel free to reopen if you find my PR needs improvement.

kaituo avatar Jul 08 '24 18:07 kaituo