Checking feature queries as part of Validate API recommendation
Is your feature request related to a problem? Please describe.
One of the non-blocking goals of the validate API is to determine an optimal detector interval length recommendation by checking data sparsity with all configurations applied. In the original PR (#384), feature queries were taken into account by checking whether all fields referenced by the feature queries exist in a single document during a given interval. However, the logic behind this check was incorrect: the feature fields don't all have to exist in the same document; they can be spread across separate documents as long as those documents fall within the same interval. A follow-up PR (#412) removed this check from the interval recommendation, so we are no longer over-validating, but we are also no longer taking the feature queries into account when recommending an interval.
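To make the document-versus-interval distinction concrete, here is a minimal sketch of the kind of over-strict check #384 performed, using OpenSearch's Java `QueryBuilders`; the class name, method, and field names are illustrative stand-ins, not the original code:

```java
import org.opensearch.index.query.BoolQueryBuilder;
import org.opensearch.index.query.QueryBuilders;

import java.util.List;

public class OverStrictFeatureCheck {

    // The removed check effectively demanded that every feature field be
    // present in the *same* document: a bool/must of exists clauses.
    static BoolQueryBuilder allFieldsInOneDocument(List<String> featureFields) {
        BoolQueryBuilder query = QueryBuilders.boolQuery();
        for (String field : featureFields) {
            query.must(QueryBuilders.existsQuery(field));
        }
        return query;
    }
    // Two documents in the same interval, one holding only {"cpu": 0.7} and
    // the other only {"mem": 0.4}, give both features data for that interval,
    // yet neither document matches the query above; hence the over-validation.
}
```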
Describe the solution you'd like
Possible solutions:
- In order to find an optimal interval, and to further identify whether a feature query is the root cause of sparse data, we can try multiple different intervals for each feature query the detector has.
  - This means we might find a different interval suggestion for each feature query, and we can then recommend the longest interval across all feature queries.
  - We then need to decide how to handle the case where some feature queries lead to a different interval recommendation while others don't have enough data no matter the interval (either change the response type so we can both provide an interval recommendation and name the problematic features, or simply report the features for which no interval was found).
- Add a sub-aggregation that looks for the feature fields inside each interval bucket of the date histogram aggregation that is currently implemented (see the sketch after this list).
  - This will need further performance testing, since it would mean running a sub-aggregation within up to 1,440 different interval buckets, and this call itself already occurs multiple times for different intervals.
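A minimal sketch of that second idea, assuming OpenSearch's Java aggregation builders; `timeField`, `featureFields`, and `intervalMinutes` are hypothetical parameters, not the plugin's actual code:

```java
import org.opensearch.index.query.QueryBuilders;
import org.opensearch.search.aggregations.AggregationBuilders;
import org.opensearch.search.aggregations.bucket.histogram.DateHistogramAggregationBuilder;
import org.opensearch.search.aggregations.bucket.histogram.DateHistogramInterval;

import java.util.List;

public class IntervalSuggestionSketch {

    // Build a date histogram over the detector's time field and, inside each
    // interval bucket, attach one exists-filter sub-aggregation per feature
    // field. Each bucket then reports, per feature, whether any document in
    // that interval carries the field, without requiring a single document to
    // carry all of them.
    static DateHistogramAggregationBuilder bucketsWithFeatureChecks(
            String timeField, List<String> featureFields, int intervalMinutes) {
        DateHistogramAggregationBuilder histogram = AggregationBuilders
            .dateHistogram("interval_buckets")
            .field(timeField)
            .fixedInterval(DateHistogramInterval.minutes(intervalMinutes));
        for (String field : featureFields) {
            histogram.subAggregation(
                AggregationBuilders.filter("exists_" + field, QueryBuilders.existsQuery(field)));
        }
        return histogram;
    }
}
```

The caller would run this for a series of candidate intervals, treat a feature as dense enough once most buckets contain it, and recommend the longest interval any feature needs; note that each added exists filter multiplies per-bucket work, which is why the 1,440-bucket case above needs benchmarking.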
This looks really interesting! From an end-user perspective, does this improve AD accuracy, performance, or something else?
@elfisher From the user's perspective, it means the Validation API is more accurate. In OpenSearch 1.3 we added a new Validation API that runs on the last step of creating an anomaly detector and validates whether the given configuration will likely create a detector that successfully initializes and completes model training. Users can also call the Validation API directly through the backend: https://opensearch.org/docs/latest/monitoring-plugins/ad/api/#validate-detector.
This enhancement will give the Validation API the ability to tell users whether a specific feature field is causing sparse data, with a call-out that they should probably change their feature field or expect potentially longer initialization times, or no initialization at all. Currently the Validation API doesn't take the specific feature fields fully into account, only the other configurations. In short, before creating a detector, users will be even better informed about any issues with their configuration.
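For reference, a direct backend call looks roughly like the sketch below, using OpenSearch's low-level Java REST client; the endpoint path follows the docs linked above, while the helper and the detector JSON body are hypothetical:

```java
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

import java.io.IOException;

public class ValidateDetectorSketch {

    // POSTs a detector configuration to the validate endpoint; the response
    // lists blocking configuration issues (and, with /_validate/model,
    // non-blocking ones such as interval recommendations).
    static Response validate(RestClient client, String detectorJson) throws IOException {
        Request request = new Request("POST", "/_plugins/_anomaly_detection/detectors/_validate/model");
        request.setJsonEntity(detectorJson);
        return client.performRequest(request);
    }
}
```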
Thanks for the clarification @amitgalitz! If you don't mind, can you create an issue in the doc repo to track this update for 2.1? Since this is an improvement to the API we should make sure we get it documented.
@amitgalitz should this be re-labeled as 2.3?
Good point, I'll actually remove the version label right now and discuss priority with Sean.
Sounds good - I'll set it as 2.3 tentatively.
Sent the PR to fix the issue: https://github.com/opensearch-project/anomaly-detection/pull/1258
My solution is a little different:
- When suggesting an interval, I use the cold start queries if features exist (users may not have defined features yet when invoking the suggest-interval API). If any single feature is missing, I regard the whole sample (which might include multiple features) as missing.
- When finding feature sparsity, I now use the feature aggregation instead of an exists query. I found people can write complicated queries in a feature, and the current logic can't reliably find the feature field name. Also, as you mentioned in the issue, not all fields might exist in the same documents, and users may define runtime fields too, which is even more complex.
- If a feature aggregation returns enough non-empty values for cold start, I assume it won't cause sparsity (see the sketch below).
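A rough sketch of that bucket-counting check, assuming OpenSearch's Java aggregation response types; `featureAggName` and `requiredSamples` are hypothetical stand-ins, and the actual logic lives in PR #1258:

```java
import org.opensearch.search.aggregations.bucket.histogram.Histogram;
import org.opensearch.search.aggregations.metrics.NumericMetricsAggregation;

import java.util.List;

public class FeatureSparsitySketch {

    // Nest the user's own feature aggregation under each date-histogram
    // bucket (see the earlier histogram sketch), then count how many buckets
    // produced a real value for the feature. Evaluating the aggregation
    // itself, instead of guessing a field name for an exists query, also
    // covers complicated feature queries and runtime fields.
    static boolean denseEnoughForColdStart(List<? extends Histogram.Bucket> buckets,
                                           String featureAggName, int requiredSamples) {
        int nonEmpty = 0;
        for (Histogram.Bucket bucket : buckets) {
            NumericMetricsAggregation.SingleValue value =
                bucket.getAggregations().get(featureAggName);
            // A bucket counts as a sample only if the feature aggregation
            // produced a finite number there (empty buckets yield NaN/Infinity).
            if (value != null && !Double.isNaN(value.value()) && !Double.isInfinite(value.value())) {
                nonEmpty++;
            }
        }
        return nonEmpty >= requiredSamples;
    }
}
```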
Closing the issue for now. @amitgalitz feel free to reopen if you find my PR needs improvement.