Detector can't show up due to shingle size limitation
Problem
We received a question from a user about why detectors don't show up after migrating from 1.0 to 1.2. When running this:
GET _opendistro/_anomaly_detection/detectors/_search
{
"size": 100,
"query": {
"match_all": {}
}
}
the following error is returned:
{
"error" : {
"root_cause" : [
{
"type" : "a_d_validation_exception",
"reason" : "Shingle size must be a positive integer no larger than 60. Got 72"
}
],
"type" : "a_d_validation_exception",
"reason" : "Shingle size must be a positive integer no larger than 60. Got 72"
},
"status" : 500
}
Root cause
We added a limitation on shingle size in 1.1 requiring it to be <= 60; see https://github.com/opensearch-project/anomaly-detection/pull/149/ and the AnomalyDetector.java change:
public boolean invalidShingleSizeRange(Integer shingleSizeToTest) {
return shingleSizeToTest != null && (shingleSizeToTest < 1 || shingleSizeToTest > AnomalyDetectorSettings.MAX_SHINGLE_SIZE);
}
Solution
We have several possible options to solve this:
- Ask the user to update the detector configuration, or delete the old detector and create a new one. This is not a good user experience.
- Add a dynamic setting for the shingle size limit, so users can tune it themselves.
- Relax or remove the shingle size limit. This needs less development effort than option 2, but it is unclear whether it brings any risk.
I prefer option 2: keep some limit on shingle size, but provide the flexibility to let users tune it.
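As a minimal sketch of what option 2 could look like, assuming a node-scoped dynamic setting (the setting key, default, and the extra parameter on the check below are illustrative assumptions, not the plugin's actual code):

import org.opensearch.common.settings.Setting;

// Hypothetical dynamic cluster setting for the maximum allowed shingle size.
public static final Setting<Integer> MAX_SHINGLE_SIZE = Setting.intSetting(
    "plugins.anomaly_detection.max_shingle_size",  // assumed key name
    60,  // default keeps the current recommendation
    1,   // minimum allowed value
    Setting.Property.NodeScope,
    Setting.Property.Dynamic);

// The existing check would then compare against the current setting value
// instead of the hard-coded AnomalyDetectorSettings.MAX_SHINGLE_SIZE constant.
public boolean invalidShingleSizeRange(Integer shingleSizeToTest, int maxShingleSize) {
    return shingleSizeToTest != null && (shingleSizeToTest < 1 || shingleSizeToTest > maxShingleSize);
}

Users could then raise the cap through the cluster settings API without a code change, while documentation and the UI warn that values above 60 may not yield the best results.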
Also related to this issue is a slight inconsistency we have had on the frontend since this PR was merged: https://github.com/opensearch-project/anomaly-detection-dashboards-plugin/pull/76. We tell users that we expect a shingle size between 1 and 60, yet the form lets them enter a shingle size greater than 60. That might be okay, but the wording should change to say that 1 to 60 is optimal while larger values are also allowed. With the validation API added in 1.2 we explicitly cap shingle size at 60, while still telling users "it's not okay". I'm adding this because whatever option we choose here will affect how we implement the form validation and messaging on the frontend.
@ylwu-amzn I agree making it a tunable setting may be acceptable as long as there's enough wording in documentation and UI that indicates >60 may not yield the best results. This at least provides a way out for customers w/o having to re-create detectors. It feels to me this shouldn't be a blocker value, since the detector may still be operational with values >60, just not ideal.
If we make it a tunable setting, are you suggesting we still keep it as a blocker or not? We can make it tunable to whatever value and keep it as a blocker at whatever it's set to, and either keep our suggestion of 60 or remove the suggestion; or we just remove the max number altogether if it's not a blocker setting and keep our suggestion at 60, since it doesn't make sense for a user to tune the setting to a max of, say, 4 or 1000000 and have that be our suggestion.
Not sure what will happen if the shingle size is too big, like exceeding 60. @sudiptoguha, do you have any suggestions? Should we block users from creating a detector with a big shingle size, or just show a warning that a big shingle size may cause issues (e.g., the model will be quite large and initialization will take much longer)?
Great questions. In absence of large system constraints, what resonates with me is "proceed but let users know that such a setting may not be an ideal way to accomplish what they may be seeking". If someone else knows what they are doing, I don't see value in being restrictive when we have incomplete information. Let's take the specific questions one at a time:
- Model size: This is actually less of an issue with RCF 2.0 and now 3.0 with internal shingling and the standard time decay setting. The time decay sets up a "refresh rate" of sampling recent data -- note that large shingle sizes do not affect that as much :) This is actually measurable, but conceptually, suppose the time decay was so large that we were doing sliding windows (which can be achieved in the current algorithm); then increasing the shingle size from 60 to 120 would be only an additive increase in the time window. This is not true for 1.0 models AFAIK, nor for models with external shingling (where a shingle size 120 model is twice the size of a shingle size 60 model), which may be an impetus for users to convert and benefit from lower size/cost. The AD plugin can look into whether the conversion subroutines are converting to internal shingling. And even for time decay 0 (fully random sampling), it would take a long time/a lot of data for the size issue to show up. Model size is the least of the concerns, from my point of view.
- Heap size: Heap usage will increase for larger shingle sizes, because the temporary structures created/annulled are twice as large. This is not too much of an issue because if someone uses a shingle size twice as large, then they have a need, and this increase is ephemeral (but see below).
- Running time: Execution will be twice as slow with larger shingle sizes (because the probability calculations, etc., are over vectors that are twice as large). The heap size issue will also come in based on the GC strategy (if GC is invoked at twice the rate!) -- but that gets too far into specifics of JVM implementations, which may be more opaque. This also is not too big a concern in my opinion -- if you need 120, you need 120.
- Dimensionality and information: This is the issue, in my opinion. If the phenomenon you're looking for requires 120 consecutive measurements, what is being sought? Did you need these 120, or can you perform downsampling/aggregations (see the downsampling sketch after this comment)? Anomalies are events that correspond to an intuitive interval of time -- but do we need the finest-grain measurement over larger intervals? Wavelet decompositions (transforms) may also be of interest to the user (they generalize aggregations). In the fullness of time we would include wavelet projections in RCF -- but the user should probably exhaust aggregations/transforms first. That being said, there is a use case for large shingle sizes -- if the data is indeed from a (conceptual) low-dimensional space (described by a few parameters), then the large shingle size would help bring that into focus for forecasting-type applications. For anomaly detection, large shingle size probably indicates a missing aggregation/transform step.
- Latency: Waiting for data is an issue -- a large shingle size means more data at the initial stages. Again, that may make sense for forecasting, but less so for anomaly detection. We probably want to make a (useful, but potentially 2-way) call early (and filter down subsequently, hopefully having fewer things to deal with).
cheers,
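To make the downsampling/aggregation point above concrete, here is a minimal, generic sketch (plain Java, not plugin code; the window size and helper are illustrative assumptions). Averaging fixed windows of the raw series lets a pattern that spans 120 raw points be represented by a much shorter shingle over a coarser interval:

// Downsample a raw series by averaging non-overlapping windows.
// With windowSize = 4, a pattern spanning 120 raw points needs only
// 30 coarser points, i.e. shingle size 30 instead of 120.
static double[] downsample(double[] raw, int windowSize) {
    double[] out = new double[raw.length / windowSize];
    for (int i = 0; i < out.length; i++) {
        double sum = 0;
        for (int j = 0; j < windowSize; j++) {
            sum += raw[i * windowSize + j];
        }
        out[i] = sum / windowSize;
    }
    return out;
}

In the plugin, this roughly corresponds to choosing a coarser detector interval or an aggregation in the feature query, rather than enlarging the shingle.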
Thanks @sudiptoguha for the detailed and deep analysis.
If I understand correctly, users risk increased heap usage, higher latency, and longer initialization time, and it seems not so reasonable to use a large shingle size for anomaly detection, as:
For anomaly detection, large shingle size probably indicates a missing aggregation/transform step.
But I agree with you:
If someone else knows what they are doing, I don't see value in being restrictive when we have incomplete information
Maybe we can remove the hard limitation on shingle size, but we should also add a detailed message telling users the impact of using a big shingle size and that they should aggregate/transform data first to reduce the shingle size if possible. @sudiptoguha @amitgalitz @ohltyler @kaituo, is this OK with you? Any concerns?
Yes (to your question/summary). I agree with removing the hard limitation with a detailed message re impact.
@ylwu-amzn @sudiptoguha We still have external shingling for single-stream detectors, so shingle size matters for the model memory size. For example, with external shingling, dimension = 60 and shingle size = 60, the memory size is 57 MB. Since we don't distribute models across multiple nodes, a huge model can put an unpredictable load on a single node. We should refactor and use internal shingling in the future; I will discuss the upgrade path with Sudipto.
Also, the current memory size formula works for shingle size <= 60. For an unlimited shingle size, I don't have a general formula.
I like "Add dynamic setting for shingle size limitation, so user can tune it by themselves." and give a warning anything above 60 can bring unknowns and possible instability on the cluster.
I do not think we should have numbers like 57MB without further discussion :)
- The external shingling is for people who have existing models -- but the shingleSize change applies to people who are creating new detectors. Is that correct? In that case, would it be fair to say that the people who are creating new detectors would not see numbers like 57 MB?
- If one changes com.amazon.randomcutforest.examples.serialization.ProtostuffExamplesWithShingles.java as follows:

int dimensions = 60;
int numberOfTrees = 50;
int sampleSize = 256;
Precision precision = Precision.FLOAT_32;
RandomCutForest forest = RandomCutForest.builder().compact(true).dimensions(dimensions)
    .numberOfTrees(numberOfTrees).sampleSize(sampleSize).precision(precision).shingleSize(dimensions)
    .build();
int count = 1;
int dataSize = 1000 * sampleSize;
for (double[] point : generateShingledData(dataSize, dimensions, 0)) {
    forest.update(point);
}
// Convert to an array of bytes and print the size
RandomCutForestMapper mapper = new RandomCutForestMapper();
mapper.setSaveExecutorContextEnabled(true);
mapper.setSaveTreeStateEnabled(true);
Schema<RandomCutForestState> schema = RuntimeSchema.getSchema(RandomCutForestState.class);
LinkedBuffer buffer = LinkedBuffer.allocate(512);
byte[] bytes;
try {
    RandomCutForestState state = mapper.toState(forest);
    bytes = ProtostuffIOUtil.toByteArray(state, schema, buffer);
} finally {
    buffer.clear();
}
System.out.printf("dimensions = %d, numberOfTrees = %d, sampleSize = %d, precision = %s%n",
    dimensions, numberOfTrees, sampleSize, precision);
System.out.printf("protostuff size = %d bytes%n", bytes.length);
Then the output is:
Picked up JAVA_TOOL_OPTIONS: -Dlog4j2.formatMsgNoLookups=true
dimensions = 60, numberOfTrees = 50, sampleSize = 256, precision = FLOAT_32
protostuff size = 258525 bytes
Looks good!
and further changing to "int dimensions = 120" one gets the following:
dimensions = 120, numberOfTrees = 50, sampleSize = 256, precision = FLOAT_32
protostuff size = 268228 bytes
Looks good!
Now (a) this is for 1D data (1-attribute data) shingled 60 or 120 times. Having 5 attributes would be only a 5-fold increase. (b) Protostuff may no longer be supported --- but the numbers are a far cry from 57 MB. And that is why folks should consider upgrading to the most recent versions.
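For rough context, a back-of-envelope using only the numbers above: 258525 bytes is about 0.26 MB, so a 5-fold increase for 5 attributes would be roughly 1.3 MB, still well below 57 MB.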
After discussing with Sudipto and Sean, we need to keep a cap on shingle size due to memory and long-latency issues. There is a code path triggering external shingling, so memory size is a concern. Our query cancellation is not fixed yet (https://github.com/opendistro-for-elasticsearch/anomaly-detection/issues/189), so long latency is a concern too.
Short-term solution: increase the maximum shingle size to 128. Long-term solution: refactor the single-stream detector to use internal shingling for new detectors while keeping external shingling for old detectors, and fix query cancellation.
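For reference, a minimal sketch of what internal shingling looks like at the RCF level (the parameter values below are illustrative assumptions, not the plugin's planned configuration). With internal shingling, the caller feeds unshingled points and the forest maintains the shingle itself, which is what keeps model growth modest as discussed above:

import com.amazon.randomcutforest.RandomCutForest;
import com.amazon.randomcutforest.config.Precision;

// Illustrative values only: 1 attribute, shingle size 60.
int inputDimensions = 1;
int shingleSize = 60;
RandomCutForest forest = RandomCutForest.builder()
    .dimensions(inputDimensions * shingleSize)  // total dimensions = attributes * shingle size
    .shingleSize(shingleSize)
    .internalShinglingEnabled(true)             // the forest maintains the shingle internally
    .numberOfTrees(50)
    .sampleSize(256)
    .precision(Precision.FLOAT_32)
    .build();

// Each update passes only the latest unshingled point.
forest.update(new double[] { 42.0 });

New single-stream detectors built this way would avoid the external-shingling memory blow-up, while old detectors keep their current behavior, as in the long-term plan above.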