Donut model doesn't fit CPU usage well, score is always 100%
Hello.
I've read the loudml documentation and some examples I've found, but I still don't understand how to create/train the model so that it fits my example metric, "usage_system cpu", from my server.
I've made a Grafana dashboard with 2 panels:
Panel 1:
- Real metric (green line)
- Predicted metric (orange dot)
- Upper bound (blue dashed line)
- Lower bound (purple dashed line)
Panel 2:
- Real Metric (green line)
- Predicted Metric (orange dot)
- Score (blue bar - right Y axis)
This is how it looks:
As you can see, the lower/upper bounds seem too close to each other, the score is always 100%, and the predicted values don't follow the shape of the real metric. In the last 12 hours the CPU shows a strong periodic component, so it looks easy to predict.
> list-models
linux_metrics_cpu_mean_usage_system_host_myserver_time_5m
> list-model-versions linux_metrics_cpu_mean_usage_system_host_myserver_time_5m
version active loss trained
00 0. 353.792 1.
01 0. 353.806 1.
02 1. 353.806 1.
> show-model linux_metrics_cpu_mean_usage_system_host_myserver_time_5m
- settings:
    bucket_interval: 5m
    default_bucket: myserver_linux
    features:
    - default: 0
      field: usage_system
      io: io
      match_all:
      - tag: host
        value: myserver
      measurement: cpu
      metric: mean
      name: mean_usage_system
    grace_period: 0
    interval: 5m
    max_evals: 12
    max_threshold: 90
    min_threshold: 90
    name: linux_metrics_cpu_mean_usage_system_host_myserver_time_5m
    offset: 10s
    run:
      flag_abnormal_data: true
      output_bucket: myserver_loudml
      save_output_data: true
    seasonality:
      daytime: false
      weekday: false
    span: 100
    type: donut
  training:
    job_id: 17b49b96-1c93-424b-9871-b7dd46737c6e
    progress:
      eval: 13
      max_evals: 12
    state: done
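One thing I notice in the config above: min_threshold and max_threshold are both 90. If I understand the docs, they act as a hysteresis on the anomaly score. A minimal sketch of my reading (my own code, not loudml's):

def flag_anomalies(scores, min_threshold=90, max_threshold=90):
    # An anomaly starts when the score rises to max_threshold or above,
    # and only ends once the score falls back below min_threshold.
    in_anomaly = False
    flags = []
    for score in scores:
        if not in_anomaly and score >= max_threshold:
            in_anomaly = True
        elif in_anomaly and score < min_threshold:
            in_anomaly = False
        flags.append(in_anomaly)
    return flags

# With both thresholds at 90, as in my config, every bucket scoring >= 90
# gets flagged, which matches what I see:
print(flag_anomalies([50, 95, 92, 80, 100]))  # [False, True, True, False, True]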
> list-scheduled-jobs
_eval(linux_metrics_cpu_mean_usage_system_host_myserver_time_5m)
> list-scheduled-jobs -a
- every:
    count: 300.0
    unit: seconds
  last_run_timestamp: 1596088721.625509
  method: post
  name: _eval(linux_metrics_cpu_mean_usage_system_host_myserver_time_5m)
  ok: true
  params:
    flag_abnormal_data: true
    from: now-310s
    output_bucket: myserver_loudml
    save_output_data: true
    to: now-10s
  relative_url: /models/linux_metrics_cpu_mean_usage_system_host_myserver_time_5m/_eval
  status_code: 200
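So, if I read the output above correctly, the scheduled job boils down to an HTTP call like this (my reconstruction in Python from the method/relative_url/params shown above; the host and port are assumptions, adjust to wherever loudmld listens):

import requests

LOUDML = "http://localhost:8077"  # assumption
MODEL = "linux_metrics_cpu_mean_usage_system_host_myserver_time_5m"

# Same call the scheduler issues every 300 seconds:
resp = requests.post(
    f"{LOUDML}/models/{MODEL}/_eval",
    params={
        "from": "now-310s",
        "to": "now-10s",
        "flag_abnormal_data": "true",
        "save_output_data": "true",
        "output_bucket": "myserver_loudml",
    },
)
print(resp.status_code, resp.text)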
- Any idea on how to configure the model to best fit the real metric?
- What exactly does "loss" mean in the output of the list-model-versions command? Is a greater or a lower value better?
- Can I train the model again without stopping the current scheduled job? How can I switch to the newly trained parameters online?
- How can I make the model more "flexible", i.e. widen the lower/upper difference so that more of the real data fits inside it?
- In the docs (https://loudml.io/en/loudml/reference/current/_evaluate.html) the lower/upper values are fixed at 99.7% confidence. Is there any way to change that, and should I?

At least, is there any example showing how to play with this kind of fine tuning?
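(For context on that 99.7% figure: it is the probability mass within ±3 standard deviations of a normal distribution, so I assume the band is mean ± 3σ. That's my reading of the number, not something I found in the loudml source.)

from scipy.stats import norm

# Mass of a normal distribution within +/- 3 sigma of the mean:
print(norm.cdf(3) - norm.cdf(-3))  # ~0.9973, i.e. the 99.7% in the docs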
Thank you very much to everybody!
Another example showing how different the prediction is from the real data (trained with 1 week of data).
Zooming in and showing the computed "score" (blue bars, with a secondary Y axis), only 3 predicted values fall below 70%.
Is there any way to tune the score's "sensitivity"?
Thank you very much!!!
I've scheduled a daily job to retrain the models, and this morning one of them improved its "loss":
> list-model-versions swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@1m
version active loss trained
00 0. 116.357 1.
01 0. 116.357 1.
02 1. 63.588 1.
The score is still too high to use this as an anomaly detector.
I've also noticed that a trained version doesn't record the training period, how long that period was, or how long the training took to finish; this information could help users improve future trainings.
> list-model-versions swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@1m
version active loss trained
00 0. 116.357 1. <--- trained on 7 days
01 0. 116.357 1. <--- trained on 7 days
02 1. 63.588 1. <--- trained on only 1 day
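In the meantime my workaround is to wrap the retrain call and log the window myself. A rough sketch, assuming a _train endpoint analogous to the _eval one above (please correct me if the actual API differs):

import requests

LOUDML = "http://localhost:8077"  # assumption: local loudmld

def retrain(model, window="now-7d"):
    # Kick off a retrain and record the training window ourselves,
    # since list-model-versions doesn't keep it.
    resp = requests.post(
        f"{LOUDML}/models/{model}/_train",  # assumed endpoint
        params={"from": window, "to": "now"},
    )
    # Training runs as a job (see the job_id in show-model), so measuring
    # its duration would mean polling that job; here we only log the window.
    print(f"{model}: trained on {window}..now -> {resp.status_code} {resp.text}")

retrain("swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@1m")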
Hi @regel, here is a new example of the confusing "score". In this example the real data is clearly inside the confidence margin (upper/lower), yet the model still mostly computes a 100% score and marks the data as anomalous. IMHO Donut does not seem to score the data well.
- real data in the first panel
- queried data (10m mean) as a green line in the second/third panels
- predicted data as an orange dot in the second/third panels
- confidence margin in light blue in the second panel
- score (left axis) in the third panel
Could you help me understand this scoring and anomaly detection behaviour? Is there any way to tune the "sensitivity" of this scoring system?
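To make the question concrete, this is the kind of score I would expect from a model whose band is mean ± 3σ: near 0 in the middle of the band and only reaching 100 at its edges. This is my own sketch of a band-consistent score, not loudml's actual formula:

def band_score(value, lower, upper):
    # 0 at the band center, 100 at (or beyond) the band edges.
    mid = (upper + lower) / 2.0
    half_width = (upper - lower) / 2.0
    if half_width <= 0:
        return 100.0
    return min(100.0, 100.0 * abs(value - mid) / half_width)

# A value well inside the confidence margin should score low:
print(band_score(42.0, lower=30.0, upper=60.0))  # 20.0, not the ~100% I see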
Thank you very much