
Donut model doesn't fit CPU usage well, score always 100%

toni-moreno opened this issue 3 years ago • 3 comments

Hello.

I've read the loudml documentation and some examples I've found, and I still cannot understand how I should create/train the model to properly fit my example metric "usage_system cpu" from my server.

I've made a Grafana dashboard with two panels.

Panel 1:

  • Real metric (green line)
  • Predicted metric (orange dot)
  • Upper bound (blue dashed line)
  • Lower bound (purple dashed line)

Panel 2:

  • Real Metric (green line)
  • Predicted Metric (orange dot)
  • Score (blue bar - right Y axis)

This is how it looks:

image

As you can see, the lower and upper bounds seem too close to each other, the score is always 100%, and the predicted values don't follow the shape of the real metric.

As you can see, over the last 12 hours the CPU usage has a strong periodic component, which seems easy to predict.

image

> list-models
linux_metrics_cpu_mean_usage_system_host_myserver_time_5m
> list-model-versions linux_metrics_cpu_mean_usage_system_host_myserver_time_5m
version  active   loss     trained  
00       0.       353.792  1.       
01       0.       353.806  1.       
02       1.       353.806  1.       
> show-model linux_metrics_cpu_mean_usage_system_host_myserver_time_5m
- settings:
    bucket_interval: 5m
    default_bucket: myserver_linux
    features:
    - default: 0
      field: usage_system
      io: io
      match_all:
      - tag: host
        value: myserver
      measurement: cpu
      metric: mean
      name: mean_usage_system
    grace_period: 0
    interval: 5m
    max_evals: 12
    max_threshold: 90
    min_threshold: 90
    name: linux_metrics_cpu_mean_usage_system_host_myserver_time_5m
    offset: 10s
    run:
      flag_abnormal_data: true
      output_bucket: myserver_loudml
      save_output_data: true
    seasonality:
      daytime: false
      weekday: false
    span: 100
    type: donut
  training:
    job_id: 17b49b96-1c93-424b-9871-b7dd46737c6e
    progress:
      eval: 13
      max_evals: 12
    state: done
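
For reference, these are the settings I'm planning to experiment with next. The fields all exist in the show-model output above; the values are only my own guesses, not verified improvements:

seasonality:
  daytime: true    # the metric clearly has a daily pattern
  weekday: true
span: 288          # 288 x 5m buckets = a 24h window, instead of ~8h with span 100
max_evals: 24      # allow the hyperparameter search more evaluations
min_threshold: 75  # relax the current 90/90 anomaly thresholds
max_threshold: 90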
> list-scheduled-jobs
_eval(linux_metrics_cpu_mean_usage_system_host_myserver_time_5m)
> list-scheduled-jobs -a
- every:
    count: 300.0
    unit: seconds
  last_run_timestamp: 1596088721.625509
  method: post
  name: _eval(linux_metrics_cpu_mean_usage_system_host_myserver_time_5m)
  ok: true
  params:
    flag_abnormal_data: true
    from: now-310s
    output_bucket: myserver_loudml
    save_output_data: true
    to: now-10s
  relative_url: /models/linux_metrics_cpu_mean_usage_system_host_myserver_time_5m/_eval
  status_code: 200
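
To debug this outside the scheduler, I reproduce the same _eval call by hand. This is only a minimal sketch: the method, path, and params are copied verbatim from the list-scheduled-jobs -a output above, while the host and port are assumptions for my local setup:

import requests

MODEL = "linux_metrics_cpu_mean_usage_system_host_myserver_time_5m"

# Same POST the scheduler issues every 300s, over the now-310s..now-10s window.
resp = requests.post(
    "http://localhost:8077/models/%s/_eval" % MODEL,  # host/port assumed
    params={
        "from": "now-310s",
        "to": "now-10s",
        "flag_abnormal_data": "true",
        "save_output_data": "true",
        "output_bucket": "myserver_loudml",
    },
)
print(resp.status_code, resp.text)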

  • Any idea how to configure the model to best fit the real metric?
  • What exactly does "loss" mean in the output of the list-model-versions command? Is a greater or lower value better?
  • Can I train the model again without stopping the current scheduled job? How can I switch to the newly trained parameters online?
  • How can I make the model more "flexible", i.e. widen the lower/upper gap so that more of the real data fits inside it?
  • In the docs (https://loudml.io/en/loudml/reference/current/_evaluate.html) the lower/upper values are fixed to a 99.7% confidence interval. Is there any way to change that, and should I if I could? (My understanding of these bounds is sketched right after this list.)
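
For what it's worth, this is how I understand the fixed 99.7% confidence bounds (a minimal sketch based on the three-sigma rule; the score normalization is purely my own guess, not taken from the loudml source):

def bounds(mu, sigma, k=3.0):
    # Three-sigma rule: mu +/- 3*sigma covers ~99.7% of a normal
    # distribution, matching the fixed confidence mentioned in the docs.
    return mu - k * sigma, mu + k * sigma

def score(x, mu, sigma, k=3.0):
    # Hypothetical normalization (my guess): 100% when the observation
    # reaches the k-sigma boundary, proportionally lower inside the band.
    if sigma == 0:
        return 100.0
    return min(100.0, abs(x - mu) / (k * sigma) * 100.0)

print(bounds(40.0, 2.5))       # (32.5, 47.5)
print(score(41.0, 40.0, 2.5))  # ~13.3 -> well inside the band

With a scoring like this, points inside the band should score well below 100%, which is why the constant 100% I'm seeing confuses me.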

At least, is there any example showing how to play with this kind of fine tuning?

Thank you very much to everybody!

toni-moreno avatar Jul 30 '20 06:07 toni-moreno

Another example showing how different the prediction is from the real data (trained with 1 week of data).

image

If I zoom in and show the computed "score" (blue bars, secondary Y axis), only 3 predicted values are below 70%.

image

Is there any way to tune the score "sensitivity"?

Thank you very much!!!

toni-moreno avatar Aug 05 '20 18:08 toni-moreno

I've scheduled a daily job to retrain the models; this morning one of them improved its "loss":

image

> list-model-versions swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@1m
version  active   loss     trained  
00       0.       116.357  1.       
01       0.       116.357  1.       
02       1.       63.588   1.       

The score is still too high to use this as an anomaly detector.

image

I've noticed that a trained version doesn't record the period it was trained on, how long that period was, or how long the training took. This information could help users improve future trainings (see the mock-up after the listing below).

> list-model-versions swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@1m
version  active   loss     trained  
00       0.       116.357  1.       <-- trained on 7 days of data
01       0.       116.357  1.       <-- trained on 7 days of data
02       1.       63.588   1.       <-- trained on only 1 day of data
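
Something like this is what I have in mind (just a mock-up; the extra columns and dates are invented):

> list-model-versions swarm@cpu@mean@usage_active@host_worker2_cpu_cpu-total@time@1m
version  active   loss     trained  trained_from  trained_to  took
00       0.       116.357  1.       2020-07-29    2020-08-05  ...
01       0.       116.357  1.       2020-07-29    2020-08-05  ...
02       1.       63.588   1.       2020-08-05    2020-08-06  ...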

toni-moreno avatar Aug 06 '20 06:08 toni-moreno

Hi @regel, here is a new example of the confusing "score". In this example the real data is clearly inside the confidence margin (upper/lower), but the score is still mostly 100% and the data is flagged as an anomaly. IMHO Donut does not seem to be scoring the data correctly.

image

  • real data in the first panel
  • queried data (10m mean) as a green line in the second/third panels
  • predicted data as an orange dot in the second/third panels
  • confidence margin in light blue in the second panel
  • score (left axis) in the third panel

Could you help me understand this behaviour of the scoring and anomaly detection? Is there any way to tune the "sensitivity" of this scoring system?

Thank you very much

toni-moreno avatar Aug 19 '20 12:08 toni-moreno