Applying single model to InfluxDB measurement with multiple series

Open voiprodrigo opened this issue 6 years ago • 8 comments

Hi,

There is often a need to detect anomalies across multiple series of the same measurement, for example the bandwidth utilisation of every port on a network switch, where the switch name may itself be a tag. From what I understood from the docs, you can only define tag k/v pairs as filters to select a specific series. If LoudML could train a model for each series of the measurement based on a configured set of tags (e.g. all values of tags x and y, or an array of tag values), and output accordingly, I think it would be a great addition.

Thanks.

voiprodrigo avatar Jul 04 '18 01:07 voiprodrigo

Hi Rodrigo,

thanks a lot for this proposal.

If we make the following additions, will it match your requirements?

loudml train --tags tag1,tag2,....

  1. add a --tags option (or -T) to the loudml train command in the CLI
  2. then, use all possible key/value pairs for these tags to train a single model
  3. apply this single model to live data, with extra options on the CLI predict command to select given key/value pairs
  4. and finally, tag the output data points (prediction_*) with the same tag names and values as the original series

In Chronograf, this training behaviour would be triggered automatically by using the 1click ML feature and clicking one or more "Group By" buttons, in blue: tags

Let us know your comments.

regel avatar Aug 16 '18 13:08 regel

Hi Sebastien,

Looks like a sound approach, and I like the Group By, but in a dynamic environment I don’t want to be concerned with setting explicit tag value filters. Those are of course very useful in many cases, but it depends on the objective. For example, nodes can come and go, so the values of the node tag will change over time. At the same time, I would be interested in a separate prediction/forecast for each node, on a single field key. That would be a group by without any WHERE clause on the tags. From your outline, I’m not sure if that would be considered one or multiple models (from a licensing perspective).

Thanks!

voiprodrigo avatar Aug 20 '18 21:08 voiprodrigo

I think I have a use case similar to the one @voiprodrigo describes. I'm storing values with different tags under the same measurement, for instance: job_id, locale, error_count (value). In this case, training a model for the entire measurement is not ideal, because each job_id behaves independently of the rest and could introduce deviations.

Using the tag filters would mean manually specifying each job_id-locale combination and training a different model for each (1 per combination).

What I would like to have is the option to group each job_id-locale combination without manually specifying the values in the tag section.

I guess that extending the previous proposal to allow specifying several tag keys at the same time, generating the model and tagging the output data points with the tag combination, would solve this case. Something like --tags=job_id,locale?
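To make the grouping idea concrete, here is a minimal Python sketch of splitting points by every combination of the requested tag keys, without listing any tag values up front. It is not LoudML code: the `group_by_tags` helper and the point layout (a dict with a `tags` mapping) are assumptions made only for illustration.

```python
from collections import defaultdict


def group_by_tags(points, tag_keys):
    """Group points by the tuple of values they carry for tag_keys.

    Hypothetical helper: the combinations are discovered from the data,
    so new job_id or locale values show up as new groups automatically.
    """
    groups = defaultdict(list)
    for point in points:
        key = tuple(point["tags"].get(k) for k in tag_keys)
        groups[key].append(point)
    return dict(groups)


# Made-up sample points for the job_id/locale example.
points = [
    {"tags": {"job_id": "a", "locale": "en"}, "error_count": 3},
    {"tags": {"job_id": "a", "locale": "fr"}, "error_count": 1},
    {"tags": {"job_id": "a", "locale": "en"}, "error_count": 5},
    {"tags": {"job_id": "b", "locale": "en"}, "error_count": 0},
]

groups = group_by_tags(points, ["job_id", "locale"])
for key, members in groups.items():
    print(key, len(members))
```

Each key of `groups` is one job_id-locale combination, which is the unit a per-combination model or an output tag set would be attached to.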

jorgelbg avatar Aug 21 '18 12:08 jorgelbg

Sounds like we need to have:

  • A wildcard * capability
  • and also the ability to override the default prediction_{{model_name}} measurement with something else, defined by the user

The training part is more complex. Training a single model for distinct series (with distinct tags) assumes they all have more or less the same pattern. Do you already tag series according to "expected pattern type", or should this be discovered dynamically?

regel avatar Aug 22 '18 13:08 regel

Being able to override the measurement would be great, even more so if it's possible to interpolate the values of the tags into the name (or concatenate them at the end). Like: prediction_avg_error_count_{{job_id}}_{{locale}} (for my case).
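A minimal sketch of what such interpolation could look like, in plain Python. This is not an existing LoudML feature: the `render_measurement` function is hypothetical, and the only thing taken from the thread is the `{{name}}` placeholder syntax.

```python
import re

# Matches {{name}} placeholders, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_measurement(template, tags):
    """Replace each {{tag}} placeholder with the matching tag value.

    Hypothetical helper: raises KeyError when the template names a tag
    that the series does not carry, rather than writing a partial name.
    """
    def substitute(match):
        name = match.group(1)
        if name not in tags:
            raise KeyError(f"no value for tag {name!r}")
        return str(tags[name])

    return PLACEHOLDER.sub(substitute, template)


name = render_measurement(
    "prediction_avg_error_count_{{job_id}}_{{locale}}",
    {"job_id": "42", "locale": "en"},
)
print(name)
```

Failing loudly on a missing tag is a design choice for the sketch; silently leaving the placeholder in place would create oddly named measurements that are hard to spot later.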

Considering that one single model assumes a similar pattern/behaviour, perhaps generating N dynamic models (one per combination of tags) could be a better fit? So if you say something like --tags=job_id,locale then 1 model is going to be generated for each combination of these tags. Ideally this would be reported as a single model, although underneath there are several models that are evaluated individually.

But to be honest, even if each model is generated individually it would be OK, because the model could have the match_all section already configured and would know how to filter the measurement.
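The "one model per tag combination" idea can be sketched as follows. The per-group "model" here is only a mean and standard deviation used as a stand-in; it is not how LoudML trains or scores, and the function names, the point layout and the threshold of 3 standard deviations are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def train_per_group(points, tag_keys, field):
    """Fit one stand-in model (mean, std deviation) per tag combination."""
    values = defaultdict(list)
    for point in points:
        key = tuple(point["tags"][k] for k in tag_keys)
        values[key].append(point[field])
    return {key: (mean(v), pstdev(v)) for key, v in values.items()}


def is_anomaly(models, tags, tag_keys, value, threshold=3.0):
    """Score a value against the model of its own tag combination only."""
    mu, sigma = models[tuple(tags[k] for k in tag_keys)]
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Made-up data: job "a" is quiet, job "b" is normally busy.
points = (
    [{"tags": {"job_id": "a"}, "error_count": v} for v in (10, 12, 11, 9)]
    + [{"tags": {"job_id": "b"}, "error_count": v} for v in (100, 110, 90, 105)]
)

models = train_per_group(points, ["job_id"], "error_count")
print(is_anomaly(models, {"job_id": "a"}, ["job_id"], 100))
print(is_anomaly(models, {"job_id": "b"}, ["job_id"], 100))
```

The same value of 100 is unusual for job "a" and ordinary for job "b", which is the reason a single model over the whole measurement fits this use case poorly.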

jorgelbg avatar Aug 22 '18 13:08 jorgelbg

It's a complex one. Trying to list what will be needed:

  • [ ] Having a custom measurement name, rather than the default prediction_* name
  • [ ] Using template {} values in this measurement name, e.g. {model}, and so on.
  • [x] Tagging the output measurement with tags. We have, at least, a partial solution implemented in 1.4.0
  • [ ] Model templating, with wildcard capabilities
  • [ ] Training, and therefore inference, for specific tag values

regel avatar Aug 28 '18 09:08 regel

@regel Hi, my measurements contain different values that are identified by a specific tag. For example, I have a measurement kwh with a tag _id that identifies the device that sent those values. So to predict the consumption of a specific device, I have to train the model by specifying that it should query only values with that specific tag, _id=xxxx. Is this feature already available, or is it still a work in progress?

robertsLando avatar Jan 22 '20 12:01 robertsLando

I really wonder why LoudML chose the bucket option for the data, when a database query would have been the KISS solution. Time series databases are great for combining data, and now I cannot have that.

For example, I have a customer CH with multiple (3) network interfaces in an active-active-standby configuration. I am not interested in the traffic of a single interface but in the total traffic of customer CH.

In Grafana/InfluxDB I'd do this with the query: select sum(ifInOctets) as "total ifInOctets" from interfaceTraffic where ifName=~/^CH.*/ and time > now()-4h GROUP BY time fill(null)

Unfortunately I cannot enter that query into LoudML; LoudML seems to limit data mining unnecessarily. From LoudML's perspective that query returns a single feature, one value per time bucket.
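For readers less familiar with InfluxQL, here is a small Python sketch of what such a query computes: the sum of a field over all series whose tag matches a pattern, with one value per time bucket. The `sum_by_bucket` helper, the point layout and the 60-second bucket width are assumptions for illustration (the query above does not state an interval), and empty buckets are simply omitted rather than filled.

```python
import re
from collections import defaultdict


def sum_by_bucket(points, name_pattern, bucket_seconds):
    """Sum ifInOctets per time bucket over interfaces matching a regex.

    Hypothetical helper: 'time' is epoch seconds, and each bucket is
    identified by its start time.
    """
    pattern = re.compile(name_pattern)
    totals = defaultdict(int)
    for point in points:
        if pattern.search(point["ifName"]):
            bucket = point["time"] - point["time"] % bucket_seconds
            totals[bucket] += point["ifInOctets"]
    return dict(sorted(totals.items()))


# Made-up samples: two CH interfaces and one unrelated interface.
samples = [
    {"time": 0, "ifName": "CH-eth0", "ifInOctets": 100},
    {"time": 30, "ifName": "CH-eth1", "ifInOctets": 50},
    {"time": 60, "ifName": "CH-eth0", "ifInOctets": 70},
    {"time": 60, "ifName": "XX-eth0", "ifInOctets": 999},
]

totals = sum_by_bucket(samples, r"^CH.*", 60)
print(totals)
```

The result is a single series, one number per bucket, which is the shape a single-feature model would consume.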

So in my mind: ditch the buckets and the bucket configuration. Allow direct queries, as in the LoudML Chronograf data explorer, where you can craft your query to be 100% exact for the data you want, and push that to the LoudML engine.

joriws avatar Mar 31 '21 07:03 joriws