Add benchmarking dataset with labelled anomalies for scoring performance of detector algorithms

Open halvgaard opened this issue 4 years ago • 12 comments

Do you know of any (open source) datasets at DHI that have labelled anomalies that we can use for testing? @ecomodeller @laurafroelich @akfDHI
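
For context, here is a minimal sketch of how a labelled dataset could be used to score a detector. It assumes the labels and the detector output are both boolean lists of equal length (one flag per data point); the function name `score_detector` and the example data are made up for illustration and are not part of the package.

```python
def score_detector(true_labels, predicted_labels):
    """Compare detector output with labelled anomalies, point by point."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # anomalies that were found
    fp = sum(1 for t, p in pairs if not t and p)    # false alarms
    fn = sum(1 for t, p in pairs if t and not p)    # anomalies that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: two labelled anomalies, detector finds one and raises one false alarm.
truth = [False, False, True, False, True, False]
predicted = [False, True, True, False, False, False]
scores = score_detector(truth, predicted)
print(scores)
```

Since anomalies are rare, precision and recall are likely more informative here than plain accuracy, which would look high even for a detector that never flags anything.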

halvgaard avatar Jan 20 '21 08:01 halvgaard

@ecomodeller I found some datasets with labelled anomalies here: https://github.com/numenta/NAB There are very few labels, but I guess that is to be expected with anomalies.

halvgaard avatar Jan 25 '21 08:01 halvgaard

@rhaDHI Have you checked the license for that repo? It seems to be quite strict and copyleft, so if we want to use material from the numenta/NAB repo, we would need to change our license to the same one (AGPL-3.0 License), as far as I can tell. What do you think? If I am right, making our repo AGPL would then imply that anyone using our repo would also have to make their work AGPL. Maybe not what we want?

laurafroelich avatar Jan 25 '21 08:01 laurafroelich

I don't know of any open datasets at DHI that we can use. We have to ask around and see if someone has an annotated dataset they are willing to share. There is a lot of data, but not much of it has labels, and probably even less of it is public, unfortunately.

ecomodeller avatar Jan 25 '21 08:01 ecomodeller

I will try to ask around on DHI's Yammer for labelled datasets with anomalies. @ecomodeller Do you have labels for the DMI dataset we have in the repo? Otherwise I will try to label the obvious ones with the algorithms, e.g. anomaly 1

halvgaard avatar Jan 28 '21 19:01 halvgaard

@laurafroelich @ecomodeller @akfDHI What do you think of this message for posting on Yammer:

We are trying to establish best practices and automated ways of identifying anomalies/outliers in time series data. Please let us know if you:

  • have a dataset that needs to be cleaned automatically
  • have algorithms for detecting outliers lying around in your head or in actual code
  • have a dataset, ideally publicly available, with labelled anomalies, i.e. an exact indication of which data points are actually anomalies.

Currently we are working on algorithms based on everything from simple range checks to machine learning models. Check out and potentially contribute to our open source anomaly detection Python package on DHI's GitHub here: https://github.com/DHI/anomalydetection

halvgaard avatar Jan 28 '21 20:01 halvgaard

Sounds good to me :)

laurafroelich avatar Jan 29 '21 05:01 laurafroelich

Can we make an interactive application to assist the labelling process?

  1. Upload data
  2. Automatic labelling of obvious outliers with a simple detector
  3. Manually add / remove labels by clicking on the chart
  4. Save the labelled time series in a reusable format, e.g. CSV
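
A rough, non-interactive sketch of these four steps, using only the Python standard library. It assumes a plain range check as the "simple detector" and stands in for the chart click with a function that flips one label; all function names, column names, thresholds and data below are hypothetical and only meant to illustrate the workflow.

```python
import csv
import io


def label_obvious_outliers(values, lower, upper):
    """Step 2: flag values outside [lower, upper] (a simple range check)."""
    return [not (lower <= v <= upper) for v in values]


def toggle_label(labels, index):
    """Step 3: stand-in for a click on the chart; flips one label."""
    updated = list(labels)
    updated[index] = not updated[index]
    return updated


def to_csv(timestamps, values, labels):
    """Step 4: serialise the labelled series as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["time", "value", "is_anomaly"])
    for t, v, flag in zip(timestamps, values, labels):
        writer.writerow([t, v, int(flag)])
    return buffer.getvalue()


# Step 1: "upload" a tiny made-up series (illustrative data only).
timestamps = ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
values = [1.2, 1.3, 99.0, 1.1]

labels = label_obvious_outliers(values, lower=0.0, upper=10.0)
labels = toggle_label(labels, 3)  # a reviewer marks the last point by hand
csv_text = to_csv(timestamps, values, labels)
print(csv_text)
```

Keeping the labels as a separate 0/1 column next to the raw values means the same file can later be used both for training a detector and for scoring it.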

ecomodeller avatar Jan 29 '21 07:01 ecomodeller

Sounds good to me too. Which Yammer channel?


akfDHI avatar Jan 29 '21 07:01 akfDHI

@ecomodeller There is one open source tool here: https://trainset.geocene.com/

halvgaard avatar Jan 29 '21 08:01 halvgaard

@ecomodeller Is this relevant: http://www.marineinsitu.eu/dashboard ?

halvgaard avatar Feb 02 '21 11:02 halvgaard

We got a labelled dataset from an actual DHI case based on groundwater measurements. Unfortunately, the dataset cannot be published on GitHub.

halvgaard avatar Feb 25 '21 11:02 halvgaard

Can we make an interactive application to assist the labelling process?

  1. Upload data
  2. Automatic labelling of obvious outliers with a simple detector
  3. Manually add / remove labels by clicking on the chart
  4. Save the labelled time series in a reusable format, e.g. CSV

Please note that we now have an interactive application for labelling outliers and training a detector.

ecomodeller avatar Feb 23 '23 15:02 ecomodeller