Matt Bartos comments

Results: 100 comments by Matt Bartos

insert_point is slow

Hi @atthom, Thanks for taking a look at this. The points don't need to be in every single tree as long as you make sure you're averaging codisp properly. Ultimately,...
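
A minimal sketch of what averaging codisp "properly" could look like when a point is present in only some trees: divide each point's total by the number of trees that actually contain it, not by the forest size. Plain dictionaries stand in for per-tree scores here (with rrcf they would presumably be collected via `tree.codisp(leaf)` for each leaf in `tree.leaves`); the helper name `average_codisp` and the sample values are hypothetical.

```python
from collections import defaultdict


def average_codisp(per_tree_scores):
    """Average each point's CoDisp over only the trees that contain it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in per_tree_scores:
        for index, codisp in scores.items():
            totals[index] += codisp
            counts[index] += 1
    return {index: totals[index] / counts[index] for index in totals}


# Hypothetical per-tree scores: point 2 is absent from the first tree,
# and point 0 is absent from the last one.
per_tree_scores = [
    {0: 1.0, 1: 3.0},
    {0: 2.0, 1: 5.0, 2: 8.0},
    {1: 4.0, 2: 10.0},
]
avg = average_codisp(per_tree_scores)
```

Dividing by the total number of trees instead would bias the score downward for points that appear in fewer trees.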

insert_point is slow

Sounds good. If you want to contribute any parallelization code, feel free to submit a pull request.

Dealing with data-stream of constant values during a certain period

I do not think the algorithm is well-defined for the case where all points are exactly identical, because you cannot partition the point set. https://klabum.github.io/rrcf/tree-construction.html In this case, you would...
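
To illustrate why identical points cannot be partitioned: a random cut is drawn along a dimension chosen in proportion to the bounding-box extent in that dimension, so when every extent is zero there is nothing to draw from. The sketch below is a standalone check, not code from rrcf, and the function names are hypothetical.

```python
def bounding_box_ranges(points):
    """Return the per-dimension extent (max - min) of a set of points."""
    dims = range(len(points[0]))
    return [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]


def can_partition(points):
    """A random cut needs at least one dimension with nonzero extent."""
    return sum(bounding_box_ranges(points)) > 0


constant = [(1.0, 2.0)] * 5                      # all points identical
varied = [(1.0, 2.0), (1.0, 2.5), (1.0, 2.0)]    # differs in one dimension
```

A check like `can_partition` could be used to detect a constant stretch of a data stream before attempting to build a tree from it.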

Streaming Data - calling a function when anomaly is detected

Ultimately you will need some kind of threshold test on CoDisp that will be application-dependent. Using a percentile score is a pretty reliable approach. To answer the second part, I...
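
One possible shape for a percentile-based threshold test that calls a function when a score is flagged, written as a rough sketch. The names `percentile`, `monitor`, and `on_anomaly`, the 95th-percentile cutoff, and the warm-up length are all assumptions for illustration; the right threshold depends on the application.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def monitor(scores, q=95, warmup=10, on_anomaly=print):
    """Call on_anomaly(i, score) when a score exceeds the q-th
    percentile of all scores seen before it."""
    history = []
    for i, score in enumerate(scores):
        if len(history) >= warmup and score > percentile(history, q):
            on_anomaly(i, score)
        history.append(score)


# Hypothetical CoDisp scores with one obvious spike at position 11
scores = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 1.0, 1.2, 1.1, 9.5, 1.0]
flagged = []
monitor(scores, on_anomaly=lambda i, s: flagged.append((i, s)))
```

Here the callback only records the flagged point; in a streaming setting it could send an alert or write to a log instead.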

how to scale rrcf to detect thousands of time series

Here are a few suggestions: - Instead of shingling, I would recommend computing summary statistics that capture the type of anomaly you are looking for. This will reduce the dimension...
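
As a sketch of the summary-statistics idea, each window of a series can be reduced to a few features instead of one dimension per sample. The choice of mean, standard deviation, minimum, and maximum is an assumption made here for illustration; the statistics should be picked to match the type of anomaly being sought.

```python
import statistics


def window_features(window):
    """Summarize a window of samples as (mean, std, min, max)."""
    return (
        statistics.mean(window),
        statistics.pstdev(window),
        min(window),
        max(window),
    )


def summarize(series, window_size):
    """Yield one feature tuple per non-overlapping window."""
    for start in range(0, len(series) - window_size + 1, window_size):
        yield window_features(series[start:start + window_size])


series = [1.0, 1.0, 1.0, 1.0, 2.0, 4.0, 2.0, 4.0]
features = list(summarize(series, window_size=4))
```

With this layout a window of any length maps to a four-dimensional point, whereas a shingle of the same window would have one dimension per sample.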

The problem of RRCF training data to get the model

Yes. In this case you would: - Construct a forest from a fixed training set - For each new point in the data stream: - Insert the new point into...

The problem of RRCF training data to get the model

This should work: ## Train model (same example as in README) ```python import numpy as np import pandas as pd import rrcf # Set parameters np.random.seed(0) n = 2010 d...

The problem of RRCF training data to get the model

If you want to use shingles, each point inserted into the tree should be of the form: `[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]`...
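
The layout above (all time steps of `x_1`, then all time steps of `x_2`, and so on) can be built roughly as follows. The helper `make_shingle` and the sample data are hypothetical, and the sketch assumes each variable is stored as its own list of samples.

```python
def make_shingle(series, start, n):
    """Flatten n consecutive samples of each variable, variable by variable.

    series[k] holds the samples of variable x_(k+1); the result has the form
    [x_1(t_1), ..., x_1(t_n), x_2(t_1), ..., x_2(t_n), ...].
    """
    shingle = []
    for variable in series:
        shingle.extend(variable[start:start + n])
    return shingle


x1 = [10, 11, 12, 13]
x2 = [20, 21, 22, 23]

# Two consecutive shingles of length n=3 over m=2 variables
points = [make_shingle([x1, x2], start, n=3) for start in range(2)]
```

Each resulting point has `m * n` entries, so the dimension grows with both the number of variables and the shingle length.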

QUESTION: Feature importance

To clarify, do you mean: for a set of multidimensional points, which dimension contributes the most to the total codisp over all points in the dataset? These three pages of...