
The problem of RRCF training data to get the model

Open Zhoulinfeng0510 opened this issue 5 years ago • 7 comments

Can RRCF obtain a model from the training set data, and then use this model to detect anomalies in the new data stream?

Zhoulinfeng0510 avatar Sep 18 '20 12:09 Zhoulinfeng0510

Yes. In this case you would:

  • Construct a forest from a fixed training set
  • For each new point in the data stream:
    • Insert the new point into each tree
    • Compute the codisp score of the new point for each tree
    • Delete the new point from each tree
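
The steps above can be sketched as a small helper. This is only an illustration, not code from the rrcf documentation: the function name score_new_point and the temporary label "query" are assumptions made here, while insert_point, codisp, and forget_point are the RCTree methods used in the rrcf README's streaming example. Averaging the per-tree scores is one common way to combine them.

```python
def score_new_point(forest, point, index="query"):
    """Average CoDisp of `point` over all trees, leaving the forest as trained.

    `forest` is a list of rrcf.RCTree objects. `index` is a temporary label
    (hypothetical default) that must not already label a leaf in any tree.
    """
    total = 0.0
    for tree in forest:
        # Temporarily insert the new point under the unused label
        tree.insert_point(point, index=index)
        try:
            # Collusive displacement of the new point in this tree
            total += tree.codisp(index)
        finally:
            # Remove the point again so the tree keeps only its training points
            tree.forget_point(index)
    return total / len(forest)
```

With a forest built from the training set, each incoming point would be scored with something like score_new_point(forest, x_new), and the result compared against a threshold chosen from the scores of known-normal data.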

You can also use a similar approach for classification: https://klabum.github.io/rrcf/classification.html

mdbartos avatar Sep 18 '20 22:09 mdbartos

Yep! I would like to know more about how to obtain such a model. My current understanding is that I should use the to_dict function in the API. Is that correct? If so, could you please give me a specific code example? Thank you very much for your reply.

Zhoulinfeng0510 avatar Sep 21 '20 10:09 Zhoulinfeng0510

This should work:

Train model (same example as in README)

import numpy as np
import pandas as pd
import rrcf

# Set parameters
np.random.seed(0)
n = 2010
d = 3
num_trees = 10
tree_size = 10

# Generate data
X = np.zeros((n, d))
X[:1000,0] = 5
X[1000:2000,0] = -5
X += 0.01*np.random.randn(*X.shape)

# Construct forest
forest = []
while len(forest) < num_trees:
    # Select random subsets of points uniformly from point set
    ixs = np.random.choice(n, size=(n // tree_size, tree_size),
                           replace=False)
    # Add sampled trees to forest
    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)

Save forest to json file

# Write learned model to json file
import json

# Convert forest to list of dictionaries
out_json = [tree.to_dict() for tree in forest]

# Write forest to file
with open('forest.json', 'w') as outfile:
    json.dump(out_json, outfile)

Read forest from json file

# Read json file into new forest
with open('forest.json', 'r') as infile:
    forest_obj = json.load(infile)
    
new_forest = []
for tree_obj in forest_obj:
    tree = rrcf.RCTree.from_dict(tree_obj)
    new_forest.append(tree)

Compare:

>>> forest[0]

>>> 
─+
 β”œβ”€β”€β”€+
 β”‚   β”œβ”€β”€(6)
 β”‚   └───+
 β”‚       β”œβ”€β”€β”€+
 β”‚       β”‚   β”œβ”€β”€(1)
 β”‚       β”‚   └──(4)
 β”‚       └──(8)
 └───+
     β”œβ”€β”€β”€+
     β”‚   β”œβ”€β”€(0)
     β”‚   └───+
     β”‚       β”œβ”€β”€β”€+
     β”‚       β”‚   β”œβ”€β”€(9)
     β”‚       β”‚   └──(5)
     β”‚       └──(2)
     └───+
         β”œβ”€β”€(3)
         └──(7)
>>> new_forest[0]

>>>
─+
 β”œβ”€β”€β”€+
 β”‚   β”œβ”€β”€(6)
 β”‚   └───+
 β”‚       β”œβ”€β”€β”€+
 β”‚       β”‚   β”œβ”€β”€(1)
 β”‚       β”‚   └──(4)
 β”‚       └──(8)
 └───+
     β”œβ”€β”€β”€+
     β”‚   β”œβ”€β”€(0)
     β”‚   └───+
     β”‚       β”œβ”€β”€β”€+
     β”‚       β”‚   β”œβ”€β”€(9)
     β”‚       β”‚   └──(5)
     β”‚       └──(2)
     └───+
         β”œβ”€β”€(3)
         └──(7)

mdbartos avatar Sep 21 '20 19:09 mdbartos

Okay, I think I now understand how RRCF works in this setting. Thank you very much! :) After further research, I ran into another problem: for multi-dimensional streaming data, calculating codisp is an issue. I used shingle to create a sliding window, which yields data of shape m x n, but the insert_point function only accepts a single point of shape 1 x d. Does rrcf offer a better way to calculate anomaly scores for multidimensional, sliding-window data?

Zhoulinfeng0510 avatar Sep 28 '20 12:09 Zhoulinfeng0510

If you want to use shingles, each point inserted into the tree should be of the form:

[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...

And so on.

Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.
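
The layout described above can be built with a short NumPy helper. This is a minimal sketch under stated assumptions: the function name flatten_shingles is hypothetical, and the input is assumed to be an array with one row per time step and one column per variable.

```python
import numpy as np

def flatten_shingles(series, n):
    """Turn a (T, m) multivariate series into flattened shingles.

    Returns an array of shape (T - n + 1, n * m). Row k holds
    [x_1(t_k..t_k+n-1), x_2(t_k..t_k+n-1), ..., x_m(t_k..t_k+n-1)].
    """
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        # Treat a 1-D input as a single variable
        series = series[:, None]
    T, m = series.shape
    if not 1 <= n <= T:
        raise ValueError("shingle size n must satisfy 1 <= n <= T")
    # Each window is (n, m); transposing to (m, n) before flattening keeps
    # the n values of each variable next to each other
    return np.stack([series[k:k + n].T.ravel() for k in range(T - n + 1)])
```

Each row of the result is a single point of length nm and can be passed to insert_point one at a time, for example tree.insert_point(row, index=k).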

mdbartos avatar Sep 30 '20 21:09 mdbartos

Thank you very much for your sincere reply; I have solved the above problem. However, I have run into the following issue when using RRCF. In Figure 1, a segment in the middle of the data (orange line) is obviously abnormal. In the second figure, however, the highest anomaly score within that segment is only 0.25, and later segments with little anomaly also occasionally score 0.25. This confuses me. Figure_1 Figure_2

Zhoulinfeng0510 avatar Oct 15 '20 07:10 Zhoulinfeng0510

If you want to use shingles, each point inserted into the tree should be of the form:

[x_1(t_1), x_1(t_2), ... x_1(t_n), x_2(t_1), x_2(t_2), ... x_2(t_n), ... x_m(t_1), x_m(t_2), ... x_m(t_n)]
[x_1(t_2), x_1(t_3), ... x_1(t_n+1), x_2(t_2), x_2(t_3), ... x_2(t_n+1), ... x_m(t_2), x_m(t_3), ... x_m(t_n+1)]
...

And so on.

Each point will be of dimension (1 x nm) where n is the shingle size and m is the number of variables.

This should be added to the documentation examples. I didn't see it there, so either I missed it or it isn't documented.

yasirroni avatar Oct 26 '20 14:10 yasirroni