rrcf icon indicating copy to clipboard operation
rrcf copied to clipboard

Streaming code

Open ashujainml opened this issue 6 years ago • 5 comments
trafficstars

Hi,

I tested the streaming code( https://klabum.github.io/rrcf/streaming.html) and found that all trees are similar to each other compared to the batch processing code (https://klabum.github.io/rrcf/batch.html). I spent some time understanding the issue and noticed that if we always remove the oldest point from all trees and insert new point in all trees, all trees will be identical in nature.

If tree is above permitted size...

    if len(tree.leaves) > tree_size:
        # Drop the oldest point (FIFO)
        tree.forget_point(index - tree_size)
    # Insert the new point into the tree
    tree.insert_point(point, index=index)

Am I missing something?

ashujainml avatar Sep 20 '19 07:09 ashujainml

I have the same question as @ashujainml. Shouldn't the points inserted in to each tree in the forest be different, similar to what is accomplished in the batch examples with uniform random sampling?

Chmavo avatar Jan 14 '20 19:01 Chmavo

@mdbartos if you have time could you please comment on this issue? I am using the same code as the sine wave streaming example, and when I use a seed value for the random number generator, all of the trees are similar. I get the same results with 1 tree as I do with 40 trees, and I see the same behavior with my own sample data.

Chmavo avatar May 20 '20 18:05 Chmavo

Greetings,

Thanks for the question. Can you tell me what you mean by similar? Do you mean the trees are all exactly the same?

This should not happen unless there's a bug. Even if the trees incorporate the exact same points, the structure of the trees should be different because the partitioning algorithm is randomized.

See, for example:

import numpy as np
import rrcf
np.random.seed(0)

X = np.random.randn(8, 3)

tree_0 = rrcf.RCTree()
tree_1 = rrcf.RCTree()

for index, point in enumerate(X):
    tree_0.insert_point(point, index=index)
    tree_1.insert_point(point, index=index)
print(tree_0)

─+
 ├───+
 │   ├───+
 │   │   ├──(6)
 │   │   └───+
 │   │       ├───+
 │   │       │   ├───+
 │   │       │   │   ├───+
 │   │       │   │   │   ├──(4)
 │   │       │   │   │   └──(2)
 │   │       │   │   └──(0)
 │   │       │   └──(7)
 │   │       └──(5)
 │   └──(1)
 └──(3)
print(tree_1)

─+
 ├───+
 │   ├──(6)
 │   └──(2)
 └───+
     ├───+
     │   ├───+
     │   │   ├───+
     │   │   │   ├──(7)
     │   │   │   └───+
     │   │   │       ├──(3)
     │   │   │       └──(4)
     │   │   └──(5)
     │   └──(0)
     └──(1)

Thus, even with the exact same point set you should be able to build an ensemble. Let me know if this helps.

Thanks, MDB

mdbartos avatar May 21 '20 01:05 mdbartos

@mdbartos thank you very much for the reply. After reviewing your example and replicating it with my data, I believe the error was in my understanding of how random_state worked when constructing the random-cut trees. If i pass an integer value as random_state to RCTree(), I get the same trees every time when I use the same points. For example, the code below yields identical trees:

tree_0 = rrcf.RCTree(random_state=123)
tree_1 = rrcf.RCTree(random_state=123)
X = np.random.randn(8, 3)
for index, point in enumerate(X):
    tree_0.insert_point(point, index=index)
    tree_1.insert_point(point, index=index)

Chmavo avatar May 21 '20 16:05 Chmavo

Ah now I see. Yes, this is the reason that random_state was added--it gives you a way of replicating exactly the same tree (for testing purposes, replicating results, etc.). Otherwise you'd get a different tree each time.

mdbartos avatar May 21 '20 17:05 mdbartos