
inference on new data samples

Open sophark opened this issue 4 years ago • 3 comments

Hi, thanks for implementing this. I have a use case that requires training the rrcf on a given dataset and then predicting on unseen data samples that were not seen during training.

Can we use batch mode to achieve that? One simple solution I can come up with is to first insert the point into the forest, calculate the codisp, and then delete it. I am wondering whether there is a smarter way to reduce the inference time?

Thanks.

sophark avatar Dec 16 '19 22:12 sophark

That's probably the most flexible way to do it (create forest from point set S using batch mode -> insert new point x into each tree -> compute codisp -> delete point x). But yes, it will probably be slow. Parallelizing can help though.
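The insert, score, delete workflow described above can be sketched roughly as follows. The helper name `score_unseen_point`, the temporary label `"query"`, and the `demo` function (with its forest size and synthetic data) are illustrative assumptions, not part of rrcf; the only library calls relied on are `RCTree`, `insert_point`, `codisp`, and `forget_point`.

```python
from statistics import mean


def score_unseen_point(forest, point, index="query"):
    """Average CoDisp of `point` across the forest (hypothetical helper).

    `index` is a temporary leaf label; it is assumed not to collide
    with any label already stored in the trees.
    """
    scores = []
    for tree in forest:
        tree.insert_point(point, index=index)  # temporarily add the point
        try:
            scores.append(tree.codisp(index))  # score it in this tree
        finally:
            tree.forget_point(index)           # remove it again
    return mean(scores)


def demo(num_trees=40, n=256, d=3, seed=0):
    """Illustrative usage; requires numpy and rrcf to be installed."""
    import numpy as np
    import rrcf

    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    # Batch mode: each tree is built from the full training set S.
    forest = [rrcf.RCTree(X) for _ in range(num_trees)]
    near = score_unseen_point(forest, np.zeros(d))
    far = score_unseen_point(forest, np.full(d, 10.0))
    return near, far
```

Calling `demo()` returns the averaged scores for a point near the training data and one far from it. The `try`/`finally` is there so a failure while scoring does not leave the temporary point in a tree.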

I'm not sure if this is helpful, but note that the insert_point algorithm is guaranteed to produce a tree drawn from RRCF(S \union x), where S is a point set, and x is an additional point.

In other words, the following two trees are statistically indistinguishable, and the codisp of x will be the same in expectation:

  • Create tree T' from point set (S \union x) via batch mode.
  • Create tree T from point set S via batch mode and then insert x, resulting in tree T'.

mdbartos avatar Dec 17 '19 03:12 mdbartos

> That's probably the most flexible way to do it (create forest from point set S using batch mode -> insert new point x into each tree -> compute codisp -> delete point x). But yes, it will probably be slow. Parallelizing can help though.

Thanks for your hints. Yes, it is indeed a little slow without parallelization. Do you know which of the steps above consumes the most time, and what its time complexity is? My guess is the insert_point step?

sophark avatar Dec 17 '19 19:12 sophark

Yeah, I would say insert_point is the slowest step. I have time breakdowns here: https://github.com/kLabUM/rrcf/issues/28
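To check this on your own data, a rough per-step timing sketch could look like the one below. The helper name `time_inference_steps` and the temporary label `"query"` are assumptions made for illustration; the function only assumes each tree exposes `insert_point`, `codisp`, and `forget_point`, as `rrcf.RCTree` does.

```python
from time import perf_counter


def time_inference_steps(forest, point, index="query"):
    """Total seconds spent in each step over all trees (hypothetical helper)."""
    totals = {"insert_point": 0.0, "codisp": 0.0, "forget_point": 0.0}
    for tree in forest:
        t0 = perf_counter()
        tree.insert_point(point, index=index)  # add the unseen point
        t1 = perf_counter()
        tree.codisp(index)                     # score it
        t2 = perf_counter()
        tree.forget_point(index)               # remove it again
        t3 = perf_counter()
        totals["insert_point"] += t1 - t0
        totals["codisp"] += t2 - t1
        totals["forget_point"] += t3 - t2
    return totals
```

Timings from a single point are noisy, so it is probably worth summing the results over many query points before comparing the steps.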

mdbartos avatar Dec 18 '19 02:12 mdbartos