rrcf
insert_point is slow
Running a line profiler on insert_point shows the top time consumers:
%lprun -f tree.insert_point tree.insert_point(inlier, index='inlier')
Line # Hits Time Per Hit % Time Line Contents
==============================================================
...
302 13 6400.0 492.3 69.6 bbox = self.get_bbox(node)
303 13 2214.0 170.3 24.1 cut_dimension, cut = self._insert_point_cut(point, bbox)
...
Within _insert_point_cut:
%lprun -f tree._insert_point_cut tree.insert_point(inlier, index='inlier')
Line # Hits Time Per Hit % Time Line Contents
==============================================================
...
520 13 267.0 20.5 10.1 bbox_hat = np.empty(bbox.shape)
521 # Update the bounding box based on the internal point
522 13 363.0 27.9 13.8 bbox_hat[0, :] = np.minimum(bbox[0, :], point)
523 13 808.0 62.2 30.7 bbox_hat[1, :] = np.maximum(bbox[1, :], point)
524 13 85.0 6.5 3.2 b_span = bbox_hat[1, :] - bbox_hat[0, :]
525 13 309.0 23.8 11.7 b_range = b_span.sum()
526 13 213.0 16.4 8.1 r = np.random.uniform(0, b_range)
527 13 246.0 18.9 9.3 span_sum = np.cumsum(b_span)
...
Storing the bounding box at each branch node would avoid recomputing it on every call to get_bbox, potentially cutting ~70% of the time spent.
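The caching idea can be sketched like this (the `bbox` attribute and helper names below are illustrative, not rrcf's actual internals): store the union of the children's boxes on each branch when it is created, and widen the cached boxes of the ancestors in place after an insert, so looking up a node's bbox becomes O(1) instead of a subtree traversal.

```python
import numpy as np

class Branch:
    """Minimal stand-in for a branch node that caches its bounding box."""
    def __init__(self, left_bbox, right_bbox):
        # Cached bbox is the union of the two children's boxes,
        # stored as a (2, d) array: row 0 = mins, row 1 = maxes.
        self.bbox = np.vstack([
            np.minimum(left_bbox[0], right_bbox[0]),
            np.maximum(left_bbox[-1], right_bbox[-1]),
        ])

def tighten_bbox_upwards(path_to_root, point):
    """After inserting `point`, widen each ancestor's cached bbox in place.

    Only the nodes on the leaf-to-root path need updating; no
    subtree traversal is required.
    """
    for node in path_to_root:
        node.bbox[0] = np.minimum(node.bbox[0], point)
        node.bbox[-1] = np.maximum(node.bbox[-1], point)
```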
After storing the bboxes, the time breakdown for insert_point is:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
297 1 118.0 118.0 4.3 duplicate = self.find_duplicate(point, tolerance=tolerance)
...
309 13 1936.0 148.9 70.9 cut_dimension, cut = self._insert_point_cut(point, bbox)
...
323 12 138.0 11.5 5.1 if point[node.q] <= node.p:
...
347 1 168.0 168.0 6.2 self._tighten_bbox_upwards(branch)
And within _insert_point_cut:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
573 13 51.0 3.9 2.9 bbox_hat = np.empty(bbox.shape)
574 # Update the bounding box based on the internal point
575 13 320.0 24.6 17.9 bbox_hat[0, :] = np.minimum(bbox[0, :], point)
576 13 510.0 39.2 28.6 bbox_hat[-1, :] = np.maximum(bbox[-1, :], point)
577 13 101.0 7.8 5.7 b_span = bbox_hat[-1, :] - bbox_hat[0, :]
578 13 215.0 16.5 12.1 b_range = b_span.sum()
579 13 149.0 11.5 8.4 r = np.random.uniform(0, b_range)
580 13 170.0 13.1 9.5 span_sum = np.cumsum(b_span)
581 13 19.0 1.5 1.1 cut_dimension = np.inf
582 19 50.0 2.6 2.8 for j in range(len(span_sum)):
583 19 37.0 1.9 2.1 if span_sum[j] >= r:
584 13 13.0 1.0 0.7 cut_dimension = j
585 13 12.0 0.9 0.7 break
586 13 91.0 7.0 5.1 if not np.isfinite(cut_dimension):
587 raise ValueError("Cut dimension is not finite.")
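The explicit for-loop over span_sum in the profile above can be collapsed into a single vectorized call. A hedged sketch of an equivalent routine (mirroring the profiled lines, but not rrcf's exact implementation):

```python
import numpy as np

def insert_point_cut(point, bbox, rng=np.random.default_rng()):
    """Sketch of the cut-selection step profiled above.

    Extends the bounding box to include `point`, then picks a cut
    dimension with probability proportional to its side length.
    The for-loop over the cumulative span sums is replaced by one
    np.searchsorted call.
    """
    bbox_hat = np.empty(bbox.shape)
    bbox_hat[0, :] = np.minimum(bbox[0, :], point)
    bbox_hat[-1, :] = np.maximum(bbox[-1, :], point)
    b_span = bbox_hat[-1, :] - bbox_hat[0, :]
    span_sum = np.cumsum(b_span)
    r = rng.uniform(0, span_sum[-1])
    # First index j with span_sum[j] >= r -- the same condition the loop tests.
    cut_dimension = int(np.searchsorted(span_sum, r, side="left"))
    # Place the cut inside the chosen dimension's extended span.
    cut = bbox_hat[0, cut_dimension] + span_sum[cut_dimension] - r
    return cut_dimension, cut
```

For the 2-D boxes profiled here the loop is only 1-2 iterations, so the win is modest, but it grows with the dimensionality of the data.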
Hello,
This algorithm performs really well, but insert_point is really slow. Is there any update on this bottleneck, and would it be possible to create a batch insert that could potentially run faster?
If using batch mode is an option, it can improve speeds by quite a bit. For the taxicab dataset, changing from streaming to batch mode brought the total computation time from ~20 minutes down to 20-30 seconds. See this example: https://klabum.github.io/rrcf/taxi.html
As far as speeding up insert_point goes, I'm not really sure. The major time consumers are numpy's built-in maximum and minimum functions, which are already pretty optimized.
Because trees are independent, benefits could be gained through parallelization. My knowledge of parallelization in python is pretty limited though.
I've seen this example. Does it matter if not all the points are in each tree of the forest? I'm currently inserting around 50,000 points per tree (it takes around 200 seconds). Maybe that's a bit too much? :)
I was thinking of a batch_insert_point that takes an np.ndarray of shape (l, d).
With this, you can compute the minimum and maximum with axis=0, so the min and max are computed a single time for all the points. Likewise, find_duplicate could be run once, and maybe other checks inside insert_point could also be done a single time.
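The axis=0 idea can be sketched in a couple of lines (`batch_bbox` is a hypothetical helper, not an rrcf function; only the bounding-box part of the proposed batch insert is shown):

```python
import numpy as np

def batch_bbox(points):
    """Bounding box for a whole (l, d) batch of points.

    A single min/max over axis=0 replaces l separate per-point
    np.minimum/np.maximum calls.
    """
    points = np.atleast_2d(points)
    return np.vstack([points.min(axis=0), points.max(axis=0)])
```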
Parallelization in Python is incredibly easy. I've already done it myself, so I can write it down here.
from pathos.multiprocessing import ProcessingPool
from multiprocessing import cpu_count

def _insert_multiproc(self, features, n_jobs: int = -1):
    nb_cpu = cpu_count()
    if n_jobs >= nb_cpu or n_jobs == -1:
        n_jobs = nb_cpu
    pool = ProcessingPool(nodes=n_jobs)
    all_distances = pool.map(RCTree._cmp_detect, [features] * self.nb_trees)
    distance = np.mean(np.array(all_distances), axis=0)
    thresh = np.quantile(distance, 1 - self.contamination)
    return distance, thresh

@staticmethod
def _cmp_detect(features):
    tree = rrcf.RCTree()
    for i in range(features.shape[0]):
        tree.insert_point(features[i], index=i)
    return [tree.codisp(leaf) for leaf in tree.leaves]
Hi @atthom,
Thanks for taking a look at this.
The points don't need to be in every single tree as long as you make sure you're averaging codisp properly. Ultimately, you're generating an estimator for codisp. As a side note, it would be interesting to look at the properties of the codisp estimator as a function of the number of trees, tree size, etc.
Have you tested how much of a speedup you're getting from parallelizing? I briefly experimented with using the multiprocessing module but wasn't able to get a significant speedup.
For batch insertion, you would need to show that the implementation works out mathematically: in other words, that the new tree is drawn from the distribution T'(S' ∪ {p}): https://klabum.github.io/rrcf/insert-and-delete.html
It would probably require a completely different insert_point algorithm, because the current one assumes that points are added sequentially (which changes the bounding boxes/tree structure each time a point is inserted).
I tested the parallelization. It's the perfect use case because the trees are independent: each processor can build its trees separately. With N processors and K trees in the forest, the compute time dropped to nearly 1/N of the sequential time, probably because the codisp computation is also inside the multiprocessed function.
Now I understand why inserting multiple points at once is not easy! I'll have to read a bit more about the bounding boxes and the data structure if I want a faster insert_point :)
Sounds good. If you want to contribute any parallelization code, feel free to submit a pull request.
I wonder if Dask would make sense here?
Definitely worth looking into 👍