rrcf
insert_point is slow
Running a line profiler on insert_point shows the top time consumers:
%lprun -f tree.insert_point tree.insert_point(inlier, index='inlier')
Line # Hits Time Per Hit % Time Line Contents
==============================================================
...
302 13 6400.0 492.3 69.6 bbox = self.get_bbox(node)
303 13 2214.0 170.3 24.1 cut_dimension, cut = self._insert_point_cut(point, bbox)
...
Within _insert_point_cut:
%lprun -f tree._insert_point_cut tree.insert_point(inlier, index='inlier')
Line # Hits Time Per Hit % Time Line Contents
==============================================================
...
520 13 267.0 20.5 10.1 bbox_hat = np.empty(bbox.shape)
521 # Update the bounding box based on the internal point
522 13 363.0 27.9 13.8 bbox_hat[0, :] = np.minimum(bbox[0, :], point)
523 13 808.0 62.2 30.7 bbox_hat[1, :] = np.maximum(bbox[1, :], point)
524 13 85.0 6.5 3.2 b_span = bbox_hat[1, :] - bbox_hat[0, :]
525 13 309.0 23.8 11.7 b_range = b_span.sum()
526 13 213.0 16.4 8.1 r = np.random.uniform(0, b_range)
527 13 246.0 18.9 9.3 span_sum = np.cumsum(b_span)
...
Storing the bounding box at each branch node would avoid recomputing it on every call to get_bbox, potentially cutting ~70% of the time spent.
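The caching idea can be sketched like this (the `bbox` attribute and helper names below are illustrative, not rrcf's actual internals): store the union of the children's boxes on each branch when it is created, and widen the cached boxes of the ancestors in place after an insert, so looking up a node's bbox becomes O(1) instead of a subtree traversal.

```python
import numpy as np

class Branch:
    """Minimal stand-in for a branch node that caches its bounding box."""
    def __init__(self, left_bbox, right_bbox):
        # Cached bbox is the union of the two children's boxes,
        # stored as a (2, d) array: row 0 = mins, row 1 = maxes.
        self.bbox = np.vstack([
            np.minimum(left_bbox[0], right_bbox[0]),
            np.maximum(left_bbox[-1], right_bbox[-1]),
        ])

def tighten_bbox_upwards(path_to_root, point):
    """After inserting `point`, widen each ancestor's cached bbox in place.

    Only the nodes on the leaf-to-root path need updating; no
    subtree traversal is required.
    """
    for node in path_to_root:
        node.bbox[0] = np.minimum(node.bbox[0], point)
        node.bbox[-1] = np.maximum(node.bbox[-1], point)
```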
After storing the bboxes, the time breakdown for insert_point is:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
297 1 118.0 118.0 4.3 duplicate = self.find_duplicate(point, tolerance=tolerance)
...
309 13 1936.0 148.9 70.9 cut_dimension, cut = self._insert_point_cut(point, bbox)
...
323 12 138.0 11.5 5.1 if point[node.q] <= node.p:
...
347 1 168.0 168.0 6.2 self._tighten_bbox_upwards(branch)
And within _insert_point_cut:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
573 13 51.0 3.9 2.9 bbox_hat = np.empty(bbox.shape)
574 # Update the bounding box based on the internal point
575 13 320.0 24.6 17.9 bbox_hat[0, :] = np.minimum(bbox[0, :], point)
576 13 510.0 39.2 28.6 bbox_hat[-1, :] = np.maximum(bbox[-1, :], point)
577 13 101.0 7.8 5.7 b_span = bbox_hat[-1, :] - bbox_hat[0, :]
578 13 215.0 16.5 12.1 b_range = b_span.sum()
579 13 149.0 11.5 8.4 r = np.random.uniform(0, b_range)
580 13 170.0 13.1 9.5 span_sum = np.cumsum(b_span)
581 13 19.0 1.5 1.1 cut_dimension = np.inf
582 19 50.0 2.6 2.8 for j in range(len(span_sum)):
583 19 37.0 1.9 2.1 if span_sum[j] >= r:
584 13 13.0 1.0 0.7 cut_dimension = j
585 13 12.0 0.9 0.7 break
586 13 91.0 7.0 5.1 if not np.isfinite(cut_dimension):
587 raise ValueError("Cut dimension is not finite.")
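The explicit for-loop over span_sum in the profile above can be collapsed into a single vectorized call. A hedged sketch of an equivalent routine (mirroring the profiled lines, but not rrcf's exact implementation):

```python
import numpy as np

def insert_point_cut(point, bbox, rng=np.random.default_rng()):
    """Sketch of the cut-selection step profiled above.

    Extends the bounding box to include `point`, then picks a cut
    dimension with probability proportional to its side length.
    The for-loop over the cumulative span sums is replaced by one
    np.searchsorted call.
    """
    bbox_hat = np.empty(bbox.shape)
    bbox_hat[0, :] = np.minimum(bbox[0, :], point)
    bbox_hat[-1, :] = np.maximum(bbox[-1, :], point)
    b_span = bbox_hat[-1, :] - bbox_hat[0, :]
    span_sum = np.cumsum(b_span)
    r = rng.uniform(0, span_sum[-1])
    # First index j with span_sum[j] >= r -- the same condition the loop tests.
    cut_dimension = int(np.searchsorted(span_sum, r, side="left"))
    # Place the cut inside the chosen dimension's extended span.
    cut = bbox_hat[0, cut_dimension] + span_sum[cut_dimension] - r
    return cut_dimension, cut
```

For the 2-D boxes profiled here the loop is only 1-2 iterations, so the win is modest, but it grows with the dimensionality of the data.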
Hello,
This algorithm performs really well, but insert_point is really slow. Is there any update on this bottleneck, and would it be possible to create a batch insert that could potentially run faster?
If using batch mode is an option, it can improve speeds by quite a bit. For the taxicab dataset, changing from streaming to batch mode brought the total computation time from ~20 minutes down to 20-30 seconds. See this example: https://klabum.github.io/rrcf/taxi.html
As far as speeding up insert_point goes, I'm not really sure. The major time consumers are numpy's built-in maximum and minimum functions, which are already pretty optimized.
Because trees are independent, benefits could be gained through parallelization. My knowledge of parallelization in python is pretty limited though.
I've seen this example. Does it matter if not all the points are in each tree of the forest? I'm currently inserting around 50,000 points per tree (it takes around 200 seconds). Maybe that's a bit too much? :)
I was thinking of a batch_insert_point that takes an np.ndarray of shape (l, d).
With this, you can compute the minimum and maximum with axis=0, so the min and max are computed a single time for all the points. Likewise, find_duplicate could be run once, and maybe other checks inside insert_point could also be done a single time.
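The axis=0 idea can be sketched in a couple of lines (`batch_bbox` is a hypothetical helper, not an rrcf function; only the bounding-box part of the proposed batch insert is shown):

```python
import numpy as np

def batch_bbox(points):
    """Bounding box for a whole (l, d) batch of points.

    A single min/max over axis=0 replaces l separate per-point
    np.minimum/np.maximum calls.
    """
    points = np.atleast_2d(points)
    return np.vstack([points.min(axis=0), points.max(axis=0)])
```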
Parallelization in Python is incredibly easy. I've already done it myself, so I can write it down here.
from pathos.multiprocessing import ProcessingPool
from multiprocessing import cpu_count

def _insert_multiproc(self, features, n_jobs: int = -1):
    nb_cpu = cpu_count()
    if n_jobs >= nb_cpu or n_jobs == -1:
        n_jobs = nb_cpu
    pool = ProcessingPool(nodes=n_jobs)
    all_distances = pool.map(RCTree._cmp_detect, [features] * self.nb_trees)
    distance = np.mean(np.array(all_distances), axis=0)
    thresh = np.quantile(distance, 1 - self.contamination)
    return distance, thresh

@staticmethod
def _cmp_detect(features):
    tree = rrcf.RCTree()
    for i in range(features.shape[0]):
        tree.insert_point(features[i], index=i)
    return [tree.codisp(leaf) for leaf in tree.leaves]
Hi @atthom,
Thanks for taking a look at this.
The points don't need to be in every single tree as long as you make sure you're averaging codisp properly. Ultimately, you're generating an estimator for codisp. As a side note, it would be interesting to look at the properties of the codisp estimator as a function of the number of trees, tree size, etc.
Have you tested how much of a speedup you're getting from parallelizing? I briefly experimented with using the multiprocessing module but wasn't able to get a significant speedup.
For batch insertion, you would need to show that the implementation works out mathematically: in other words, that the new tree is drawn from the distribution T'(S' ∪ {p}): https://klabum.github.io/rrcf/insert-and-delete.html
It would probably require a completely different insert_point algorithm, because the current one assumes that points are added sequentially (which changes the bounding boxes/tree structure each time a point is inserted).
I tested the parallelization. It's the perfect use case because the trees are independent: each processor can build its trees separately. With N processors and K trees in the forest, the compute time dropped to nearly 1/N of the sequential time, probably because the codisp computation is also inside the multiprocessed function.
Now I understand why inserting multiple points at once is not easy! I'll have to read a bit more about the bounding boxes and the data structure if I want a faster insert_point :)
Sounds good. If you want to contribute any parallelization code, feel free to submit a pull request.
I wonder if Dask would make sense here?
Definitely worth looking into 👍