pyRANSAC-3D
Interest in threading?
I was playing around with this library, and I found that because of the heavy usage of numpy (which releases the global interpreter lock), a simple ThreadPoolExecutor running the internals of the for loops allows for a 5x speedup on my laptop.
I was wondering if there was any interest in adding this to the library?
This is the API I was thinking would work. Each RANSAC class would subclass this to enable parallelization. Also, I added the ability to pass a seed, so that you can run deterministic RANSAC tests.
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from random import Random
from typing import List, Tuple

import numpy as np


class BaseParallelRansac(ABC):
    def __init__(self, seed=None, n_workers=None):
        self.executor = ThreadPoolExecutor(max_workers=n_workers)
        self.random = Random(seed)

    def fit(
        self, points: np.ndarray, thresh: float = 0.05, max_iteration: int = 5000
    ) -> Tuple[List[int], List[int]]:
        """
        :param points: A numpy array of points, of shape (# points, 3)
        :param thresh: The distance threshold to include points as inliers
        :param max_iteration: How many (parallel) RANSAC iterations to run
        :returns:
            best_eq: A list of the coefficients of the best 'equation' for the primitive shape.
            best_inliers: A list of indices of points that fit the shape.
        """
        best_eq = []
        best_inliers = []

        # Each job carries the shared RNG, the point cloud, and the threshold.
        jobs = ((self.random, points, float(thresh)) for _ in range(max_iteration))
        for eq, point_id_inliers in self.executor.map(self.iteration, *zip(*jobs)):
            if len(point_id_inliers) > len(best_inliers):
                best_eq = eq
                best_inliers = point_id_inliers
        return best_eq, best_inliers

    @staticmethod
    @abstractmethod
    def iteration(
        random: Random, points: np.ndarray, thresh: float
    ) -> Tuple[List[int], List[int]]:
        pass


class Cuboid(BaseParallelRansac):
    @staticmethod
    def iteration(random, points, thresh):
        # Implementation of the inside of the original for loop
        ...
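To illustrate the intended usage, here's a rough sketch. The Plane subclass below is something I threw together just for this example, not the library's actual Plane implementation:

import numpy as np

# Hypothetical concrete subclass, for illustration only.
class Plane(BaseParallelRansac):
    @staticmethod
    def iteration(random, points, thresh):
        # Sample 3 distinct points and fit a plane a*x + b*y + c*z + d = 0.
        ids = random.sample(range(points.shape[0]), 3)
        p0, p1, p2 = points[ids]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm == 0:  # degenerate sample, no inliers
            return [], np.array([], dtype=int)
        normal = normal / norm
        d = -normal.dot(p0)
        # Distances stay in numpy, so the GIL is released for most of the work.
        distances = np.abs(points @ normal + d)
        inliers = np.where(distances <= thresh)[0]
        return [*normal, d], inliers

points = np.random.rand(10_000, 3)  # placeholder point cloud
best_eq, best_inliers = Plane(seed=42, n_workers=4).fit(points, thresh=0.05, max_iteration=1000)
print(best_eq, len(best_inliers))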
Hi @apockill Thank you for your idea. I'm sorry for the late response (I finally finished my master's thesis). I'm getting back to my open-source projects this week and I intend to try your implementation to validate the improved execution time.
I've already seen some RANSAC implementations similar to your suggestion, so I imagine this change will be good.
As soon as I get some results I'll report back here.
Congratulations on finishing your master's thesis! Let me know if you'd like any help implementing this. Cheers
@apockill I managed to run some tests using the ThreadPoolExecutor with the code you suggested. The only primitive I tested was the Plane, so my results could be biased. First, some comments on the code you suggested:
- Thumbs up for using type annotations for parameters and returns. I think this should be added to our library and I plan to add it soon.
- I liked the use of an abstract method that is shared by all primitives. This helps code maintainability and reuse for future RANSAC implementations. The only drawback I see is that it makes the code less clear for someone who is learning about the method; however, that is not the main goal of this library.
As for using the ThreadPoolExecutor, maybe I misused this functionality, but unfortunately I couldn't reproduce the results you described. I compared the default implementation with the threaded code you suggested, varying the number of workers and the RANSAC max iterations, and in all cases the result only got worse compared to the default implementation. Also, the more workers I used, the longer the process took to finish, as you can see in the plot below.
I didn't do an in-depth investigation into why this happened. My theory is that most of the time is being spent in math operations that are not related to numpy, so the thread context switches actually increase the execution time.
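Roughly, the sweep I ran looked like this (a sketch with illustrative parameters, not the exact code that is in the branch; it assumes a parallel Plane subclass like the one sketched above):

import time
import numpy as np

points = np.random.rand(50_000, 3)  # placeholder point cloud, not the real test data

for n_workers in (1, 2, 5, 10, 50):
    for max_iteration in (100, 1000):
        ransac = Plane(seed=0, n_workers=n_workers)  # assumed parallel Plane subclass
        start = time.time()
        ransac.fit(points, thresh=0.05, max_iteration=max_iteration)
        print(f"n_workers={n_workers} max_iteration={max_iteration} dt={time.time() - start:.3f}s")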
I added all the tests to the branch related to this issue. The default code is in the following file: testing_parallel_8.py
The threaded implementation is in: testing_parallel_7.py
Both implementations are also in this Jupyter Notebook, together with the time analysis.
If you have any idea why we got different results, or suggestions for this code, I would be glad to hear more about it =)
Thank you.
This is great testing! I was perplexed, so I spent some time debugging this, and I found the following:
Case 1
When the return statement of the iteration() function converts the numpy arrays back to Python lists:
return list(pts), list(pt_id_inliers)
The performance does not scale with threads:
dt: 13.763442993164062 n_workers 1 n_iterations 1000
dt: 18.53112554550171 n_workers 5 n_iterations 1000
dt: 20.15290331840515 n_workers 10 n_iterations 1000
dt: 20.94913625717163 n_workers 50 n_iterations 1000
Case 2
However, when keeping the points as numpy arrays (no need to enter Python land and hold the GIL):
return pts, pt_id_inliers
Then the performance scales nicely, until you get to high thread counts where it tops out:
dt: 0.7775411605834961 n_workers 1 n_iterations 1000
dt: 0.27210378646850586 n_workers 5 n_iterations 1000
dt: 0.387392520904541 n_workers 10 n_iterations 1000
dt: 0.6901295185089111 n_workers 50 n_iterations 1000
Case 3
Here I just removed the ThreadPoolExecutor entirely and used the built-in map():
for eq, point_id_inliers in map(self.iteration, *zip(*jobs)):
    if len(point_id_inliers) > len(best_inliers):
        best_eq = eq
        best_inliers = point_id_inliers
This is the performance:
dt: 0.7188742160797119 n_iterations 1000
So I guess the takeaway is that threading helps in the 2-5 worker territory (at least for the 1000-iteration test), but only if you don't convert all of the points to a Python list, which holds the GIL for long periods of time.
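If anyone wants to reproduce this effect in isolation, here's a rough self-contained sketch of what I mean (the fake iteration function and array sizes are made up, not the library's actual code):

import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fake_iteration(as_list: bool):
    # Stand-in for one RANSAC iteration: the numpy work releases the GIL...
    pts = np.random.rand(100_000, 3)
    inliers = np.where(pts[:, 2] < 0.05)[0]
    # ...but converting the result to a Python list re-acquires it for a long stretch.
    return list(inliers) if as_list else inliers

for as_list in (True, False):
    for n_workers in (1, 5, 10, 50):
        with ThreadPoolExecutor(max_workers=n_workers) as ex:
            start = time.time()
            list(ex.map(fake_iteration, [as_list] * 1000))
        print(f"as_list={as_list} n_workers={n_workers} dt={time.time() - start:.3f}")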
How interesting. I didn't know working with lists could be that slow. Thank you for your insights.
I made some more optimizations by using numpy for every math operation possible. The plots below (normal and log scale) show the best result and the optimal number of workers on my computer (it might depend on the number of processors of each machine). Using the best number of workers, the results are around 20% faster than the default implementation.
I'll try to optimize even further this week to see if I can achieve better results.
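As an example of the kind of change involved (a sketch, not the actual diff from the branch), a per-point plane-distance loop can be replaced with a single vectorized numpy expression:

import numpy as np

# Pure-Python version: one Python-level operation per point (holds the GIL the whole time).
def plane_distances_loop(points, a, b, c, d):
    denom = (a * a + b * b + c * c) ** 0.5
    return [abs(a * x + b * y + c * z + d) / denom for x, y, z in points]

# Vectorized version: the whole computation happens inside numpy.
def plane_distances_numpy(points, a, b, c, d):
    return np.abs(points @ np.array([a, b, c]) + d) / np.linalg.norm([a, b, c])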