Optimize _find_optimal_split function.

Open · timmens opened this issue 5 years ago · 1 comment

Problem: Right now the function _find_optimal_split is very inefficient. In the inner loop over splitting_points I recompute means and sums from scratch in every iteration, even though I could update running values incrementally.

Solution: Implement a dynamic updating algorithm that finds the best splitting point for a given feature index.
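
For illustration, here is a minimal sketch of what such a dynamic updating scheme could look like for a single feature, assuming a squared-error split criterion; the function and variable names are hypothetical, not the package's actual API:

```python
import numpy as np


def find_optimal_split_sketch(x, y):
    """Find the split on feature x minimizing squared error, via running sums."""
    order = np.argsort(x)
    x_sorted = x[order]
    y_sorted = y[order]

    n = len(y_sorted)
    left_sum = 0.0
    right_sum = float(y_sorted.sum())

    best_score = -np.inf
    best_split = None

    # Move one observation at a time from the right child to the left child
    # and update the two sums in O(1) instead of recomputing them.
    for i in range(n - 1):
        left_sum += y_sorted[i]
        right_sum -= y_sorted[i]

        # Do not split between observations with identical feature values.
        if x_sorted[i] == x_sorted[i + 1]:
            continue

        n_left = i + 1
        n_right = n - n_left

        # Minimizing the total SSE of the two children is equivalent to
        # maximizing n_left * mean_left**2 + n_right * mean_right**2,
        # which equals left_sum**2 / n_left + right_sum**2 / n_right.
        score = left_sum**2 / n_left + right_sum**2 / n_right
        if score > best_score:
            best_score = score
            best_split = 0.5 * (x_sorted[i] + x_sorted[i + 1])

    return best_split
```

With this update scheme each candidate split costs O(1) instead of O(n), so the whole loop over splitting points is O(n) after the initial sort.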

timmens · Feb 10 '20 17:02

What has been done: The commits (aca31b1ea3f76964) and (a2f504f9ccfbab8cd9) improve the speed of the inner loop (over observations) by a big margin. In the first commit I replaced most np.sum() and np.mean() calls with dynamically updated sums. In the second commit I swapped the pd.DataFrame data storage for fast np.array storage and now simply convert the end result to a pd.DataFrame.
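
As a rough illustration of the second change (the structure and column names here are made up for the example, not taken from the code), the pattern is to accumulate results in a plain np.array inside the hot loop and convert to a pd.DataFrame only once at the end:

```python
import numpy as np
import pandas as pd

n_features = 10
results = np.empty((n_features, 3))  # one row of results per feature

for j in range(n_features):
    # ... search for the best split on feature j ...
    best_split, best_loss = 0.0, 0.0  # placeholder values
    results[j] = (j, best_split, best_loss)

# Convert once at the boundary, outside the performance-critical loop.
df = pd.DataFrame(results, columns=["feature", "split_value", "loss"])
```

This avoids the per-iteration overhead of pandas indexing, which is much slower than writing into a preallocated numpy array.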

What still needs to be done:

  1. The code needs to be checked for correctness against the old implementation, and unit tests have to be written.
  2. To make the code even faster, it has to be profiled with numba disabled, since this makes it possible to see which function calls make _find_optimal_split slow (see the sketch below this list). Profiling so far has shown that _find_optimal_split is still the only major concern.
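
For reference, one way to do this is to set numba's documented NUMBA_DISABLE_JIT environment variable before the jitted code is imported, so that @njit-decorated functions run as plain Python and show up in the profile; the commented import path below is hypothetical:

```python
import os

# Must be set before numba compiles anything, i.e. before importing the package.
os.environ["NUMBA_DISABLE_JIT"] = "1"

import cProfile
import pstats

# Hypothetical import path; adjust to wherever _find_optimal_split lives.
# from cforest.forest import fit_causalforest

def run():
    ...  # call the estimation routine that eventually hits _find_optimal_split

profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```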

timmens · Feb 10 '20 22:02