similarity_measures
Performance improvement proposal: using vectorization in curve_length_measure function
Hi, I hope everything is well.
Looking at the code in `curve_length_measure`, the computation of `r_sq` did not seem to leverage vectorization: all elements were calculated one at a time in a loop:
```python
r_sq = np.zeros(n)
for i in range(0, n):
    lieq = le_sum[i]*(lc_nj/le_nj)
    xtemp = np.interp(lieq, lc_sum, x_c)
    ytemp = np.interp(lieq, lc_sum, y_c)
    r_sq[i] = np.log(1.0 + (np.abs((xtemp-x_e[i])/xmean)))**2 + \
        np.log(1.0 + (np.abs((ytemp-y_e[i])/ymean)))**2
```
Rewriting it without the loop should leverage numpy vectorization:
```python
factor = lc_nj/le_nj
lieq = le_sum * factor
xinterp = np.interp(lieq, lc_sum, x_c)
yinterp = np.interp(lieq, lc_sum, y_c)
r_sq = np.log(1.0 + (np.abs((xinterp-x_e)/xmean)))**2 + \
    np.log(1.0 + (np.abs((yinterp-y_e)/ymean)))**2
```

(Note that the absolute value is kept around the whole quotient, as in the original loop, so the result is identical even if `xmean` or `ymean` is negative.)
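To check that the rewrite is numerically equivalent to the loop, one can run both versions on the same inputs and compare. The sketch below uses synthetic data; the array names (`le_sum`, `lc_sum`, `x_c`, `y_c`, `x_e`, `y_e`, `lc_nj`, `le_nj`, `xmean`, `ymean`) follow the snippets above, but the values here are made up for illustration:

```python
import numpy as np

# Synthetic stand-ins for the quantities computed inside curve_length_measure.
rng = np.random.default_rng(0)
n = 100
x_c = np.sort(rng.random(50))
y_c = rng.random(50)
x_e = rng.random(n)
y_e = rng.random(n)
lc_sum = np.sort(rng.random(50))   # cumulative lengths must be increasing for np.interp
le_sum = np.sort(rng.random(n))
lc_nj = lc_sum[-1]
le_nj = le_sum[-1]
xmean = x_c.mean()
ymean = y_c.mean()

# Original loop version.
r_sq_loop = np.zeros(n)
for i in range(0, n):
    lieq = le_sum[i]*(lc_nj/le_nj)
    xtemp = np.interp(lieq, lc_sum, x_c)
    ytemp = np.interp(lieq, lc_sum, y_c)
    r_sq_loop[i] = np.log(1.0 + (np.abs((xtemp-x_e[i])/xmean)))**2 + \
        np.log(1.0 + (np.abs((ytemp-y_e[i])/ymean)))**2

# Vectorized version.
factor = lc_nj/le_nj
lieq = le_sum * factor
xinterp = np.interp(lieq, lc_sum, x_c)
yinterp = np.interp(lieq, lc_sum, y_c)
r_sq_vec = np.log(1.0 + (np.abs((xinterp-x_e)/xmean)))**2 + \
    np.log(1.0 + (np.abs((yinterp-y_e)/ymean)))**2

assert np.allclose(r_sq_loop, r_sq_vec)
```

This works because `np.interp` already accepts an array of query points, so the per-element interpolation inside the loop maps directly onto a single call.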
On my machine this led to a performance improvement of more than 3x, taking the average execution time for a benchmark from 0.055 seconds to 0.015 seconds.
The benchmark consisted of timing `curve_length_measure` on the curves `curve1` and `curve2` from tests/tests.py, with a script like:
```python
import similaritymeasures
import timeit
import numpy as np

x1 = np.linspace(0.0, 1.0, 500)
y1 = np.ones(500)*2
x2 = np.linspace(0.0, 1.0, 250)
y2 = np.ones(250)
curve1 = np.array((x1, y1)).T
curve2 = np.array((x2, y2)).T

def run_curve_length_measure_c1_c2():
    return similaritymeasures.curve_length_measure(curve1, curve2)

n_repeats = 50
n_runs = 20
times_list = timeit.repeat(run_curve_length_measure_c1_c2,
                           repeat=n_repeats, number=n_runs)
total = sum(times_list)
avg = total / len(times_list)
print(total)
print(avg)
```
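One caveat about reading the numbers: each entry of `times_list` covers `number=n_runs` calls, so `avg` is the time for a batch of 20 calls, not for one call. A small sketch of how the per-call figures could be derived (the busy-work lambda here is just a placeholder for `run_curve_length_measure_c1_c2`):

```python
import timeit

n_repeats = 50
n_runs = 20
# Placeholder workload standing in for the real benchmark function.
times_list = timeit.repeat(lambda: sum(range(1000)),
                           repeat=n_repeats, number=n_runs)

# Each timing covers n_runs calls, so divide by the call count.
per_call_avg = sum(times_list) / (len(times_list) * n_runs)
# The minimum batch is the least-noise estimate, as the timeit docs suggest.
best_per_call = min(times_list) / n_runs
print(per_call_avg, best_per_call)
```

The relative comparison in the proposal (0.055 s vs 0.015 s per batch) is unaffected by this, since both versions were measured the same way.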
I hope this can be of interest.