
Performance improvement proposal: using vectorization in curve_length_measure function

Open nucccc opened this issue 5 months ago • 0 comments

Hi, I hope everything is well.

Looking at the code in curve_length_measure, the way r_sq is computed did not seem to leverage vectorization: every element is calculated one at a time in a Python loop:

r_sq = np.zeros(n)
for i in range(0, n):
    lieq = le_sum[i]*(lc_nj/le_nj)
    xtemp = np.interp(lieq, lc_sum, x_c)
    ytemp = np.interp(lieq, lc_sum, y_c)

    r_sq[i] = np.log(1.0 + (np.abs((xtemp-x_e[i])/xmean)))**2 + \
        np.log(1.0 + (np.abs((ytemp-y_e[i])/ymean)))**2

Rewriting it without the loop leverages NumPy vectorization, since np.interp accepts an array of query points:

factor = lc_nj/le_nj

lieq = le_sum * factor

xinterp = np.interp(lieq, lc_sum, x_c)
yinterp = np.interp(lieq, lc_sum, y_c)

r_sq = np.log(1.0 + np.abs((xinterp - x_e) / xmean))**2 + \
    np.log(1.0 + np.abs((yinterp - y_e) / ymean))**2
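As a sanity check, the loop and the vectorized version can be compared on synthetic data. The sketch below mirrors the variable names from the snippets above, but the input arrays are made up for illustration and are not taken from curve_length_measure itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins for the quantities used in the snippets above
n = 100
x_c = rng.random(50)
y_c = rng.random(50)
x_e = rng.random(n)
y_e = rng.random(n)
lc_sum = np.linspace(0.0, 1.0, 50)   # cumulative lengths, curve c
le_sum = np.linspace(0.0, 1.0, n)    # cumulative lengths, curve e
lc_nj, le_nj = lc_sum[-1], le_sum[-1]
xmean, ymean = x_e.mean(), y_e.mean()

# original loop version
r_sq_loop = np.zeros(n)
for i in range(n):
    lieq = le_sum[i] * (lc_nj / le_nj)
    xtemp = np.interp(lieq, lc_sum, x_c)
    ytemp = np.interp(lieq, lc_sum, y_c)
    r_sq_loop[i] = np.log(1.0 + np.abs((xtemp - x_e[i]) / xmean))**2 + \
        np.log(1.0 + np.abs((ytemp - y_e[i]) / ymean))**2

# vectorized version: np.interp evaluates all query points at once
lieq = le_sum * (lc_nj / le_nj)
xinterp = np.interp(lieq, lc_sum, x_c)
yinterp = np.interp(lieq, lc_sum, y_c)
r_sq_vec = np.log(1.0 + np.abs((xinterp - x_e) / xmean))**2 + \
    np.log(1.0 + np.abs((yinterp - y_e) / ymean))**2

print(np.allclose(r_sq_loop, r_sq_vec))  # True
```

Both variants perform the same arithmetic element by element, so the results agree to floating-point precision.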

On my machine this led to a performance improvement of more than 3x, taking the average execution time for a benchmark from 0.055 seconds to 0.015 seconds.

The benchmark consisted of the curves curve1 and curve2 from tests/tests.py, whose execution time was measured with a script like:

import similaritymeasures

import timeit
import numpy as np

x1 = np.linspace(0.0, 1.0, 500)
y1 = np.ones(500)*2
x2 = np.linspace(0.0, 1.0, 250)
y2 = np.ones(250)

curve1 = np.array((x1, y1)).T
curve2 = np.array((x2, y2)).T

def run_curve_length_measure_c1_c2():
    return similaritymeasures.curve_length_measure(curve1, curve2)

n_repeats = 50
n_runs = 20

times_list = timeit.repeat(run_curve_length_measure_c1_c2, repeat=n_repeats, number=n_runs)

total = sum(times_list)
avg = total / len(times_list)  # average time per batch of n_runs calls

print(total)
print(avg)
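Note that each entry returned by timeit.repeat is the wall time for n_runs calls, so the reported averages (0.055 s and 0.015 s) are per batch, not per call. A small sketch of the conversion, using a hypothetical noop stand-in for the benchmarked function:

```python
import timeit

n_repeats = 50
n_runs = 20

def noop():
    # hypothetical stand-in for run_curve_length_measure_c1_c2
    pass

# each sample is the total wall time for n_runs calls of noop
times_list = timeit.repeat(noop, repeat=n_repeats, number=n_runs)

# divide by n_runs to get a per-call figure; min() is the usual
# choice for timeit results, as it is least affected by system noise
per_call = min(times_list) / n_runs
print(per_call)
```

For the numbers above, 0.055 s per batch of 20 runs works out to roughly 2.75 ms per call before the change and 0.75 ms after.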

I hope this can be of interest.

nucccc, Aug 24 '24 15:08