Optimize aamp for `p==2` (and maybe `p==1`)
Currently, aamp accepts the argument p (with a default of 2.0) to calculate the p-norm distance. According to code written by the SciPy community, it is better to use np.square(...) when p is 2.
Let's compare:
| aamp | n=10_000, m=50 | n=20_000, m=50 | n=50_000, m=50 |
|---|---|---|---|
| MAIN_BRANCH | 0.8 | 3 | 20 |
| MAIN_BRANCH-OPTIMIZED | 0.7 | 0.3 | 2 |
The OPTIMIZED version adds the following two functions to stumpy/core.py and uses them in stumpy/aamp.py:
# in core.py
@njit(fastmath=True)
def _calculate_P_norm(a, p=2.0):
    # a is array
    if p == 2.0:
        return np.sum(np.square(a))
    else:
        return np.linalg.norm(a, ord=p) ** p
@njit(fastmath=True)
def _rolling_update_P_norm(a, old, new, p=2.0):
    # a is array
    if p == 2.0:
        return a - np.square(old) + np.square(new)
    else:
        return a - np.absolute(old) ** p + np.absolute(new) ** p
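For illustration, here is a minimal usage sketch (assuming the two functions above are defined and compiled in the same session; the toy inputs are hypothetical) that sanity-checks the rolling update against the direct computation:

```python
import numpy as np

# toy inputs (hypothetical, just for the sanity check)
np.random.seed(0)
T_A = np.random.rand(64)
T_B = np.random.rand(64)
m = 8
p = 2.0

# p-norm (raised to the p-th power) of the first pair of subsequences
p_norm = _calculate_P_norm(T_B[:m] - T_A[:m], p)

# slide both windows forward by one element and update instead of recomputing
p_norm_next = _rolling_update_P_norm(
    p_norm,
    T_B[0] - T_A[0],  # term removed from the rolling sum
    T_B[m] - T_A[m],  # term added to the rolling sum
    p,
)

# the rolling update should match the direct computation
np.testing.assert_allclose(
    p_norm_next, _calculate_P_norm(T_B[1 : m + 1] - T_A[1 : m + 1], p)
)
```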
We may want to investigate the p==1 case as well and do the same, i.e. use np.sum(np.abs(...)).
p==1 and p==2 are the most common cases and it would be good to reduce computing time for these two cases.
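For instance, a hedged sketch of how a p==1 branch might be added to the first function (the second one could be extended analogously with np.absolute); this is just an illustration, not a final implementation:

```python
# in core.py (sketch only; the p == 1.0 branch is the new part)
@njit(fastmath=True)
def _calculate_P_norm(a, p=2.0):
    if p == 1.0:
        return np.sum(np.abs(a))
    elif p == 2.0:
        return np.sum(np.square(a))
    else:
        return np.linalg.norm(a, ord=p) ** p
```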
@seanlaw Do you think we should take care of this first? (It can help us get the computing time in top-k faster)
Sure! Might as well take care of p=1 as well if there is something there. Will all non-normalized functions use this?
I think it will be used in aamp, gpu_aamp, scraamp, aampi.
We also have the argument p in core._mass_absolute_distance_matrix. However, this function uses scipy.spatial.distance.cdist to calculate the p-norm. Hence, regarding this latter case, I think we can leave it as is since I think SciPy has already considered it in cdist (I have not checked the source of cdist yet).
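For reference, a minimal sketch (not taken from STUMPY) of how cdist exposes the p-norm; cdist also has dedicated metrics for the common cases, e.g. euclidean for p==2 and cityblock for p==1:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
A = rng.random((5, 8))
B = rng.random((6, 8))

D_minkowski = cdist(A, B, metric="minkowski", p=2)  # general p-norm
D_euclidean = cdist(A, B, metric="euclidean")       # specialized p == 2 metric
np.testing.assert_allclose(D_minkowski, D_euclidean)
```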
> I think it will be used in aamp, gpu_aamp, scraamp, aampi.

Where in gpu_aamp would it be used? Which function?
What about maamp? Let's make sure to double check all of the possible places where aamp exists
UPDATE
Please allow me first to provide a little info regarding the impact of the two proposed functions (see the first comment) on the computing time.
In what follows:
- func1 denotes _calculate_P_norm (proposed in the first comment in this issue)
- func2 denotes _rolling_update_P_norm (proposed in the first comment in this issue)
Note that func1 will be used when we want to calculate p_norm directly using all the elements of the subsequences, and func2 will be used when we want to calculate p_norm via a rolling approach (i.e. using the previously-calculated p_norm to calculate the p_norm for a new pair of subsequences).
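In other words (a restatement, using $P_{i,j}$ as shorthand for the p-th power of the p-norm between the length-$m$ subsequences starting at index $i$ in T_A and index $j$ in T_B):

$$P_{i,j} = \sum_{k=0}^{m-1} \left| T_B[j+k] - T_A[i+k] \right|^{p}, \qquad P_{i+1,\,j+1} = P_{i,j} - \left| T_B[j] - T_A[i] \right|^{p} + \left| T_B[j+m] - T_A[i+m] \right|^{p}$$

func1 evaluates the sum on the left directly, while func2 applies the rolling update on the right.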
In stumpy/aamp.py, we can see the following lines of code in _compute_diagonal:
if uint64_i == 0 or uint64_j == 0:
    p_norm = (
        np.linalg.norm(
            T_B[uint64_j : uint64_j + uint64_m]
            - T_A[uint64_i : uint64_i + uint64_m],
            ord=p,
        )
        ** p
    )
else:
    p_norm = np.abs(
        p_norm
        - np.absolute(T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1]) ** p
        + np.absolute(
            T_B[uint64_j + uint64_m - uint64_1]
            - T_A[uint64_i + uint64_m - uint64_1]
        )
        ** p
    )
Now, by using func1 and func2, we will have:
if uint64_i == 0 or uint64_j == 0:
    p_norm = func1(...)  # calculate p_norm directly using all elements of the subsequences
else:
    p_norm = func2(...)  # rolling approach: use previously-calculated p_norm to calculate new p_norm
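For concreteness, a hedged guess at what those call sites could look like; the exact arguments are an assumption based on the original block above, and the np.abs guard from the original else-branch is kept:

```python
# sketch only; assumes func1/func2 live in core.py as _calculate_P_norm/_rolling_update_P_norm
if uint64_i == 0 or uint64_j == 0:
    p_norm = core._calculate_P_norm(
        T_B[uint64_j : uint64_j + uint64_m] - T_A[uint64_i : uint64_i + uint64_m], p
    )
else:
    p_norm = np.abs(  # keep the guard against tiny negative values, as in the original code
        core._rolling_update_P_norm(
            p_norm,
            T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1],
            T_B[uint64_j + uint64_m - uint64_1] - T_A[uint64_i + uint64_m - uint64_1],
            p,
        )
    )
```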
Performance Tables
Note: The func1 and func2 in the tables below are slightly different from the ones proposed in the first comment, as they now take care of p==1 as well.
| aamp, m=50, p=1 | n=1000 | n=5000 | n=10000 | n=25000 | n=50000 |
|---|---|---|---|---|---|
| main | 0.003 | 0.046 | 0.186 | 1.16 | 5.58 |
| using func1 | 0.004 | 0.05 | 0.186 | 1.142 | 5.37 |
| using func1 and func2 | 0.002 | 0.023 | 0.09 | 0.56 | 2.19 |
| aamp, m=50, p=2 | n=1000 | n=5000 | n=10000 | n=25000 | n=50000 |
|---|---|---|---|---|---|
| main | 0.009 | 0.2 | 0.81 | 5.38 | 25.88 |
| using func1 | 0.132 | 0.21 | 0.81 | 5.11 | 25.13 |
| using func1 and func2 | 0.003 | 0.03 | 0.087 | 0.54 | 2.2 |
Note that func2 has more impact compared to func1. That is because we need to directly calculate p_norm for just a few cases; for most pairs of subsequences, we just use the rolling approach, i.e. func2, to calculate p_norm.
- The improvement in computing time for p=1 and n=50_000 is 60%.
- The improvement in computing time for p=2 and n=50_000 is 90%.
gpu_aamp
In gpu_aamp.py, we can see the following lines:
PART-I:
# in function gpu_aamp
for idx, start in enumerate(range(0, l, step)):
    stop = min(l, start + step)
    p_norm = np.power(core.mass_absolute(T_B[start : start + m], T_A, p=p), p)
    p_norm_first = np.power(core.mass_absolute(T_A[:m], T_B, p=p), p)
Here, we can investigate and see if we can improve the computing time. Note that we use core.mass_absolute, which calls SciPy's cdist, and it might already be fast (or okay).
PART-II:
# in function _compute_and_update_PI_kernel
if compute_p_norm:
    p_norm_out[j] = (
        p_norm_in[j - 1]
        - abs(T_B[i - 1] - T_A[j - 1]) ** p
        + abs(T_B[i + m - 1] - T_A[j + m - 1]) ** p
    )
I think this is similar to the updating process in _aamp, and thus we can use func2 to speed up the process when p==1 or p==2.
maamp
I can see the following cases in different parts of the code:
- core.mass_absolute(..., p=p)
- D = np.linalg.norm(disc_subseqs - disc_neighbors, axis=1, ord=p)
- np.power(core.mass_absolute(...))
I feel the three items above should be fine. But, still, we should investigate.
- we can also see the following block:
if idx % 2 == 0:
    # Even
    p_norm_even[i, j] = (
        p_norm_odd[i, j - 1]
        - abs(T[i, idx - 1] - T[i, j - 1]) ** p
        + abs(T[i, idx + m - 1] - T[i, j + m - 1]) ** p
    )
else:
    # Odd
    p_norm_odd[i, j] = (
        p_norm_even[i, j - 1]
        - abs(T[i, idx - 1] - T[i, j - 1]) ** p
        + abs(T[i, idx + m - 1] - T[i, j + m - 1]) ** p
    )
we may try to optimize it (this seems to be similar to the updating process).
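Assuming this maamp kernel is also an njit (CPU) function, here is a hedged sketch of how func2 might replace the even branch (the odd branch would be symmetric); the argument order follows the signature proposed in the first comment:

```python
# sketch only; func2 == core._rolling_update_P_norm from the first comment
if idx % 2 == 0:
    # Even
    p_norm_even[i, j] = core._rolling_update_P_norm(
        p_norm_odd[i, j - 1],
        T[i, idx - 1] - T[i, j - 1],          # term removed from the rolling sum
        T[i, idx + m - 1] - T[i, j + m - 1],  # term added to the rolling sum
        p,
    )
```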
For now, I just investigated aamp. So, I am not 100% sure about the things I said about gpu_aamp and maamp. Those are just my observations :)
@seanlaw Please let me know what you think. I already created a branch in my local repo, and created and tested the new functions. Do you want to find/discuss all possible places for this optimization opportunity here in this issue page? Or, do you prefer a step-by-step PR?
> Now, by using func1 and func2, we will have:

Great! I don't love rolling_update_p_norm as it isn't quite consistent with the other "rolling" functions that we've named.
> I think this is similar to the updating process in _aamp, and thus we can use func2 to speed up the process when p==1 or p==2.

But func2 targets CPUs, and so it cannot be used/called in _compute_and_update_PI_kernel, which targets GPUs.
> Please let me know what you think.

Hmmm, now that I think about it. Let's put this one on pause for now until we've wrapped up all of the other more important stuff like fixing precision issues. I'm not sure that this is an immediate "value add" for users while many of the other stability related things are a definite value-add!
> I don't love rolling_update_p_norm as it isn't quite consistent with the other "rolling" functions that we've named.

Yeah... I know! I tried (a little bit) to find a better name but couldn't.
> But func2 targets CPUs, and so it cannot be used/called in _compute_and_update_PI_kernel, which targets GPUs.

Right! ~The code is simple though and it should be easy to write it as a device function. (but, we are then repeating ourselves)~
> Hmmm, now that I think about it. Let's put this one on pause for now until we've wrapped up all of the other more important stuff like fixing precision issues. I'm not sure that this is an immediate "value add" for users while many of the other stability related things are a definite value-add!

Sure :)
What triggered this was that the computing time for aamp(..., p=2.0) was waaay more than stump, and that did not make sense to me. ~Because rolling cov doesn't seem to have a lower number of operations than rolling p_norm (when p==2.0), their running times should have been very close to each other.~
I will leave it here, and will try to resume recording the computing time for top-k (normalized=False).
@seanlaw I know you told me to leave this for now... but I think you will find this VERY interesting!!! (I do apologize for doing this last investigation 😄)
I realized that there is a huge difference in the computing time between p==2.0 and p==2 !!!
import numpy as np
from numba import njit

@njit(fastmath=True)
def my_func(a, p):
    n = len(a)
    x = 0
    for i in range(n):
        for j in range(n):
            x = x + (a[i] ** p) + (a[j] ** p)
    return x
And, the input:
seed = 0
np.random.seed(seed)
# n = 1000, 5000, 10000, 20000
a = np.random.rand(n)
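For reference, a hedged sketch of a timing harness along these lines; the exact harness used for the table below is not shown, so this is only an assumption about how such numbers could be reproduced:

```python
import time

# my_func and a are assumed to be defined as above
for p in (2.0, 2):  # case1: float exponent, case2: int exponent
    my_func(a, p)  # warm-up call so JIT compilation is not included in the timing
    start = time.time()
    my_func(a, p)
    print(f"p={p!r}: {time.time() - start:.3f} s")
```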
| | n=1000 | n=5000 | n=10000 | n=20000 |
|---|---|---|---|---|
| case1, p=2.0 | 0.036 | 0.77 | 3.03 | 12.05 |
| case2, p=2 | 0.008 | 0.14 | 0.52 | 2.04 |
| improvement | 78% | 82% | 83% | 83% |
I am not saying this is the only place that code can be optimized, but it seems that this is a very important thing!
Maybe, we do something similar to core.check_window_size(m, ...):
# in core.py
def check_order(p):
    if int(p) == p:
        return int(p)
    else:
        return p
| | p=1 | p=2 | p=3 | p=4 |
|---|---|---|---|---|
| p: float, n=20000 | 1.43 | 12.04 | 12.02 | 12.13 |
| p: int, n=20000 | 1.43 | 2.05 | 2.06 | 2.8 |
| improvement | 0% | 83% | 83% | 77% |
Btw, I posted a question on numba discourse: https://numba.discourse.group/t/strange-behavior-in-the-performance-of-operator-p/1658
@NimaSarajpoor That is interesting! It seems that raising a float to the power of another float is more computationally expensive than raising it to the power of an int.
If p is an integer, the compiler can unroll the loop inside pow(x, 2) to x*x. If p is a float, the compiler emits code that is valid for all possible float values; thereby not allowing many optimizations.
> If p is an integer, the compiler can unroll the loop inside pow(x, 2) to x*x. If p is a float, the compiler emits code that is valid for all possible float values; thereby not allowing many optimizations.

Very insightful! Thanks for sharing @JaKasb!
> Maybe, we do something similar to core.check_window_size(m, ...):

@NimaSarajpoor I think we may be able to use a built-in Python function:
def _check_order(p):
    if float(p).is_integer():
        return int(p)
    else:
        return p
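For example (a hedged usage sketch; where exactly this normalization would be applied, e.g. before the njit kernels are called, is still to be decided):

```python
p = _check_order(2.0)  # -> 2 (int), so pow(x, 2) can be optimized to x * x
p = _check_order(1.5)  # -> 1.5 (non-integer orders are left as floats)
```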