Optimize aamp for `p==2` (and maybe `p==1`)
Currently, aamp accepts the argument p (with a default of 2.0) to calculate the p-norm distance. According to code written by the SciPy community, it is better to use np.square(...) when p is 2.
Let's compare:
| aamp | n=10_000, m=50 | n=20_000, m=50 | n=50_000, m=50 |
|---|---|---|---|
| MAIN_BRANCH | 0.8 | 3 | 20 |
| MAIN_BRANCH-OPTIMIZED | 0.7 | 0.3 | 2 |
The OPTIMIZED version adds the following two functions to stumpy/core.py and uses them in stumpy/aamp.py:
# in core.py
@njit(fastmath=True)
def _calculate_P_norm(a, p=2.0):
    # a is array
    if p == 2.0:
        return np.sum(np.square(a))
    else:
        return np.linalg.norm(a, ord=p) ** p
@njit(fastmath=True)
def _rolling_update_P_norm(a, old, new, p=2.0):
    # a is array
    if p == 2.0:
        return a - np.square(old) + np.square(new)
    else:
        return a - np.absolute(old) ** p + np.absolute(new) ** p
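For illustration, here is a minimal usage sketch (assuming the two functions above are defined and compiled in the same session; the toy inputs are hypothetical) that sanity-checks the rolling update against the direct computation:

```python
import numpy as np

# toy inputs (hypothetical, just for the sanity check)
np.random.seed(0)
T_A = np.random.rand(64)
T_B = np.random.rand(64)
m = 8
p = 2.0

# p-norm (raised to the p-th power) of the first pair of subsequences
p_norm = _calculate_P_norm(T_B[:m] - T_A[:m], p)

# slide both windows forward by one element and update instead of recomputing
p_norm_next = _rolling_update_P_norm(
    p_norm,
    T_B[0] - T_A[0],  # term removed from the rolling sum
    T_B[m] - T_A[m],  # term added to the rolling sum
    p,
)

# the rolling update should match the direct computation
np.testing.assert_allclose(
    p_norm_next, _calculate_P_norm(T_B[1 : m + 1] - T_A[1 : m + 1], p)
)
```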
We may want to investigate the p==1 case as well and do the same, i.e. use np.sum(np.abs(...)).
p==1 and p==2 are the most common cases and it would be good to reduce computing time for these two cases.
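For instance, a hedged sketch of how a p==1 branch might be added to the first function (the second one could be extended analogously with np.absolute); this is just an illustration, not a final implementation:

```python
# in core.py (sketch only; the p == 1.0 branch is the new part)
@njit(fastmath=True)
def _calculate_P_norm(a, p=2.0):
    if p == 1.0:
        return np.sum(np.abs(a))
    elif p == 2.0:
        return np.sum(np.square(a))
    else:
        return np.linalg.norm(a, ord=p) ** p
```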
@seanlaw Do you think we should take care of this first? (It can help us get the computing time in top-k faster)
Sure! Might as well take care of p=1 as well if there is something there. Will all non-normalized functions use this?
I think it will be used in aamp, gpu_aamp, scraamp, aampi.
We also have the argument p in core._mass_absolute_distance_matrix. However, this function uses scipy.spatial.distance.cdist to calculate the p-norm. Hence, regarding this latter case, I think we can leave it as is since I think SciPy has already considered it in cdist (I have not checked the source of cdist yet).
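For reference, a minimal sketch (not taken from STUMPY) of how cdist exposes the p-norm; cdist also has dedicated metrics for the common cases, e.g. euclidean for p==2 and cityblock for p==1:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
A = rng.random((5, 8))
B = rng.random((6, 8))

D_minkowski = cdist(A, B, metric="minkowski", p=2)  # general p-norm
D_euclidean = cdist(A, B, metric="euclidean")       # specialized p == 2 metric
np.testing.assert_allclose(D_minkowski, D_euclidean)
```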
> I think it will be used in aamp, gpu_aamp, scraamp, aampi.

Where in gpu_aamp would it be used? Which function?
What about maamp? Let's make sure to double check all of the possible places where aamp exists
UPDATE
Please allow me first to provide a little info regarding the impact of the two proposed functions (see the first comment) on the computing time.
In what follows:
- func1 denotes _calculate_P_norm (proposed in the first comment in this issue)
- func2 denotes _rolling_update_P_norm (proposed in the first comment in this issue)
Note that func1 will be used when we want to calculate p_norm directly using all the elements of the subsequences, and func2 will be used when we want to calculate p_norm via a rolling approach (i.e. using the previously-calculated p_norm to calculate the p_norm for a new pair of subsequences).
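In other words (a restatement, using $P_{i,j}$ as shorthand for the p-th power of the p-norm between the length-$m$ subsequences starting at index $i$ in T_A and index $j$ in T_B):

$$P_{i,j} = \sum_{k=0}^{m-1} \left| T_B[j+k] - T_A[i+k] \right|^{p}, \qquad P_{i+1,\,j+1} = P_{i,j} - \left| T_B[j] - T_A[i] \right|^{p} + \left| T_B[j+m] - T_A[i+m] \right|^{p}$$

func1 evaluates the sum on the left directly, while func2 applies the rolling update on the right.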
In stumpy/aamp.py, we can see the following lines of code in _compute_diagonal:
if uint64_i == 0 or uint64_j == 0:
    p_norm = (
        np.linalg.norm(
            T_B[uint64_j : uint64_j + uint64_m]
            - T_A[uint64_i : uint64_i + uint64_m],
            ord=p,
        )
        ** p
    )
else:
    p_norm = np.abs(
        p_norm
        - np.absolute(T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1]) ** p
        + np.absolute(
            T_B[uint64_j + uint64_m - uint64_1]
            - T_A[uint64_i + uint64_m - uint64_1]
        )
        ** p
    )
Now, by using func1 and func2, we will have:
if uint64_i == 0 or uint64_j == 0:
    p_norm = func1(...)  # calculate p_norm directly using all elements of the subsequences
else:
    p_norm = func2(...)  # rolling approach: use previously-calculated p_norm to calculate new p_norm
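For concreteness, a hedged guess at what those call sites could look like; the exact arguments are an assumption based on the original block above, and the np.abs guard from the original else-branch is kept:

```python
# sketch only; assumes func1/func2 live in core.py as _calculate_P_norm/_rolling_update_P_norm
if uint64_i == 0 or uint64_j == 0:
    p_norm = core._calculate_P_norm(
        T_B[uint64_j : uint64_j + uint64_m] - T_A[uint64_i : uint64_i + uint64_m], p
    )
else:
    p_norm = np.abs(  # keep the guard against tiny negative values, as in the original code
        core._rolling_update_P_norm(
            p_norm,
            T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1],
            T_B[uint64_j + uint64_m - uint64_1] - T_A[uint64_i + uint64_m - uint64_1],
            p,
        )
    )
```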
Performance Tables
Note: The func1 and func2 in the tables below are slightly different from the ones proposed in the first comment, as they now take care of p==1 as well.
| aamp, m=50, p=1 | n=1000 | n=5000 | n=10000 | n=25000 | n=50000 |
|---|---|---|---|---|---|
| main | 0.003 | 0.046 | 0.186 | 1.16 | 5.58 |
| using func1 | 0.004 | 0.05 | 0.186 | 1.142 | 5.37 |
| using func1 and func2 | 0.002 | 0.023 | 0.09 | 0.56 | 2.19 |
| aamp, m=50, p=2 | n=1000 | n=5000 | n=10000 | n=25000 | n=50000 |
|---|---|---|---|---|---|
| main | 0.009 | 0.2 | 0.81 | 5.38 | 25.88 |
| using func1 | 0.132 | 0.21 | 0.81 | 5.11 | 25.13 |
| using func1 and func2 | 0.003 | 0.03 | 0.087 | 0.54 | 2.2 |
Note that func2 has more impact compared to func1. That is because we need to directly calculate p_norm for just a few cases; for most pairs of subsequences, we just use the rolling approach, i.e. func2, to calculate p_norm.
- The improvement in computing time for p=1 and n=50_000 is 60%.
- The improvement in computing time for p=2 and n=50_000 is 90%.
gpu_aamp
In gpu_aamp.py, we can see the following lines:
PART-I:
# in function gpu_aamp
for idx, start in enumerate(range(0, l, step)):
    stop = min(l, start + step)
    p_norm = np.power(core.mass_absolute(T_B[start : start + m], T_A, p=p), p)
    p_norm_first = np.power(core.mass_absolute(T_A[:m], T_B, p=p), p)
Here, we can investigate and see if we can improve the computing time. Note that we use core.mass_absolute, which calls SciPy's cdist, and it might already be fast (or okay).
PART-II:
# in function _compute_and_update_PI_kernel
if compute_p_norm:
    p_norm_out[j] = (
        p_norm_in[j - 1]
        - abs(T_B[i - 1] - T_A[j - 1]) ** p
        + abs(T_B[i + m - 1] - T_A[j + m - 1]) ** p
    )
I think this is similar to the updating process in _aamp, and thus we can use func2 to speed up the process when p==1 or p==2.
maamp
I can see the following cases in different parts of the code:
- core.mass_absolute(..., p=p)
- D = np.linalg.norm(disc_subseqs - disc_neighbors, axis=1, ord=p)
- np.power(core.mass_absolute(...))
I feel the three items above should be fine. But, still, we should investigate.
- we can also see the following block:
if idx % 2 == 0:
    # Even
    p_norm_even[i, j] = (
        p_norm_odd[i, j - 1]
        - abs(T[i, idx - 1] - T[i, j - 1]) ** p
        + abs(T[i, idx + m - 1] - T[i, j + m - 1]) ** p
    )
else:
    # Odd
    p_norm_odd[i, j] = (
        p_norm_even[i, j - 1]
        - abs(T[i, idx - 1] - T[i, j - 1]) ** p
        + abs(T[i, idx + m - 1] - T[i, j + m - 1]) ** p
    )
we may try to optimize it (this seems to be similar to the updating process).
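Assuming this maamp kernel is also an njit (CPU) function, here is a hedged sketch of how func2 might replace the even branch (the odd branch would be symmetric); the argument order follows the signature proposed in the first comment:

```python
# sketch only; func2 == core._rolling_update_P_norm from the first comment
if idx % 2 == 0:
    # Even
    p_norm_even[i, j] = core._rolling_update_P_norm(
        p_norm_odd[i, j - 1],
        T[i, idx - 1] - T[i, j - 1],          # term removed from the rolling sum
        T[i, idx + m - 1] - T[i, j + m - 1],  # term added to the rolling sum
        p,
    )
```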
For now, I just investigated aamp. So, I am not 100% sure about the things I said about gpu_aamp and maamp. Those are just my observations :)
@seanlaw Please let me know what you think. I already created a branch in my local repo, and created and tested the new functions. Do you want to find/discuss all possible places for this optimization opportunity here in this issue page? Or, do you prefer a step-by-step PR?
> Now, by using func1 and func2, we will have:

Great! I don't love rolling_update_p_norm as it isn't quite consistent with the other "rolling" functions that we've named.
> I think this is similar to the updating process in _aamp, and thus we can use func2 to speed up the process when p==1 or p==2.

But func2 targets CPUs, and so it cannot be used/called in _compute_and_update_PI_kernel, which targets GPUs.
> Please let me know what you think.

Hmmm, now that I think about it. Let's put this one on pause for now until we've wrapped up all of the other more important stuff like fixing precision issues. I'm not sure that this is an immediate "value add" for users while many of the other stability related things are a definite value-add!
> I don't love rolling_update_p_norm as it isn't quite consistent with the other "rolling" functions that we've named.

Yeah... I know! I tried (a little bit) to find a better name but couldn't.
> But func2 targets CPUs, and so it cannot be used/called in _compute_and_update_PI_kernel, which targets GPUs.

Right! ~The code is simple though and it should be easy to write it as a device function. (but, we are then repeating ourselves)~
> Hmmm, now that I think about it. Let's put this one on pause for now until we've wrapped up all of the other more important stuff like fixing precision issues. I'm not sure that this is an immediate "value add" for users while many of the other stability related things are a definite value-add!

Sure :)
What triggered this was that the computing time for aamp(..., p=2.0) was waaay more than stump, and that did not make sense to me. ~Because rolling cov doesn't seem to have a lower number of operations than rolling p_norm (when p==2.0), their running times should have been very close to each other.~
I will leave it here, and will try to resume recording the computing time for top-k (normalized=False).
@seanlaw I know you told me to leave this for now... but I think you will find this VERY interesting!!! (I do apologize for doing this last investigation 😄)
I realized that there is a huge difference in the computing time between p==2.0 and p==2 !!!
import numpy as np
from numba import njit

@njit(fastmath=True)
def my_func(a, p):
    n = len(a)
    x = 0
    for i in range(n):
        for j in range(n):
            x = x + (a[i] ** p) + (a[j] ** p)
    return x
And, the input:
seed = 0
np.random.seed(seed)
# n = 1000, 5000, 10000, 20000
a = np.random.rand(n)
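For reference, a hedged sketch of a timing harness along these lines; the exact harness used for the table below is not shown, so this is only an assumption about how such numbers could be reproduced:

```python
import time

# my_func and a are assumed to be defined as above
for p in (2.0, 2):  # case1: float exponent, case2: int exponent
    my_func(a, p)  # warm-up call so JIT compilation is not included in the timing
    start = time.time()
    my_func(a, p)
    print(f"p={p!r}: {time.time() - start:.3f} s")
```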
| | n=1000 | n=5000 | n=10000 | n=20000 |
|---|---|---|---|---|
| case1, p=2.0 | 0.036 | 0.77 | 3.03 | 12.05 |
| case2, p=2 | 0.008 | 0.14 | 0.52 | 2.04 |
| improvement | 78% | 82% | 83% | 83% |
I am not saying this is the only place that code can be optimized, but it seems that this is a very important thing!
Maybe, we do something similar to core.check_window_size(m, ...):
# in core.py
def check_order(p):
    if int(p) == p:
        return int(p)
    else:
        return p
| | p=1 | p=2 | p=3 | p=4 |
|---|---|---|---|---|
| p: float, n=20000 | 1.43 | 12.04 | 12.02 | 12.13 |
| p: int, n=20000 | 1.43 | 2.05 | 2.06 | 2.8 |
| improvement | 0% | 83% | 83% | 77% |
Btw, I posted a question on numba discourse: https://numba.discourse.group/t/strange-behavior-in-the-performance-of-operator-p/1658
@NimaSarajpoor That is interesting! It seems that raising a float to the power of another float is more computationally expensive than raising it to the power of an int.
If p is an integer, the compiler can unroll the loop inside pow(x, 2) to x*x. If p is a float, the compiler emits code that is valid for all possible float values; thereby not allowing many optimizations.
> If p is an integer, the compiler can unroll the loop inside pow(x, 2) to x*x. If p is a float, the compiler emits code that is valid for all possible float values; thereby not allowing many optimizations.

Very insightful! Thanks for sharing @JaKasb!
> Maybe, we do something similar to core.check_window_size(m, ...):

@NimaSarajpoor I think we may be able to use a built-in Python function:
def _check_order(p):
    if float(p).is_integer():
        return int(p)
    else:
        return p
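For example (a hedged usage sketch; where exactly this normalization would be applied, e.g. before the njit kernels are called, is still to be decided):

```python
p = _check_order(2.0)  # -> 2 (int), so pow(x, 2) can be optimized to x * x
p = _check_order(1.5)  # -> 1.5 (non-integer orders are left as floats)
```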