stumpy
Optimize aamp for `p==2` (and maybe `p==1`)
Currently, `aamp` accepts the argument `p` (with default `2.0`) to calculate the p-norm distance. According to code written by the SciPy community, it is better to use `np.square(...)` when `p` is 2.
Let's compare:

aamp | n=10_000, m=50 | n=20_000, m=50 | n=50_000, m=50 |
---|---|---|---|
MAIN_BRANCH | 0.8 | 3 | 20 |
MAIN_BRANCH-OPTIMIZED | 0.7 | 0.3 | 2 |
The OPTIMIZED version adds the following two functions to `stumpy/core.py` and uses them in `stumpy/aamp.py`:
```python
# in core.py
import numpy as np
from numba import njit


@njit(fastmath=True)
def _calculate_P_norm(a, p=2.0):
    # a is an array
    if p == 2.0:
        return np.sum(np.square(a))
    else:
        return np.linalg.norm(a, ord=p) ** p


@njit(fastmath=True)
def _rolling_update_P_norm(a, old, new, p=2.0):
    # a is an array
    if p == 2.0:
        return a - np.square(old) + np.square(new)
    else:
        return a - np.absolute(old) ** p + np.absolute(new) ** p
```
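As a quick sanity check (a standalone, non-jitted sketch, not stumpy code; `calculate_p_norm` is a hypothetical stand-in for the proposed helper), the `p == 2.0` fast path agrees with the general `np.linalg.norm` path:

```python
import numpy as np


def calculate_p_norm(a, p=2.0):
    # Simplified, non-jitted stand-in for the proposed _calculate_P_norm
    if p == 2.0:
        return np.sum(np.square(a))  # fast path: avoids pow() with a float exponent
    return np.linalg.norm(a, ord=p) ** p


a = np.array([3.0, -4.0, 12.0])
fast = calculate_p_norm(a, p=2.0)
general = np.linalg.norm(a, ord=2.0) ** 2.0
# Both paths compute sum(|a_i|^2) = 9 + 16 + 144 = 169 (up to rounding)
```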
We may want to investigate the `p==1` case as well and do the same, i.e. we may do `np.sum(np.abs(...))`. `p==1` and `p==2` are the most common cases and it would be good to reduce the computing time for these two cases.
@seanlaw Do you think we should take care of this first? (It can help us reduce the computing time in top-k)
Sure! Might as well take care of p=1 as well if there is something there. Will all non-normalized functions use this?
I think it will be used in `aamp`, `gpu_aamp`, `scraamp`, and `aampi`.

We also have the argument `p` in `core._mass_absolute_distance_matrix`. However, this function uses `scipy.spatial.distance.cdist` to calculate the p-norm. Hence, regarding this latter case, I think we can leave it as is since I think scipy has already considered it in `cdist` (I have not checked the source of `cdist` yet).
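For reference, a minimal standalone check (not stumpy code) of what `cdist` computes with the `minkowski` metric; note that it returns the p-norm distance itself, not the distance raised to the p-th power:

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1.0, 2.0, 3.0]])
Y = np.array([[4.0, 6.0, 3.0]])

# cdist returns the p-norm distance (not raised to the p-th power),
# which is why stumpy-style code applies np.power(..., p) afterwards
d = cdist(X, Y, metric="minkowski", p=3)[0, 0]
manual = np.sum(np.abs(X - Y) ** 3) ** (1.0 / 3)
# |X - Y| = [3, 4, 0], so d**3 = 27 + 64 + 0 = 91
```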
> I think it will be used in aamp, gpu_aamp, scraamp, aampi.

Where in `gpu_aamp` would it be used? Which function? What about `maamp`? Let's make sure to double check all of the possible places where `aamp` exists.
UPDATE

Please allow me first to provide a little info regarding the impact of the two proposed functions (see the first comment) on the computing time. In what follows:

- `func1` denotes `_calculate_P_norm` (proposed in the first comment of this issue)
- `func2` denotes `_rolling_update_P_norm` (proposed in the first comment of this issue)
Note that `func1` will be used when we want to calculate `p_norm` directly using all the elements of the subsequences, and `func2` will be used when we want to calculate `p_norm` via the rolling approach (i.e. using the previously-calculated `p_norm` to calculate `p_norm` for a new pair of subsequences).
In `stumpy/aamp.py`, we can see the following lines of code in `_compute_diagonal`:
```python
if uint64_i == 0 or uint64_j == 0:
    p_norm = (
        np.linalg.norm(
            T_B[uint64_j : uint64_j + uint64_m]
            - T_A[uint64_i : uint64_i + uint64_m],
            ord=p,
        )
        ** p
    )
else:
    p_norm = np.abs(
        p_norm
        - np.absolute(T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1]) ** p
        + np.absolute(
            T_B[uint64_j + uint64_m - uint64_1]
            - T_A[uint64_i + uint64_m - uint64_1]
        )
        ** p
    )
```
Now, by using `func1` and `func2`, we will have:
```python
if uint64_i == 0 or uint64_j == 0:
    # calculate p_norm directly using all elements of the subsequence
    p_norm = func1(...)
else:
    # rolling approach: use previously-calculated p_norm to calculate the new p_norm
    p_norm = func2(...)
```
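The correctness of the rolling update can be checked with a small standalone numpy sketch; `func1` and `func2` here are simplified, non-jitted stand-ins for the proposed helpers:

```python
import numpy as np


def func1(diff, p=2.0):
    # direct p-norm (raised to the p-th power) of an array of differences
    return np.sum(np.abs(diff) ** p)


def func2(p_norm, old, new, p=2.0):
    # rolling update: drop the oldest difference, add the newest one
    return p_norm - np.abs(old) ** p + np.abs(new) ** p


np.random.seed(0)
T_A = np.random.rand(20)
T_B = np.random.rand(20)
m = 5

# p-norm for the pair of subsequences starting at index 0
p_norm = func1(T_B[0:m] - T_A[0:m])
# roll along the diagonal to the pair starting at index 1
rolled = func2(p_norm, T_B[0] - T_A[0], T_B[m] - T_A[m])
direct = func1(T_B[1 : 1 + m] - T_A[1 : 1 + m])
# rolled and direct agree up to floating-point rounding
```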
Tables of Performances

Note: The `func1` and `func2` in the tables below are slightly different from the ones proposed in the first comment, as they now take care of `p==1` as well.
aamp, m=50, p=1 | n=1000 | n=5000 | n=10000 | n=25000 | n=50000 |
---|---|---|---|---|---|
main | 0.003 | 0.046 | 0.186 | 1.16 | 5.58 |
using func1 | 0.004 | 0.05 | 0.186 | 1.142 | 5.37 |
using func1 and func2 | 0.002 | 0.023 | 0.09 | 0.56 | 2.19 |
aamp, m=50, p=2 | n=1000 | n=5000 | n=10000 | n=25000 | n=50000 |
---|---|---|---|---|---|
main | 0.009 | 0.2 | 0.81 | 5.38 | 25.88 |
using func1 | 0.132 | 0.21 | 0.81 | 5.11 | 25.13 |
using func1 and func2 | 0.003 | 0.03 | 0.087 | 0.54 | 2.2 |
Note that `func2` has more impact compared to `func1`. That is because we need to directly calculate `p_norm` for just a few cases; for most pairs of subsequences, we just use the rolling approach, i.e. `func2`, to calculate `p_norm`.
- The improvement in computing time for `p=1` and `n=50_000` is 60%
- The improvement in computing time for `p=2` and `n=50_000` is 90%
gpu_aamp

In `gpu_aamp.py`, we can see the following lines:
PART-I:

```python
# in function gpu_aamp
for idx, start in enumerate(range(0, l, step)):
    stop = min(l, start + step)

    p_norm = np.power(core.mass_absolute(T_B[start : start + m], T_A, p=p), p)
    p_norm_first = np.power(core.mass_absolute(T_A[:m], T_B, p=p), p)
```
Here, we can investigate and see if we can improve the computing time. Note that we use `core.mass_absolute`, which calls scipy's `cdist`, and it might already be fast (or okay).
PART-II:

```python
# in function _compute_and_update_PI_kernel
if compute_p_norm:
    p_norm_out[j] = (
        p_norm_in[j - 1]
        - abs(T_B[i - 1] - T_A[j - 1]) ** p
        + abs(T_B[i + m - 1] - T_A[j + m - 1]) ** p
    )
```
I think this is similar to the updating process in `_aamp`, and thus we can use `func2` to speed up the process when `p==1` or `p==2`.
maamp

I can see the following cases in different parts of the code:

- `core.mass_absolute(..., p=p)`
- `D = np.linalg.norm(disc_subseqs - disc_neighbors, axis=1, ord=p)`
- `np.power(core.mass_absolute(...))`

I feel the three items above should be fine. But, still, we should investigate.
We can also see the following block:

```python
if idx % 2 == 0:
    # Even
    p_norm_even[i, j] = (
        p_norm_odd[i, j - 1]
        - abs(T[i, idx - 1] - T[i, j - 1]) ** p
        + abs(T[i, idx + m - 1] - T[i, j + m - 1]) ** p
    )
else:
    # Odd
    p_norm_odd[i, j] = (
        p_norm_even[i, j - 1]
        - abs(T[i, idx - 1] - T[i, j - 1]) ** p
        + abs(T[i, idx + m - 1] - T[i, j + m - 1]) ** p
    )
```
We may try to optimize it (this seems to be similar to the updating process).
For now, I have just investigated `aamp`. So, I am not 100% sure about the things I said about `gpu_aamp` and `maamp`. Those are just my observations :)
@seanlaw Please let me know what you think. I already created a branch in my local repo, and created and tested the new functions. Do you want to find/discuss all possible places for this optimization opportunity here in this issue page? Or, do you prefer a step-by-step PR?
> Now, by using func1 and func2, we will have:
Great! I don't love `rolling_update_p_norm` as it isn't quite consistent with the other "rolling" functions that we've named.
> I think this is similar to the updating process in _aamp, and thus we can use func2 to speed up the process when p==1 or p==2.
But `func2` targets CPUs and so it cannot be used/called in `_compute_and_update_PI_kernel`, which targets GPUs.
> Please let me know what you think.
Hmmm, now that I think about it. Let's put this one on pause for now until we've wrapped up all of the other more important stuff like fixing precision issues. I'm not sure that this is an immediate "value add" for users while many of the other stability related things are a definite value-add!
> I don't love rolling_update_p_norm as it isn't quite consistent with the other "rolling" functions that we've named.
Yeah...I know! I tried (a little bit) to find a better name but couldn't.
> But func2 targets CPUs and so it cannot be used/called in _compute_and_update_PI_kernel, which targets GPUs
Right! ~~The code is simple though and it should be easy to write it as a device function (but, we are then repeating ourselves).~~
> Hmmm, now that I think about it. Let's put this one on pause for now until we've wrapped up all of the other more important stuff like fixing precision issues. I'm not sure that this is an immediate "value add" for users while many of the other stability related things are a definite value-add!
Sure :)
The thing that triggered my mind was that the computing time for `aamp(..., p=2.0)` was waaay more than `stump`, and that did not make sense to me. ~~Because rolling `cov` doesn't seem to have a lower number of operations than rolling `p_norm` (when `p==2.0`), their running times should have been very close to each other.~~
I will leave it here, and will try to resume recording the computing time for top-k (`normalized=False`).
@seanlaw I know you told me to leave this for now... but I think you will find this VERY interesting!!! (I do apologize for doing this last investigation 😄)
I realized that there is a huge difference in the computing time between `p==2.0` and `p==2`!!!
```python
@njit(fastmath=True)
def my_func(a, p):
    n = len(a)
    x = 0
    for i in range(n):
        for j in range(n):
            x = x + (a[i] ** p) + (a[j] ** p)
    return x
```
And, the input:

```python
seed = 0
np.random.seed(seed)

# n = 1000, 5000, 10000, 20000
a = np.random.rand(n)
```
 | n=1000 | n=5000 | n=10000 | n=20000 |
---|---|---|---|---|
case1, p=2.0 | 0.036 | 0.77 | 3.03 | 12.05 |
case2, p=2 | 0.008 | 0.14 | 0.52 | 2.04 |
improvement | 78% | 82% | 83% | 83% |
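As a pure-Python sanity check (no numba; a sketch only, since the timing gap itself only appears in compiled code), the integer and float exponents produce the same numerical result, so the speedup comes purely from compiler optimization:

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(100)

# float exponent: compiled code must go through a general pow() routine
x_float = np.sum(a ** 2.0)
# int exponent: the compiler is free to unroll pow(x, 2) into x * x
x_int = np.sum(a ** 2)
# the two results agree up to floating-point rounding
```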
I am not saying this is the only place that code can be optimized, but it seems that this is a very important thing!
Maybe, we do something similar to `core.check_window_size(m, ...)`:

```python
# in core.py
def check_order(p):
    if int(p) == p:
        return int(p)
    else:
        return p
```
 | p=1 | p=2 | p=3 | p=4 |
---|---|---|---|---|
p: float, n=20000 | 1.43 | 12.04 | 12.02 | 12.13 |
p: int, n=20000 | 1.43 | 2.05 | 2.06 | 2.8 |
improvement | 0% | 83% | 83% | 77% |
Btw, I posted a question on numba discourse: https://numba.discourse.group/t/strange-behavior-in-the-performance-of-operator-p/1658
@NimaSarajpoor That is interesting! It seems that raising a `float` to the power of another `float` is more computationally expensive than raising it to the power of an `int`.
If p is an integer, the compiler can unroll the loop inside pow(x, 2) to x*x. If p is a float, the compiler emits code that is valid for all possible float values, thereby not allowing many optimizations.
> If p is an integer, the compiler can unroll the loop inside pow(x, 2) to x*x. If p is a float, the compiler emits code that is valid for all possible float values, thereby not allowing many optimizations.
Very insightful! Thanks for sharing @JaKasb!
> Maybe, we do something similar to core.check_window_size(m, ...):
@NimaSarajpoor I think we may be able to use a built-in Python function:
```python
def _check_order(p):
    if float(p).is_integer():
        return int(p)
    else:
        return p
```