
[WIP] Unsigned Integer Indexing Optimizations (#658)

Open alvii147 opened this issue 1 year ago • 14 comments

Added my first attempts at optimizing stump using unsigned integer indexing. I added a temporary stumpy.stump_uint module, which (if successfully optimized) may replace stump. But I haven't had much luck so far; maybe someone could help me out here. Here's what I've done and the results:

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement |
| --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00966 | 0.01066 | -10.35% |
| 2500 | 10 | 0.02866 | 0.02933 | -2.323% |
| 5000 | 10 | 0.079 | 0.095 | -20.252% |
| 10000 | 100 | 0.27566 | 0.30233 | -9.674% |
| 100000 | 100 | 26.511 | 27.163 | -2.458% |
| 100000 | 5000 | 25.706 | 27.825 | -8.245% |
| 200000 | 100 | 112.324 | 108.871 | 3.074% |
| 200000 | 5000 | 105.614 | 104.584 | 0.976% |
| 250000 | 100 | 169.058 | 169.176 | -0.07% |

Doesn't look like there's much of a difference just yet, but I'm nearly certain this is just me optimizing the wrong parts. Let me know if you have feedback, feel free to give this a shot yourself.

alvii147 avatar Aug 19 '22 22:08 alvii147

Oh, and here's a diff based on what I've done, since these are different files and GitHub won't show you a diff.

(screenshot: diff, part 1)

(screenshot: diff, part 2)

alvii147 avatar Aug 19 '22 22:08 alvii147

Thanks @alvii147!

@NimaSarajpoor Do you have any time to take a first pass and review? Right now, it doesn't look like there is any speed up

seanlaw avatar Aug 20 '22 00:08 seanlaw

@alvii147 Can you fix the formatting so that the black checks pass? Btw, I don't see anything obvious that you are doing wrong 😢

seanlaw avatar Aug 20 '22 13:08 seanlaw

@alvii147 @seanlaw So, I took a look at the code and worked with it a little bit. I tried different things but didn't see improvement! ~I have a couple of comments that I am going to share with you.~

NimaSarajpoor avatar Aug 20 '22 15:08 NimaSarajpoor

@seanlaw @alvii147 Okay...so I had some comments but apparently I was wrong. I am doing some investigation. Will update you if I find something.

NimaSarajpoor avatar Aug 20 '22 16:08 NimaSarajpoor

@alvii147 Can you fix the formatting so that the black checks pass?

@seanlaw done! Also temporarily added test_stump_uint.py which is really just a copy of test_stump.py.

I am doing some investigation. Will update you if I find something.

@NimaSarajpoor thanks! Let me know what you find!

alvii147 avatar Aug 20 '22 19:08 alvii147

Codecov Report

Base: 99.89% // Head: 99.89% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (fa9c113) compared to base (208b796). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #659   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11453    11460    +7     
=======================================
+ Hits        11441    11448    +7     
  Misses         12       12           
Impacted Files Coverage Δ
stumpy/aamp.py 100.00% <100.00%> (ø)
stumpy/stump.py 100.00% <100.00%> (ø)


:umbrella: View full report at Codecov.

codecov-commenter avatar Aug 20 '22 19:08 codecov-commenter

This is what I got:

(plot: uint_performance)

And,

(plot: uint_performance)

So, it seems there is an improvement when the length of T is small. However, as T gets longer, the gap widens and np.int64 performs better. Maybe this helps us understand what is going on. It is worth mentioning that this conclusion differs slightly from what one may infer from the table provided above: according to the table, there is a small improvement when T becomes longer, which is not the case in my results. @alvii147: Did you use a fixed seed?


# performance code
import time

import numpy as np
import stumpy

# set `seed` and `m` beforehand
# n: a list of values indicating the length of `T`
stump_ComputeTime = []
stump_uint_ComputeTime = []

for n_T in n:
    np.random.seed(seed)
    T = np.random.rand(n_T)

    local_lst = []
    for _ in range(3):
        tic = time.time()
        stumpy.stump(T, m)
        toc = time.time()
        local_lst.append(toc - tic)

    # skip the first run (it includes numba JIT compilation) and average the rest
    stump_ComputeTime.append(np.mean(local_lst[1:]))

    #######################

    # we do not need to re-create `T`; I just repeated it here
    np.random.seed(seed)
    T = np.random.rand(n_T)

    local_lst = []
    for _ in range(3):
        tic = time.time()
        stumpy.stump_uint(T, m)
        toc = time.time()
        local_lst.append(toc - tic)

    stump_uint_ComputeTime.append(np.mean(local_lst[1:]))


I also tried the following changes but they did not help:

  • use np.uint64 instead of numba.uint64 (I wanted to make it similar to stump_tiles)
  • move m=np.uint64(m) to the outside of the outer-loop
  • just keep ik = np.uint64(i + k)

NimaSarajpoor avatar Aug 20 '22 21:08 NimaSarajpoor

according to the table, there is a little improvement when T becomes longer

@NimaSarajpoor I think the differences in that table are too insignificant to conclude that one takes longer than the other. If anything, I think it shows that it's pretty much the same.

Did you use a fixed seed?

No, I've just been randomly generating time series and executing stump and stump_uint 3-4 times each, then taking the mean of the execution times (this is probably a good place for a t-based significance test).
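To sketch the significance-test idea, here is a minimal Welch's t-statistic computed by hand on two hypothetical timing samples (the sample values and names below are made up purely for illustration, not real stump measurements):

```python
import math
import random

def welch_t(a, b):
    """Welch's t-statistic for two independent samples of run times.
    A |t| well above ~2 suggests the mean run times genuinely differ."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# hypothetical timings (seconds) for stump vs. stump_uint on the same input
random.seed(0)
stump_times = [0.100 + random.gauss(0, 0.005) for _ in range(10)]
uint_times = [0.080 + random.gauss(0, 0.005) for _ in range(10)]

t = welch_t(stump_times, uint_times)  # clearly separated samples give a large |t|
```

With overlapping samples, |t| stays near zero and the observed difference is likely noise.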

@NimaSarajpoor your plots show that we're probably not utilizing the uint trick properly. Either that, or there is not much room for improvement for stump. But it's too early to surrender, we can keep trying, I'll try to do this step by step and see which change takes more time than others.

Thanks a lot for your help!

alvii147 avatar Aug 20 '22 22:08 alvii147

@alvii147 @seanlaw Okay...I got something... And, I hope you can work on it to see if we can make it better / cleaner:

(plot: uint_performance)

And,

(plot: uint_performance_longer)



# stump_uint_v3

for diag_idx in range(diags_start_idx, diags_stop_idx):
    k = diags[diag_idx]

    if k >= 0:
        # USED range, and REMOVED dtype=np.uint64
        iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
    else:
        iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))

    m = np.uint64(m)
    for i in iter_range:
        ii = np.uint64(i)
        ik = np.uint64(i + k)
        if ii == 0 or ik == 0:
            cov = (
                np.dot((T_B[ik : ik + m] - M_T[ik]), (T_A[ii : ii + m] - μ_Q[ii]))
                * m_inverse
            )
        else:
            ###

I think the main factor in improvement was using range(...) instead of np.arange(....., dtype=np.uint64).
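To make the change concrete, here is a toy before/after comparison (the values of n and k are arbitrary and not tied to any real stump input); both loops visit exactly the same indices:

```python
import numpy as np

n, k = 10, 3

# before: np.arange materializes a full uint64 index array on every diagonal
idx_array = np.arange(0, n - k, dtype=np.uint64)
total_before = sum(int(i) for i in idx_array)

# after: a lazy Python range, casting only the scalar index actually used
total_after = 0
for i in range(0, n - k):
    ii = np.uint64(i)  # per-iteration scalar cast, no array allocation
    total_after += int(ii)
```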

NimaSarajpoor avatar Aug 21 '22 00:08 NimaSarajpoor

@NimaSarajpoor wow that looks promising! I didn't think using np.arange would be slower than using range and then converting to np.uint64.

alvii147 avatar Aug 21 '22 12:08 alvii147

@NimaSarajpoor wow that looks promising! I didn't think using np.arange would be slower than using range and then converting to np.uint64.

Good catch @NimaSarajpoor! I'm realizing that np.arange would end up needlessly creating a new array every single time. A Python range, in contrast, is a lazy sequence that hardly allocates any memory, which is much more efficient.

If we only need to call np.arange once or twice and the array isn't long then it's probably not a big deal (i.e., at the beginning of a function). However, if np.arange is called repeatedly (in a private function), then it likely affects performance! Now, let's be sure to compare Python range with np.int64 vs np.uint64 please!
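One quick way to see the allocation difference is to compare memory footprints directly (the exact numbers below assume a 64-bit CPython/NumPy build):

```python
import sys
import numpy as np

n = 1_000_000

lazy = range(n)                        # stores only start, stop, step
eager = np.arange(n, dtype=np.uint64)  # allocates 8 bytes per element up front

range_size = sys.getsizeof(lazy)  # a few dozen bytes, independent of n
array_size = eager.nbytes         # 8 * n bytes
```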

seanlaw avatar Aug 21 '22 13:08 seanlaw

@alvii147

@NimaSarajpoor wow that looks promising! I didn't think using np.arange would be slower than using range and then converting to np.uint64.

Yeah...I had no idea as well... I was just trying things :)

Good catch @NimaSarajpoor! I'm realizing that np.arange would end up needlessly creating a new array every single time. A Python range, in contrast, is a lazy sequence that hardly allocates any memory, which is much more efficient.

Ah...right! Maybe that is the reason. Thanks for the explanation

Now, let's be sure to compare Python range with np.int64 vs np.uint64 please!

Just a side note: one might try to replace np.arange with range in stump_tiles and see if it gains better performance

NimaSarajpoor avatar Aug 21 '22 17:08 NimaSarajpoor

About replacing np.arange with range:

I assume that the numba compiler can infer more information from range. From the compiler's point of view, the output of np.arange could be an arbitrary array with arbitrary content, whereas a range is a well-defined iterator with a known start, end, step size, and dtype.

Regarding the performance plots:

Please use logarithmic axes and a larger dynamic range for the input size. On logarithmic axes it is easier to spot scalability differences, and constant speedups remain visible. Relevant Python package: https://github.com/nschloe/perfplot

Regarding the dtype optimizations:

I once had good success porting my Python code to Cython, and then back to numba. The Cython compiler can show you useful information about the inferred data types for each line of code: https://cython.readthedocs.io/en/latest/src/quickstart/build.html#using-the-jupyter-notebook If the Cython compiler can infer the dtypes and compile the code to native code, then the chances are high that the numba compiler is also able to generate nice output.

It seems that numba also has some annotation capability: https://numba.readthedocs.io/en/stable/user/cli.html?highlight=annotate

JaKasb avatar Aug 24 '22 13:08 JaKasb

Hi @alvii147 Just checking in here. Any updates?

seanlaw avatar Sep 28 '22 14:09 seanlaw

Hey @seanlaw sorry I've been caught up with school, no new updates as of now. I'll start digging into this again in the coming weeks.

alvii147 avatar Sep 28 '22 15:09 alvii147

@NimaSarajpoor I tried to follow your changes here but didn't really see that big of an improvement.

Here's what I tried:

for diag_idx in range(diags_start_idx, diags_stop_idx):
    k = diags[diag_idx]

    if k >= 0:
        iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
    else:
        iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))

    for i in iter_range:
        ii = np.uint64(i)
        ik = np.uint64(i + k)
        ikm = np.uint64(i + k + m)
        im = np.uint64(i + m)

        if ii == 0 or (k < 0 and ii == -k):
            cov = np.dot((T_B[ik:ikm] - M_T[ik]), (T_A[ii:im] - μ_Q[ii])) * m_inverse

Here are the results:

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement | Results accurate |
| --- | --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00512 | 0.00533 | -4.081% | True |
| 2500 | 10 | 0.00972 | 0.00945 | 2.7% | True |
| 5000 | 10 | 0.02288 | 0.02252 | 1.561% | True |
| 10000 | 100 | 0.06956 | 0.07159 | -2.924% | True |
| 100000 | 100 | 6.73673 | 7.08620 | -5.187% | True |
| 100000 | 5000 | 7.25523 | 7.19030 | 0.895% | True |
| 200000 | 100 | 32.85791 | 32.80870 | 0.15% | True |
| 200000 | 5000 | 32.97569 | 33.48455 | -1.543% | True |
| 250000 | 100 | 55.83575 | 55.31517 | 0.932% | True |

Any ideas what we can try from here?

alvii147 avatar Oct 31 '22 01:10 alvii147

@alvii147 I tried to make the changes again and I got the following results on my pc:

# n: length of time series
# m: window size

| n | m | main (s) | uint (s) | improvement (%) |
| --- | --- | --- | --- | --- |
| 10k | 50 | 0.21 | 0.15 | 28.57142857 |
| 20k | 50 | 0.77 | 0.51 | 33.76623377 |
| 25k | 1000 | 1.15 | 0.81 | 29.56521739 |
| 50k | 50 | 4.82 | 3.44 | 28.63070539 |
| 50k | 5000 | 6.4 | 4.85 | 24.21875 |

I have provided my code below with some comments:

# in stumpy/stump.py

def _compute_diagonal(...):
    n_A = T_A.shape[0]
    n_B = T_B.shape[0]
    m_inverse = 1.0 / m
    constant = (m - 1) * m_inverse * m_inverse  # (m - 1)/(m * m)

    m = np.uint64(m) # JUST PUT IT OUTSIDE OF FOR-LOOP
    for diag_idx in range(diags_start_idx, diags_stop_idx):
        k = diags[diag_idx]

        if k >= 0:
            iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
        else:
            iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))

        for i in iter_range:
            # JUST calling np.uint64 twice here. 
            ii = np.uint64(i) 
            ik = np.uint64(i + k)
      
            if ii == 0 or ik == 0: # CONDITIONS SIMPLIFIED (WE DO NOT NEED `k < 0` )
                # ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
                # FOR INSTANCE: `i + k` is replaced with `ik`
                cov = (
                    np.dot(
                        (T_B[ik : ik + m] - M_T[ik]), (T_A[ii : ii + m] - μ_Q[ii])
                    )
                    * m_inverse
                )
            else: 
               # ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
                cov = cov + constant * (
                    cov_a[ik] * cov_b[ii] - cov_c[ik] * cov_d[ii]
                )
            
            # ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
            if T_B_subseq_isfinite[ik] and T_A_subseq_isfinite[ii]:
                # Neither subsequence contains NaNs
                if T_B_subseq_isconstant[ik] or T_A_subseq_isconstant[ii]:
                    pearson = 0.5
                else:
                    pearson = cov * Σ_T_inverse[ik] * σ_Q_inverse[ii]

                if T_B_subseq_isconstant[ik] and T_A_subseq_isconstant[ii]:
                    pearson = 1.0

                if pearson > ρ[thread_idx, ii, 0]:
                    ρ[thread_idx, ii, 0] = pearson
                    I[thread_idx, ii, 0] = ik

                if ignore_trivial:  # self-joins only
                    if pearson > ρ[thread_idx, ik, 0]:
                        ρ[thread_idx, ik, 0] = pearson
                        I[thread_idx, ik, 0] = ii

                    if ii < ik:
                        # left pearson correlation and left matrix profile index
                        if pearson > ρ[thread_idx, ik, 1]:
                            ρ[thread_idx, ik, 1] = pearson
                            I[thread_idx, ik, 1] = ii

                        # right pearson correlation and right matrix profile index
                        if pearson > ρ[thread_idx, ii, 2]:
                            ρ[thread_idx, ii, 2] = pearson
                            I[thread_idx, ii, 2] = ik

    return

I haven't played with it yet...so I do not know what particular change has the most impact here.


Also, the performance code:

import time

import numpy as np
import stumpy

# set `n` (time series length) and `m` (window size) beforehand
seed = 0
np.random.seed(seed)
T = np.random.rand(n)

stumpy.stump(T, m)  # warm-up call to trigger numba JIT compilation

n_iter = 20
lst = []

for itr in range(n_iter):
    tic = time.time()
    stumpy.stump(T, m)
    toc = time.time()

    lst.append(toc - tic)

a = np.array(lst)
np.median(a)
# WHY MEDIAN? BECAUSE I THINK IT IS MORE ROBUST!

Also, small things might affect the computing time, for example:

  • whether your laptop is plugged in or unplugged (it can change the power available to the CPU, and hence its performance)
  • whether you run lots of iterations back to back for both versions, stump and stump_uint (in some cases, the computing time in later iterations creeps up). So, it is safer to run the performance test for each version separately and then take the median (I say median because, in a few cases, the computing time suddenly spikes and then goes back to normal; so, I think the median is better than the mean)

I tried to be careful about the two aforementioned items (I was consistent) when I was checking the performance of the two versions.
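To illustrate why the median is the safer summary here, consider a fabricated set of timings where a background process steals the CPU for one run (all values below are made up):

```python
import numpy as np

# nine typical runs plus one outlier caused by a background process
timings = np.array([0.50, 0.51, 0.49, 0.50, 0.52, 0.50, 0.49, 0.51, 0.50, 2.30])

mean_t = timings.mean()        # dragged up by the single outlier
median_t = np.median(timings)  # barely moves
```

A single bad run shifts the mean by a third of its value, while the median stays at the typical run time.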

NimaSarajpoor avatar Oct 31 '22 04:10 NimaSarajpoor

Oops my bad looks like I screwed up my code for performance evaluation. It's looking better now:

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement | Results accurate |
| --- | --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00502 | 0.00476 | 5.126% | True |
| 2500 | 10 | 0.00977 | 0.00797 | 18.373% | True |
| 5000 | 10 | 0.02324 | 0.01986 | 14.543% | True |
| 10000 | 100 | 0.07620 | 0.05531 | 27.415% | True |
| 100000 | 100 | 7.21882 | 5.64659 | 21.78% | True |
| 100000 | 5000 | 7.79295 | 6.33472 | 18.712% | True |
| 200000 | 100 | 36.84522 | 33.80169 | 8.26% | True |
| 200000 | 5000 | 36.13178 | 34.10573 | 5.607% | True |
| 250000 | 100 | 60.89817 | 58.27914 | 4.301% | True |

It looks like the performance gap closes as we use larger time series. I can somewhat make sense of that: the performance difference between int and uint may become less significant as the overall matrix profile computation gets heavier.

@seanlaw what do you think?

Also, what's up with this? Am I missing a dependency?

alvii147 avatar Oct 31 '22 14:10 alvii147

It looks like the performance gap closes as we use larger time series. I can somewhat make sense of that, the performance difference between int and uint may be more insignificant as the overall matrix profile computation gets heavier. @seanlaw what do you think?

That's much better. However, can we simply see/understand how much of this improvement is coming from the use of range by itself? I really don't like seeing:

ii = np.uint64(i)
ik = np.uint64(i + k)

as I find it really ugly 😢

Also, what's up with this? Am I missing a dependency?

There seems to be an issue with the distributed package. I filed an issue here and hopefully a fix is coming soon.

seanlaw avatar Oct 31 '22 14:10 seanlaw

There seems to be an issue with the distributed package. I filed an issue at https://github.com/dask/distributed/issues/7227 and hopefully a fix is coming soon

This has been resolved upstream in Dask. My apologies for the issue. The previous release has been yanked from PyPI and a new release has been published. It is on PyPI now and should be on conda-forge shortly.

mrocklin avatar Oct 31 '22 16:10 mrocklin

@seanlaw

However, can we simply see/understand how much of this improvement is coming from the use of range by itself?

I think what @alvii147 is trying to do here is optimizing the existing version of stump, which already uses range.

I really don't like seeing:

ii = np.uint64(i)
ik = np.uint64(i + k)

as I find it really ugly

Based on what I did, we have two main changes here:

  • removing the condition k < 0 (regardless of using np.uint)
  • using ii=np.uint64(i) and ik=np.uint64(i+k)

NOTE: Better names might be i_uint and ik_sum (but they are still not pretty)

NimaSarajpoor avatar Oct 31 '22 19:10 NimaSarajpoor

I think what @alvii147 is trying to do here is optimizing the existing version of stump, which already uses range.

Exactly, @seanlaw range by itself is what stump currently uses. The difference is the usage of uint64.

I really don't like seeing:

ii = np.uint64(i)
ik = np.uint64(i + k)

as I find it really ugly 😢

I agree this looks sketchy; the naming was just for testing. We can change it to something like i_uint and ik_uint?

alvii147 avatar Oct 31 '22 21:10 alvii147

removing the condition k < 0 (regardless of using np.uint)

@NimaSarajpoor this was an interesting optimization, my bad I forgot to include this in my last commit.

Now that I think about this change, do you think this should be a separate issue? That is, an issue that suggests using:

if i == 0 or i == -k:

Instead of

if i == 0 or (k < 0 and i == -k)

I don't think there's much difference in performance with this, but what you suggested looks cleaner. And correct me if I'm wrong @seanlaw, but if k > 0 then we will never satisfy i == -k anyway, as i should be positive.

alvii147 avatar Oct 31 '22 21:10 alvii147

I don't think there's much difference in performance with this, but what you suggested looks cleaner. And correct me if I'm wrong @seanlaw, but if k > 0 then we will never satisfy i == -k anyway, as i should be positive.

@alvii147 Frankly, I don't recall the reasoning for if i == 0 or (k < 0 and i == -k) as I did this a long time ago. I just remember that it was a necessary check.

I think what @alvii147 is trying to do here is optimizing the existing version of stump, which already uses range.

Thanks for the reminder. I had to go and refresh my memory!

  1. Can you ensure that diags returns unsigned ints so that k is an unsigned int (and can then be added to an unsigned i, see below)?
  2. Instead of converting single values into uint, can you cast the entire iter_range to uint in one shot? If not, then I propose doing something like:
for signed_i in iter_range:
    i = np.uint64(signed_i)

I actually like seeing i + k everywhere...

seanlaw avatar Oct 31 '22 22:10 seanlaw

@alvii147 Frankly, I don't recall the reasoning for if i == 0 or (k < 0 and i == -k) as I did this a long time ago. I just remember that it was a necessary check.

I'm nearly certain the k < 0 is not necessary, both when you think about this with and without context.

Without context: i is always positive, hence if i == -k, k will always be negative. The exception is when i = k = 0, but the i == 0 case is already covered by the first part of the statement.

With context, i == -k is just checking whether or not we are in the left most column of the distance matrix, and if so, we'd have to use the dot product to compute the distance, as opposed to using the diagonal property. When this is the case, k should always be negative anyway, as we would be dealing with a negative diagonal index.
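The claim that the k < 0 guard is redundant can be checked by brute force over a toy problem (the sizes below are arbitrary, chosen only so the enumeration is small); every (i, k) pair where i == -k fires either has k < 0, or is the i = k = 0 case that i == 0 already covers:

```python
# enumerate every (diagonal k, row i) pair the iter_range logic can produce
n_A, n_B, m = 12, 15, 3

hits = []  # (i, k) pairs where the i == -k condition fires
for k in range(-(n_A - m), n_B - m + 1):
    if k >= 0:
        iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
    else:
        iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))
    for i in iter_range:
        if i == -k:
            hits.append((i, k))
```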

Can you ensure that diags returns unsigned ints so that k is an unsigned int (and can then be added to an unsigned i (see below)

Unfortunately that can't happen, as diags can be negative, hence k can be negative. i + k on the other hand, should and will always be positive.

Instead of converting single values into uint, can you cast the entire iter_range to uint in one shot?

I was unable to find a good way to do this. If anything, we'd be going back to using np.arange, which we now know is really inefficient for this use case.

If not, then I propose doing something like:

for signed_i in iter_range:
    i = np.uint64(signed_i)

This looks cleaner! I can do this.

I actually like seeing i + k everywhere...

Agreed, i + k looks better and is more explanatory than ik. I have a counter-proposal: how about j = i + k?

The reason why I propose this is, i represents the current row of the distance matrix. Because k is the current diagonal index, i + k is essentially the current column of the distance matrix. If i is the row, it makes intuitive sense to name the column j. What do you think @seanlaw ?
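The row/column reading of i and j = i + k can be sanity-checked on a toy matrix (the size and diagonal index below are arbitrary): walking diagonal k visits entry (row i, column i + k), which matches NumPy's built-in diagonal extraction.

```python
import numpy as np

n = 5
D = np.arange(n * n).reshape(n, n)  # stand-in for a small distance matrix

k = 2  # diagonal index
walked = [D[i, i + k] for i in range(n - k)]  # row i, column j = i + k

# the same entries via the built-in helper
reference = list(np.diagonal(D, offset=k))
```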

I made the code a bit cleaner, used signed_i as you suggested and used j for np.uint64(i + k). In terms of performance, the results were roughly the same. Here's what it looks like, let me know what you think.

def _compute_diagonal(...):
    n_A = T_A.shape[0]
    n_B = T_B.shape[0]
    m_inverse = 1.0 / m
    constant = (m - 1) * m_inverse * m_inverse  # (m - 1)/(m * m)

    for diag_idx in range(diags_start_idx, diags_stop_idx):
        k = diags[diag_idx]

        if k >= 0:
            iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
        else:
            iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))

        for signed_i in iter_range:
            i = np.uint64(signed_i)
            j = np.uint64(i + k)

            if i == 0 or (k < 0 and i == -k):
                cov = (
                    np.dot(
                        (T_B[j : j + np.uint64(m)] - M_T[j]),
                        (T_A[i : i + np.uint64(m)] - μ_Q[i]),
                    )
                    * m_inverse
                )
            else:
                # The next lines are equivalent and left for reference
                # cov = cov + constant * (
                #     (T_B[i + k + m - 1] - M_T_m_1[i + k])
                #     * (T_A[i + m - 1] - μ_Q_m_1[i])
                #     - (T_B[i + k - 1] - M_T_m_1[i + k]) * (T_A[i - 1] - μ_Q_m_1[i])
                # )
                cov = cov + constant * (cov_a[j] * cov_b[i] - cov_c[j] * cov_d[i])

            if T_B_subseq_isfinite[j] and T_A_subseq_isfinite[i]:
                # Neither subsequence contains NaNs
                if T_B_subseq_isconstant[j] or T_A_subseq_isconstant[i]:
                    pearson = 0.5
                else:
                    pearson = cov * Σ_T_inverse[j] * σ_Q_inverse[i]

                if T_B_subseq_isconstant[j] and T_A_subseq_isconstant[i]:
                    pearson = 1.0

                if pearson > ρ[thread_idx, i, 0]:
                    ρ[thread_idx, i, 0] = pearson
                    I[thread_idx, i, 0] = j

                if ignore_trivial:  # self-joins only
                    if pearson > ρ[thread_idx, j, 0]:
                        ρ[thread_idx, j, 0] = pearson
                        I[thread_idx, j, 0] = i

                    if i < j:
                        # left pearson correlation and left matrix profile index
                        if pearson > ρ[thread_idx, j, 1]:
                            ρ[thread_idx, j, 1] = pearson
                            I[thread_idx, j, 1] = i

                        # right pearson correlation and right matrix profile index
                        if pearson > ρ[thread_idx, i, 2]:
                            ρ[thread_idx, i, 2] = pearson
                            I[thread_idx, i, 2] = j

    return

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement | Results accurate |
| --- | --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00514 | 0.00579 | -12.7% | True |
| 2500 | 10 | 0.01070 | 0.00911 | 14.811% | True |
| 5000 | 10 | 0.02291 | 0.02104 | 8.16% | True |
| 10000 | 100 | 0.07196 | 0.05902 | 17.985% | True |
| 100000 | 100 | 6.71042 | 5.61217 | 16.366% | True |
| 100000 | 5000 | 7.37240 | 5.81313 | 21.15% | True |
| 200000 | 100 | 32.59256 | 31.21777 | 4.218% | True |
| 200000 | 5000 | 33.01471 | 31.84025 | 3.557% | True |
| 250000 | 100 | 55.54036 | 54.86022 | 1.225% | True |

alvii147 avatar Nov 01 '22 13:11 alvii147

@alvii147 There's something that doesn't seem right to me, and I'm guessing that your timing comparisons aren't capturing this because you are performing a self-join only and you aren't checking the AB-join accuracy.

For an AB-join, we'd end up doing iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)), which means that signed_i in for signed_i in iter_range: can be negative, and yet you then force it to be unsigned in the next line with i = np.uint64(signed_i) when, in fact, it must be negative. Am I missing something in the logic here? Does it make sense to allow i = np.uint64(signed_i) when signed_i is negative?

seanlaw avatar Nov 01 '22 18:11 seanlaw

So, for an AB-join, we'd end up doing iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)), which means that signed_i in for signed_i in iter_range: can be negative

I don't think so, because iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)) is only executed if k < 0.

@alvii147 There's something that doesn't seem right to me and I'm guessing that your timing comparisons aren't capturing this because you are performing a self-join only and you aren't checking the AB-join accuracy.

Good point! Here are the performance results with AB joins instead of self joins:

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement | Results accurate |
| --- | --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00496 | 0.00485 | 2.305% | True |
| 2500 | 10 | 0.01112 | 0.00962 | 13.453% | True |
| 5000 | 10 | 0.02812 | 0.01997 | 28.995% | True |
| 10000 | 100 | 0.07364 | 0.05559 | 24.518% | True |
| 100000 | 100 | 6.82463 | 5.73498 | 15.966% | True |
| 100000 | 5000 | 7.58997 | 6.07498 | 19.96% | True |
| 200000 | 100 | 33.59375 | 31.84067 | 5.218% | True |
| 200000 | 5000 | 35.25538 | 33.29318 | 5.566% | True |
| 250000 | 100 | 56.13076 | 54.11020 | 3.6% | True |

In terms of accuracy, if we're in a bad spot, tests/test_stump_uint.py will let us know. :smile:

alvii147 avatar Nov 01 '22 20:11 alvii147

I don't think so, because iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)) is only executed if k < 0.

Ah gotcha! Thanks for the clarification.

seanlaw avatar Nov 01 '22 20:11 seanlaw

No worries! Any other thoughts before this PR is ready?

alvii147 avatar Nov 01 '22 21:11 alvii147