
[WIP] Unsigned Integer Indexing Optimizations (#658)

Open alvii147 opened this issue 1 year ago • 14 comments

Added my first attempts at optimizing stump using unsigned integer indexing. I added a temporary stumpy.stump_uint module, which (if successfully optimized) may replace stump. But I haven't had much luck so far; maybe someone could help me out here. Here's what I've done and the results:

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement |
| --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00966 | 0.01066 | -10.35% |
| 2500 | 10 | 0.02866 | 0.02933 | -2.323% |
| 5000 | 10 | 0.079 | 0.095 | -20.252% |
| 10000 | 100 | 0.27566 | 0.30233 | -9.674% |
| 100000 | 100 | 26.511 | 27.163 | -2.458% |
| 100000 | 5000 | 25.706 | 27.825 | -8.245% |
| 200000 | 100 | 112.324 | 108.871 | 3.074% |
| 200000 | 5000 | 105.614 | 104.584 | 0.976% |
| 250000 | 100 | 169.058 | 169.176 | -0.07% |

Doesn't look like there's much of a difference just yet, but I'm nearly certain this is just me optimizing the wrong parts. Let me know if you have feedback, feel free to give this a shot yourself.

alvii147 avatar Aug 19 '22 22:08 alvii147

Oh, and here's a diff based on what I've done, since these are different files and GitHub won't show you a diff.

(screenshot: diff, part 1)

(screenshot: diff, part 2)

alvii147 avatar Aug 19 '22 22:08 alvii147

Thanks @alvii147!

@NimaSarajpoor Do you have any time to take a first pass and review? Right now, it doesn't look like there is any speed up

seanlaw avatar Aug 20 '22 00:08 seanlaw

@alvii147 Can you fix the formatting so that the black checks pass? Btw, I don't see anything obvious that you are doing wrong 😢

seanlaw avatar Aug 20 '22 13:08 seanlaw

@alvii147 @seanlaw So, I took a look at the code and worked with it a little bit. I tried different things but didn't see improvement! ~I have a couple of comments that I am going to share with you.~

NimaSarajpoor avatar Aug 20 '22 15:08 NimaSarajpoor

@seanlaw @alvii147 Okay...so I had some comments but apparently I was wrong. I am doing some investigation. Will update you if I find something.

NimaSarajpoor avatar Aug 20 '22 16:08 NimaSarajpoor

@alvii147 Can you fix the formatting so that the black checks pass?

@seanlaw done! Also temporarily added test_stump_uint.py which is really just a copy of test_stump.py.

I am doing some investigation. Will update you if I find something.

@NimaSarajpoor thanks! Let me know what you find!

alvii147 avatar Aug 20 '22 19:08 alvii147

Codecov Report

Base: 99.89% // Head: 99.89% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (fa9c113) compared to base (208b796). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #659   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11453    11460    +7     
=======================================
+ Hits        11441    11448    +7     
  Misses         12       12           
Impacted Files Coverage Δ
stumpy/aamp.py 100.00% <100.00%> (ø)
stumpy/stump.py 100.00% <100.00%> (ø)


:umbrella: View full report at Codecov.

codecov-commenter avatar Aug 20 '22 19:08 codecov-commenter

This is what I got:

(plot: uint_performance)

And,

(plot: uint_performance)

So, it seems there is an improvement when the length of T is small. However, as T gets longer, the gap widens and np.int64 performs better. Maybe this helps us understand what is going on. It is worth mentioning that this conclusion differs slightly from what one may infer from the table provided above: according to the table, there is a small improvement when T becomes longer, which is not the case in my results. @alvii147: Did you use a fixed seed?


# performance code
import time

import numpy as np
import stumpy

# set `seed` and `m` beforehand
# n: a list of values indicating the length of `T`
stump_ComputeTime = []
stump_uint_ComputeTime = []

for n_T in n:
    np.random.seed(seed)
    T = np.random.rand(n_T)

    local_lst = []
    for _ in range(3):
        tic = time.time()
        stumpy.stump(T, m)
        toc = time.time()
        local_lst.append(toc - tic)

    # skip the first run (it includes numba JIT compilation) and average the rest
    stump_ComputeTime.append(np.mean(local_lst[1:]))

    #######################

    # we do not need to re-create `T`; I just repeated it here
    np.random.seed(seed)
    T = np.random.rand(n_T)

    local_lst = []
    for _ in range(3):
        tic = time.time()
        stumpy.stump_uint(T, m)
        toc = time.time()
        local_lst.append(toc - tic)

    stump_uint_ComputeTime.append(np.mean(local_lst[1:]))


I also tried the following changes but they did not help:

  • use np.uint64 instead of numba.uint64 (I wanted to make it similar to stump_tiles)
  • move m=np.uint64(m) to the outside of the outer-loop
  • just keep ik = np.uint64(i + k)

NimaSarajpoor avatar Aug 20 '22 21:08 NimaSarajpoor

according to the table, there is a little improvement when T becomes longer

@NimaSarajpoor I think the differences in that table are too insignificant to conclude that one takes longer than the other. If anything, I think it shows that it's pretty much the same.

Did you use a fixed seed?

No, I've just been randomly generating time series and executing stump and stump_uint 3-4 times each, then taking the mean of the execution times (this is probably a good place for a t-based significance test).
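To sketch the significance-test idea, here is a minimal Welch's t-statistic computed by hand on two hypothetical timing samples (the sample values and names below are made up purely for illustration, not real stump measurements):

```python
import math
import random

def welch_t(a, b):
    """Welch's t-statistic for two independent samples of run times.
    A |t| well above ~2 suggests the mean run times genuinely differ."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# hypothetical timings (seconds) for stump vs. stump_uint on the same input
random.seed(0)
stump_times = [0.100 + random.gauss(0, 0.005) for _ in range(10)]
uint_times = [0.080 + random.gauss(0, 0.005) for _ in range(10)]

t = welch_t(stump_times, uint_times)  # clearly separated samples give a large |t|
```

With overlapping samples, |t| stays near zero and the observed difference is likely noise.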

@NimaSarajpoor your plots show that we're probably not utilizing the uint trick properly. Either that, or there is not much room for improvement for stump. But it's too early to surrender, we can keep trying, I'll try to do this step by step and see which change takes more time than others.

Thanks a lot for your help!

alvii147 avatar Aug 20 '22 22:08 alvii147

@alvii147 @seanlaw Okay...I got something... And, I hope you can work on it to see if we can make it better / cleaner:

(plot: uint_performance)

And,

(plot: uint_performance_longer)



# stump_uint_v3

for diag_idx in range(diags_start_idx, diags_stop_idx):
    k = diags[diag_idx]

    if k >= 0:
        # USED range, and REMOVED dtype=np.uint64
        iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
    else:
        iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))

    m = np.uint64(m)
    for i in iter_range:
        ii = np.uint64(i)
        ik = np.uint64(i + k)
        if ii == 0 or ik == 0:
            cov = (
                np.dot((T_B[ik : ik + m] - M_T[ik]), (T_A[ii : ii + m] - μ_Q[ii]))
                * m_inverse
            )
        else:
            ###

I think the main factor in improvement was using range(...) instead of np.arange(....., dtype=np.uint64).
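To make the change concrete, here is a toy before/after comparison (the values of n and k are arbitrary and not tied to any real stump input); both loops visit exactly the same indices:

```python
import numpy as np

n, k = 10, 3

# before: np.arange materializes a full uint64 index array on every diagonal
idx_array = np.arange(0, n - k, dtype=np.uint64)
total_before = sum(int(i) for i in idx_array)

# after: a lazy Python range, casting only the scalar index actually used
total_after = 0
for i in range(0, n - k):
    ii = np.uint64(i)  # per-iteration scalar cast, no array allocation
    total_after += int(ii)
```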

NimaSarajpoor avatar Aug 21 '22 00:08 NimaSarajpoor

@NimaSarajpoor wow that looks promising! I didn't think using np.arange would be slower than using range and then converting to np.uint64.

alvii147 avatar Aug 21 '22 12:08 alvii147

@NimaSarajpoor wow that looks promising! I didn't think using np.arange would be slower than using range and then converting to np.uint64.

Good catch @NimaSarajpoor! I'm realizing that np.arange would end up needlessly creating a new array every single time. A Python range, in contrast, is a lazy sequence that hardly allocates any memory, which is much more efficient.

If we only need to call np.arange once or twice and the array isn't long then it's probably not a big deal (i.e., at the beginning of a function). However, if np.arange is called repeatedly (in a private function), then it likely affects performance! Now, let's be sure to compare Python range with np.int64 vs np.uint64 please!
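One quick way to see the allocation difference is to compare memory footprints directly (the exact numbers below assume a 64-bit CPython/NumPy build):

```python
import sys
import numpy as np

n = 1_000_000

lazy = range(n)                        # stores only start, stop, step
eager = np.arange(n, dtype=np.uint64)  # allocates 8 bytes per element up front

range_size = sys.getsizeof(lazy)  # a few dozen bytes, independent of n
array_size = eager.nbytes         # 8 * n bytes
```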

seanlaw avatar Aug 21 '22 13:08 seanlaw

@alvii147

@NimaSarajpoor wow that looks promising! I didn't think using np.arange would be slower than using range and then converting to np.uint64.

Yeah...I had no idea as well... I was just trying things :)

Good catch @NimaSarajpoor! I'm realizing that np.arange would end up needlessly creating a new array every single time. A Python range, in contrast, is a lazy sequence that hardly allocates any memory, which is much more efficient.

Ah...right! Maybe that is the reason. Thanks for the explanation

Now, let's be sure to compare Python range with np.int64 vs np.uint64 please!

Just a side note: one might try to replace np.arange with range in stump_tiles and see if it gains better performance

NimaSarajpoor avatar Aug 21 '22 17:08 NimaSarajpoor

About replacing np.arange with range:

I assume that the numba compiler can infer more information from range. From the compiler's point of view, the output of np.arange could be an arbitrary array with arbitrary content, whereas a range is a well-defined iterator with a known start, end, step size, and dtype.

Regarding the performance plots:

Please use logarithmic axes and a larger dynamic range for the input size. On logarithmic axes it is easier to spot scalability differences, and constant speedups remain visible. Relevant Python package: https://github.com/nschloe/perfplot

Regarding the dtype optimizations:

I once had good success porting my Python code to Cython, and then back to numba. The Cython compiler can show you useful information about the inferred data types for each line of code: https://cython.readthedocs.io/en/latest/src/quickstart/build.html#using-the-jupyter-notebook If the Cython compiler can infer the dtypes and compile the code to native code, then the chances are high that the numba compiler is also able to generate nice output.

It seems that numba also has some annotation capability: https://numba.readthedocs.io/en/stable/user/cli.html?highlight=annotate

JaKasb avatar Aug 24 '22 13:08 JaKasb

Hi @alvii147 Just checking in here. Any updates?

seanlaw avatar Sep 28 '22 14:09 seanlaw

Hey @seanlaw sorry I've been caught up with school, no new updates as of now. I'll start digging into this again in the coming weeks.

alvii147 avatar Sep 28 '22 15:09 alvii147

@NimaSarajpoor I tried to follow your changes here but didn't really see that big of an improvement.

Here's what I tried:

for diag_idx in range(diags_start_idx, diags_stop_idx):
    k = diags[diag_idx]

    if k >= 0:
        iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
    else:
        iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))

    for i in iter_range:
        ii = np.uint64(i)
        ik = np.uint64(i + k)
        ikm = np.uint64(i + k + m)
        im = np.uint64(i + m)

        if ii == 0 or (k < 0 and ii == -k):
            cov = np.dot((T_B[ik:ikm] - M_T[ik]), (T_A[ii:im] - μ_Q[ii])) * m_inverse

Here are the results:

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement | Results accurate |
| --- | --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00512 | 0.00533 | -4.081% | True |
| 2500 | 10 | 0.00972 | 0.00945 | 2.7% | True |
| 5000 | 10 | 0.02288 | 0.02252 | 1.561% | True |
| 10000 | 100 | 0.06956 | 0.07159 | -2.924% | True |
| 100000 | 100 | 6.73673 | 7.08620 | -5.187% | True |
| 100000 | 5000 | 7.25523 | 7.19030 | 0.895% | True |
| 200000 | 100 | 32.85791 | 32.80870 | 0.15% | True |
| 200000 | 5000 | 32.97569 | 33.48455 | -1.543% | True |
| 250000 | 100 | 55.83575 | 55.31517 | 0.932% | True |

Any ideas what we can try from here?

alvii147 avatar Oct 31 '22 01:10 alvii147

@alvii147 I tried to make the changes again and I got the following results on my pc:

# n: length of time series
# m: window size

| n | m | main (s) | uint (s) | improvement (%) |
| --- | --- | --- | --- | --- |
| 10k | 50 | 0.21 | 0.15 | 28.57142857 |
| 20k | 50 | 0.77 | 0.51 | 33.76623377 |
| 25k | 1000 | 1.15 | 0.81 | 29.56521739 |
| 50k | 50 | 4.82 | 3.44 | 28.63070539 |
| 50k | 5000 | 6.4 | 4.85 | 24.21875 |

I have provided my code below with some comments:

# in stumpy/stump.py

def _compute_diagonal(...):
    n_A = T_A.shape[0]
    n_B = T_B.shape[0]
    m_inverse = 1.0 / m
    constant = (m - 1) * m_inverse * m_inverse  # (m - 1)/(m * m)

    m = np.uint64(m) # JUST PUT IT OUTSIDE OF FOR-LOOP
    for diag_idx in range(diags_start_idx, diags_stop_idx):
        k = diags[diag_idx]

        if k >= 0:
            iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
        else:
            iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))

        for i in iter_range:
            # JUST calling np.uint64 twice here. 
            ii = np.uint64(i) 
            ik = np.uint64(i + k)
      
            if ii == 0 or ik == 0: # CONDITIONS SIMPLIFIED (WE DO NOT NEED `k < 0` )
                # ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
                # FOR INSTANCE: `i + k` is replaced with `ik`
                cov = (
                    np.dot(
                        (T_B[ik : ik + m] - M_T[ik]), (T_A[ii : ii + m] - μ_Q[ii])
                    )
                    * m_inverse
                )
            else: 
               # ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
                cov = cov + constant * (
                    cov_a[ik] * cov_b[ii] - cov_c[ik] * cov_d[ii]
                )
            
            # ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
            if T_B_subseq_isfinite[ik] and T_A_subseq_isfinite[ii]:
                # Neither subsequence contains NaNs
                if T_B_subseq_isconstant[ik] or T_A_subseq_isconstant[ii]:
                    pearson = 0.5
                else:
                    pearson = cov * Σ_T_inverse[ik] * σ_Q_inverse[ii]

                if T_B_subseq_isconstant[ik] and T_A_subseq_isconstant[ii]:
                    pearson = 1.0

                if pearson > ρ[thread_idx, ii, 0]:
                    ρ[thread_idx, ii, 0] = pearson
                    I[thread_idx, ii, 0] = ik

                if ignore_trivial:  # self-joins only
                    if pearson > ρ[thread_idx, ik, 0]:
                        ρ[thread_idx, ik, 0] = pearson
                        I[thread_idx, ik, 0] = ii

                    if ii < ik:
                        # left pearson correlation and left matrix profile index
                        if pearson > ρ[thread_idx, ik, 1]:
                            ρ[thread_idx, ik, 1] = pearson
                            I[thread_idx, ik, 1] = ii

                        # right pearson correlation and right matrix profile index
                        if pearson > ρ[thread_idx, ii, 2]:
                            ρ[thread_idx, ii, 2] = pearson
                            I[thread_idx, ii, 2] = ik

    return

I haven't played with it yet...so I do not know what particular change has the most impact here.


Also, the performance code:

import time

import numpy as np
import stumpy

# set `n` (time series length) and `m` (window size) beforehand
seed = 0
np.random.seed(seed)
T = np.random.rand(n)

stumpy.stump(T, m)  # warm-up call to trigger numba JIT compilation

n_iter = 20
lst = []

for itr in range(n_iter):
    tic = time.time()
    stumpy.stump(T, m)
    toc = time.time()

    lst.append(toc - tic)

a = np.array(lst)
np.median(a)
# WHY MEDIAN? BECAUSE I THINK IT IS MORE ROBUST!

Also, small things might affect the computing time, for example:

  • whether your laptop is plugged in or unplugged (it can change the power available to the CPU, and hence its performance)
  • whether you run lots of iterations back to back for both versions, stump and stump_uint (in some cases, the computing time in later iterations creeps up). So, it is safer to run the performance test for each version separately and then take the median (I say median because, in a few cases, the computing time suddenly spikes and then goes back to normal; so, I think the median is better than the mean)

I tried to be careful about the two aforementioned items (I was consistent) when I was checking the performance of the two versions.
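To illustrate why the median is the safer summary here, consider a fabricated set of timings where a background process steals the CPU for one run (all values below are made up):

```python
import numpy as np

# nine typical runs plus one outlier caused by a background process
timings = np.array([0.50, 0.51, 0.49, 0.50, 0.52, 0.50, 0.49, 0.51, 0.50, 2.30])

mean_t = timings.mean()        # dragged up by the single outlier
median_t = np.median(timings)  # barely moves
```

A single bad run shifts the mean by a third of its value, while the median stays at the typical run time.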

NimaSarajpoor avatar Oct 31 '22 04:10 NimaSarajpoor

Oops my bad looks like I screwed up my code for performance evaluation. It's looking better now:

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement | Results accurate |
| --- | --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00502 | 0.00476 | 5.126% | True |
| 2500 | 10 | 0.00977 | 0.00797 | 18.373% | True |
| 5000 | 10 | 0.02324 | 0.01986 | 14.543% | True |
| 10000 | 100 | 0.07620 | 0.05531 | 27.415% | True |
| 100000 | 100 | 7.21882 | 5.64659 | 21.78% | True |
| 100000 | 5000 | 7.79295 | 6.33472 | 18.712% | True |
| 200000 | 100 | 36.84522 | 33.80169 | 8.26% | True |
| 200000 | 5000 | 36.13178 | 34.10573 | 5.607% | True |
| 250000 | 100 | 60.89817 | 58.27914 | 4.301% | True |

It looks like the performance gap closes as we use larger time series. I can somewhat make sense of that: the performance difference between int and uint may become less significant as the overall matrix profile computation gets heavier.

@seanlaw what do you think?

Also, what's up with this? Am I missing a dependency?

alvii147 avatar Oct 31 '22 14:10 alvii147

It looks like the performance gap closes as we use larger time series. I can somewhat make sense of that, the performance difference between int and uint may be more insignificant as the overall matrix profile computation gets heavier. @seanlaw what do you think?

That's much better. However, can we simply see/understand how much of this improvement is coming from the use of range by itself? I really don't like seeing:

ii = np.uint64(i)
ik = np.uint64(i + k)

as I find it really ugly 😢

Also, what's up with this? Am I missing a dependency?

There seems to be an issue with the distributed package. I filed an issue here and hopefully a fix is coming soon.

seanlaw avatar Oct 31 '22 14:10 seanlaw

There seems to be an issue with the distributed package. I filed an issue at https://github.com/dask/distributed/issues/7227 and hopefully a fix is coming soon

This has been resolved upstream in Dask. My apologies for the issue. The previous release has been yanked from PyPI and a new release has been published. It is on PyPI now and should be on conda-forge shortly.

mrocklin avatar Oct 31 '22 16:10 mrocklin

@seanlaw

However, can we simply see/understand how much of this improvement is coming from the use of range by itself?

I think what @alvii147 is trying to do here is optimizing the existing version of stump, which already uses range.

I really don't like seeing:

ii = np.uint64(i)
ik = np.uint64(i + k)

as I find it really ugly

Based on what I did, we have two main changes here:

  • removing the condition k < 0 (regardless of using np.uint)
  • using ii=np.uint64(i) and ik=np.uint64(i+k)

NOTE: Better names might be i_uint and ik_sum (but they are still not pretty)

NimaSarajpoor avatar Oct 31 '22 19:10 NimaSarajpoor

I think what @alvii147 is trying to do here is optimizing the existing version of stump, which already uses range.

Exactly, @seanlaw range by itself is what stump currently uses. The difference is the usage of uint64.

I really don't like seeing:

ii = np.uint64(i)
ik = np.uint64(i + k)

as I find it really ugly 😢

I agree this looks sketchy; the naming was just for testing. We can change it to something like i_uint and ik_uint?

alvii147 avatar Oct 31 '22 21:10 alvii147

removing the condition k < 0 (regardless of using np.uint)

@NimaSarajpoor this was an interesting optimization, my bad I forgot to include this in my last commit.

Now that I think about this change, do you think this should be a separate issue? That is, an issue that suggests using:

if i == 0 or i == -k:

Instead of

if i == 0 or (k < 0 and i == -k)

I don't think there's much difference in performance with this, but what you suggested looks cleaner. And correct me if I'm wrong @seanlaw, but if k > 0 then we will never satisfy i == -k anyway, as i should be positive.

alvii147 avatar Oct 31 '22 21:10 alvii147

I don't think there's much difference in performance with this, but what you suggested looks cleaner. And correct me if I'm wrong @seanlaw, but if k > 0 then we will never satisfy i == -k anyway, as i should be positive.

@alvii147 Frankly, I don't recall the reasoning for if i == 0 or (k < 0 and i == -k) as I did this a long time ago. I just remember that it was a necessary check.

I think what @alvii147 is trying to do here is optimizing the existing version of stump, which already uses range.

Thanks for the reminder. I had to go and refresh my memory!

  1. Can you ensure that diags returns unsigned ints so that k is an unsigned int (and can then be added to an unsigned i, see below)?
  2. Instead of converting single values into uint, can you cast the entire iter_range to uint in one shot? If not, then I propose doing something like:
for signed_i in iter_range:
    i = np.uint64(signed_i)

I actually like seeing i + k everywhere...

seanlaw avatar Oct 31 '22 22:10 seanlaw

@alvii147 Frankly, I don't recall the reasoning for if i == 0 or (k < 0 and i == -k) as I did this a long time ago. I just remember that it was a necessary check.

I'm nearly certain the k < 0 is not necessary, both when you think about this with and without context.

Without context: i is always positive, hence if i == -k, k will always be negative. The exception is when i = k = 0, but the i == 0 case is already covered by the first part of the statement.

With context, i == -k is just checking whether or not we are in the left most column of the distance matrix, and if so, we'd have to use the dot product to compute the distance, as opposed to using the diagonal property. When this is the case, k should always be negative anyway, as we would be dealing with a negative diagonal index.
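The claim that the k < 0 guard is redundant can be checked by brute force over a toy problem (the sizes below are arbitrary, chosen only so the enumeration is small); every (i, k) pair where i == -k fires either has k < 0, or is the i = k = 0 case that i == 0 already covers:

```python
# enumerate every (diagonal k, row i) pair the iter_range logic can produce
n_A, n_B, m = 12, 15, 3

hits = []  # (i, k) pairs where the i == -k condition fires
for k in range(-(n_A - m), n_B - m + 1):
    if k >= 0:
        iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
    else:
        iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))
    for i in iter_range:
        if i == -k:
            hits.append((i, k))
```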

Can you ensure that diags returns unsigned ints so that k is an unsigned int (and can then be added to an unsigned i (see below)

Unfortunately that can't happen, as diags can be negative, hence k can be negative. i + k on the other hand, should and will always be positive.

Instead of converting single values into uint, can you cast the entire iter_range to uint in one shot?

I was unable to find a good way to do this. If anything, we'd be going back to using np.arange, which we now know is really inefficient for this use case.

If not, then I propose doing something like:

for signed_i in iter_range:
    i = np.uint64(signed_i)

This looks cleaner! I can do this.

I actually like seeing i + k everywhere...

Agreed, i + k looks better and is more explanatory than ik. I have a counter-proposal: how about j = i + k?

The reason why I propose this is, i represents the current row of the distance matrix. Because k is the current diagonal index, i + k is essentially the current column of the distance matrix. If i is the row, it makes intuitive sense to name the column j. What do you think @seanlaw ?
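The row/column reading of i and j = i + k can be sanity-checked on a toy matrix (the size and diagonal index below are arbitrary): walking diagonal k visits entry (row i, column i + k), which matches NumPy's built-in diagonal extraction.

```python
import numpy as np

n = 5
D = np.arange(n * n).reshape(n, n)  # stand-in for a small distance matrix

k = 2  # diagonal index
walked = [D[i, i + k] for i in range(n - k)]  # row i, column j = i + k

# the same entries via the built-in helper
reference = list(np.diagonal(D, offset=k))
```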

I made the code a bit cleaner, used signed_i as you suggested and used j for np.uint64(i + k). In terms of performance, the results were roughly the same. Here's what it looks like, let me know what you think.

def _compute_diagonal(...):
    n_A = T_A.shape[0]
    n_B = T_B.shape[0]
    m_inverse = 1.0 / m
    constant = (m - 1) * m_inverse * m_inverse  # (m - 1)/(m * m)

    for diag_idx in range(diags_start_idx, diags_stop_idx):
        k = diags[diag_idx]

        if k >= 0:
            iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
        else:
            iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))

        for signed_i in iter_range:
            i = np.uint64(signed_i)
            j = np.uint64(i + k)

            if i == 0 or (k < 0 and i == -k):
                cov = (
                    np.dot(
                        (T_B[j : j + np.uint64(m)] - M_T[j]),
                        (T_A[i : i + np.uint64(m)] - μ_Q[i]),
                    )
                    * m_inverse
                )
            else:
                # The next lines are equivalent and left for reference
                # cov = cov + constant * (
                #     (T_B[i + k + m - 1] - M_T_m_1[i + k])
                #     * (T_A[i + m - 1] - μ_Q_m_1[i])
                #     - (T_B[i + k - 1] - M_T_m_1[i + k]) * (T_A[i - 1] - μ_Q_m_1[i])
                # )
                cov = cov + constant * (cov_a[j] * cov_b[i] - cov_c[j] * cov_d[i])

            if T_B_subseq_isfinite[j] and T_A_subseq_isfinite[i]:
                # Neither subsequence contains NaNs
                if T_B_subseq_isconstant[j] or T_A_subseq_isconstant[i]:
                    pearson = 0.5
                else:
                    pearson = cov * Σ_T_inverse[j] * σ_Q_inverse[i]

                if T_B_subseq_isconstant[j] and T_A_subseq_isconstant[i]:
                    pearson = 1.0

                if pearson > ρ[thread_idx, i, 0]:
                    ρ[thread_idx, i, 0] = pearson
                    I[thread_idx, i, 0] = j

                if ignore_trivial:  # self-joins only
                    if pearson > ρ[thread_idx, j, 0]:
                        ρ[thread_idx, j, 0] = pearson
                        I[thread_idx, j, 0] = i

                    if i < j:
                        # left pearson correlation and left matrix profile index
                        if pearson > ρ[thread_idx, j, 1]:
                            ρ[thread_idx, j, 1] = pearson
                            I[thread_idx, j, 1] = i

                        # right pearson correlation and right matrix profile index
                        if pearson > ρ[thread_idx, i, 2]:
                            ρ[thread_idx, i, 2] = pearson
                            I[thread_idx, i, 2] = j

    return

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement | Results accurate |
| --- | --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00514 | 0.00579 | -12.7% | True |
| 2500 | 10 | 0.01070 | 0.00911 | 14.811% | True |
| 5000 | 10 | 0.02291 | 0.02104 | 8.16% | True |
| 10000 | 100 | 0.07196 | 0.05902 | 17.985% | True |
| 100000 | 100 | 6.71042 | 5.61217 | 16.366% | True |
| 100000 | 5000 | 7.37240 | 5.81313 | 21.15% | True |
| 200000 | 100 | 32.59256 | 31.21777 | 4.218% | True |
| 200000 | 5000 | 33.01471 | 31.84025 | 3.557% | True |
| 250000 | 100 | 55.54036 | 54.86022 | 1.225% | True |

alvii147 avatar Nov 01 '22 13:11 alvii147

@alvii147 There's something that doesn't seem right to me, and I'm guessing that your timing comparisons aren't capturing this because you are performing a self-join only and you aren't checking the AB-join accuracy.

For an AB-join, we'd end up doing iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)), which means that signed_i in for signed_i in iter_range: can be negative, and yet you then force it to be unsigned in the next line with i = np.uint64(signed_i) when, in fact, it must be negative. Am I missing something in the logic here? Does it make sense to allow i = np.uint64(signed_i) when signed_i is negative?

seanlaw avatar Nov 01 '22 18:11 seanlaw

So, for an AB-join, we'd end up doing iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)), which means that signed_i in for signed_i in iter_range: can be negative

I don't think so, because iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)) is only executed if k < 0.

@alvii147 There's something that doesn't seem right to me and I'm guessing that your timing comparisons aren't capturing this because you are performing a self-join only and you aren't checking the AB-join accuracy.

Good point! Here are the performance results with AB joins instead of self joins:

| Time series length | Window size | stump mean execution time (s) | stump_uint mean execution time (s) | Percentage improvement | Results accurate |
| --- | --- | --- | --- | --- | --- |
| 1000 | 10 | 0.00496 | 0.00485 | 2.305% | True |
| 2500 | 10 | 0.01112 | 0.00962 | 13.453% | True |
| 5000 | 10 | 0.02812 | 0.01997 | 28.995% | True |
| 10000 | 100 | 0.07364 | 0.05559 | 24.518% | True |
| 100000 | 100 | 6.82463 | 5.73498 | 15.966% | True |
| 100000 | 5000 | 7.58997 | 6.07498 | 19.96% | True |
| 200000 | 100 | 33.59375 | 31.84067 | 5.218% | True |
| 200000 | 5000 | 35.25538 | 33.29318 | 5.566% | True |
| 250000 | 100 | 56.13076 | 54.11020 | 3.6% | True |

In terms of accuracy, if we're in a bad spot, tests/test_stump_uint.py will let us know. :smile:

alvii147 avatar Nov 01 '22 20:11 alvii147

I don't think so, because iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)) is only executed if k < 0.

Ah gotcha! Thanks for the clarification.

seanlaw avatar Nov 01 '22 20:11 seanlaw

No worries! Any other thoughts before this PR is ready?

alvii147 avatar Nov 01 '22 21:11 alvii147