stumpy
[WIP] Unsigned Integer Indexing Optimizations (#658)
Added my first attempt at optimizing `stump` using unsigned integer indexing. I added a temporary `stumpy.stump_uint` module, which (if successfully optimized) may replace `stump`. But I haven't had much luck so far, maybe someone could help me out here. Here's what I've done so far and the results:
Time series length | Window size | `stump` mean execution time (s) | `stump_uint` mean execution time (s) | Percentage improvement |
---|---|---|---|---|
1000 | 10 | 0.00966 | 0.01066 | -10.35% |
2500 | 10 | 0.02866 | 0.02933 | -2.323% |
5000 | 10 | 0.079 | 0.095 | -20.252% |
10000 | 100 | 0.27566 | 0.30233 | -9.674% |
100000 | 100 | 26.511 | 27.163 | -2.458% |
100000 | 5000 | 25.706 | 27.825 | -8.245% |
200000 | 100 | 112.324 | 108.871 | 3.074% |
200000 | 5000 | 105.614 | 104.584 | 0.976% |
250000 | 100 | 169.058 | 169.176 | -0.07% |
Doesn't look like there's much of a difference just yet, but I'm nearly certain that's just me optimizing the wrong parts. Let me know if you have feedback, and feel free to give this a shot yourself.
Oh, and here's a diff based on what I've done, since these are separate files and GitHub won't show you a diff.
Thanks @alvii147!
@NimaSarajpoor Do you have any time to take a first pass and review? Right now, it doesn't look like there is any speed up
@alvii147 Can you fix the formatting so that the `black` checks pass? Btw, I don't see anything obvious that you are doing wrong 😢
@alvii147 @seanlaw So, I took a look at the code and worked with it a little bit. I tried different things but didn't see improvement! ~I have a couple of comments that I am going to share with you.~
@seanlaw @alvii147 Okay...so I had some comments but apparently I was wrong. I am doing some investigation. Will update you if I find something.
@alvii147 Can you fix the formatting so that the black checks pass?
@seanlaw done! Also temporarily added `test_stump_uint.py`, which is really just a copy of `test_stump.py`.
I am doing some investigation. Will update you if I find something.
@NimaSarajpoor thanks! Let me know what you find!
Codecov Report
Base: 99.89% // Head: 99.89% // Increases project coverage by +0.00% :tada:
Coverage data is based on head (fa9c113) compared to base (208b796). Patch coverage: 100.00% of modified lines in pull request are covered.
Additional details and impacted files
@@ Coverage Diff @@
## main #659 +/- ##
=======================================
Coverage 99.89% 99.89%
=======================================
Files 80 80
Lines 11453 11460 +7
=======================================
+ Hits 11441 11448 +7
Misses 12 12
Impacted Files | Coverage Δ |
---|---|
stumpy/aamp.py | 100.00% <100.00%> (ø) |
stumpy/stump.py | 100.00% <100.00%> (ø) |
This is what I got: [performance plots comparing `stump` and `stump_uint` omitted]
So, it seems there is improvement when the length of `T` is small. However, when `T` gets longer, the gap widens and `np.int64` performs better. Maybe this helps us understand what is going on. It is worth mentioning that this conclusion is slightly different from what one may infer from the table provided above, because according to the table there is a little improvement when `T` becomes longer. That is not the case in my results. @alvii147: Did you use a fixed `seed`?
# performance code
import time
import numpy as np
import stumpy

# set `seed`, and `m`
# n: a list of values indicating the length of `T`
stump_ComputeTime = []
stump_uint_ComputeTime = []
for n_T in n:
    np.random.seed(seed)
    T = np.random.rand(n_T)
    local_lst = []
    for _ in range(3):
        tic = time.time()
        stumpy.stump(T, m)
        toc = time.time()
        local_lst.append(toc - tic)
    # drop the first (numba compile/warm-up) run and average the rest
    stump_ComputeTime.append(np.mean(local_lst[1:]))
    #######################
    # we do not need to re-create `T`. I just repeated it here
    np.random.seed(seed)
    T = np.random.rand(n_T)
    local_lst = []
    for _ in range(3):
        tic = time.time()
        stumpy.stump_uint(T, m)
        toc = time.time()
        local_lst.append(toc - tic)
    stump_uint_ComputeTime.append(np.mean(local_lst[1:]))
I also tried the following changes but they did not help:
- use `np.uint64` instead of `numba.uint64` (I wanted to make it similar to `stump_tiles`)
- move `m = np.uint64(m)` to the outside of the outer loop
- just keep `ik = np.uint(i + k)`
according to the table, there is a little improvement when T becomes longer
@NimaSarajpoor I think the differences in that table are too insignificant to conclude that one takes longer than the other. If anything, I think it shows that it's pretty much the same.
Did you use a fixed seed?
No, I've just been randomly generating time series, executing `stump` and `stump_uint` 3-4 times each, and taking the mean of the execution times (this is probably a good place for a t-based significance test).
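For reference, a minimal sketch of what such a significance test could look like, using Welch's t statistic over two sets of timing samples. The sample values below are made up for illustration, not real measurements:

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for two independent timing samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

# Hypothetical timing samples (seconds); real ones would come from
# repeatedly timing stump and stump_uint on the same input.
stump_times = [0.0097, 0.0095, 0.0099, 0.0096, 0.0098]
stump_uint_times = [0.0106, 0.0104, 0.0108, 0.0105, 0.0107]

t = welch_t(stump_times, stump_uint_times)
print(f"t = {t:.2f}")
```

A |t| well above ~2 (for a handful of samples) suggests the mean difference is unlikely to be timing noise; values near zero mean the two are indistinguishable.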
@NimaSarajpoor your plots show that we're probably not utilizing the `uint` trick properly. Either that, or there is not much room for improvement in `stump`. But it's too early to surrender, we can keep trying. I'll try to do this step by step and see which change takes more time than the others.
Thanks a lot for your help!
@alvii147 @seanlaw Okay...I got something, and I hope you can work on it to see if we can make it better / cleaner:
# stump_uint_v3
for diag_idx in range(diags_start_idx, diags_stop_idx):
k = diags[diag_idx]
if k >= 0:
iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k)) # USED range, and REMOVED dtype=np.uint64
else:
iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))
m = np.uint64(m)
for i in iter_range:
ii = np.uint64(i)
ik = np.uint64(i + k)
if ii == 0 or ik == 0:
cov = (
np.dot((T_B[ik : ik + m] - M_T[ik]), (T_A[ii : ii + m] - μ_Q[ii]))
* m_inverse
)
else:
###
I think the main factor in the improvement was using `range(...)` instead of `np.arange(..., dtype=np.uint64)`.
@NimaSarajpoor wow that looks promising! I didn't think using `np.arange` would be slower than using `range` and then converting to `np.uint64`.
@NimaSarajpoor wow that looks promising! I didn't think using `np.arange` would be slower than using `range` and then converting to `np.uint64`.
Good catch @NimaSarajpoor! I'm realizing that `np.arange` would end up needlessly creating a new array every single time. With Python's `range`, only a small lazy sequence object is created, so it does not need to allocate much memory and is much more efficient.
If we only need to call `np.arange` once or twice and the array isn't long, then it's probably not a big deal (i.e., at the beginning of a function). However, if `np.arange` is called repeatedly (in a private function), then it likely affects performance! Now, let's be sure to compare Python `range` with `np.int64` vs `np.uint64` please!
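The allocation cost is easy to see even outside of numba; a rough, machine-dependent sketch in plain Python (absolute timings will vary, but the ordering should not):

```python
import timeit
import numpy as np

n = 10_000

# np.arange allocates and fills a fresh n-element array on every call ...
t_arange = timeit.timeit(lambda: np.arange(n, dtype=np.uint64), number=1000)

# ... while range() only constructs a tiny lazy sequence object,
# whose cost is independent of n.
t_range = timeit.timeit(lambda: range(n), number=1000)

print(f"np.arange: {t_arange:.5f}s  range: {t_range:.5f}s")
```

Inside an njit function the effect should be even more pronounced, since numba can compile a `range` loop down to a plain counter.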
@alvii147
@NimaSarajpoor wow that looks promising! I didn't think using np.arange would be slower than using range and then converting to np.uint64.
Yeah...I had no idea as well... I was just trying things :)
Good catch @NimaSarajpoor! I'm realizing that `np.arange` would end up needlessly creating a new array every single time. With Python `range`, this creates a Python generator and does not need to allocate much memory and is much more efficient.
Ah...right! Maybe that is the reason. Thanks for the explanation
Now, let's be sure to compare Python range with np.int64 vs np.uint64 please!
Just a side note: one might try replacing `np.arange` with `range` in `stump_tiles` and see if it gains better performance.
About replacing `np.arange` with `range`:
I assume that the numba compiler can infer more information from `range`. From the compiler's point of view, the output of `np.arange` could be an arbitrary array with arbitrary content, whereas a `range` is a well-defined iterator with a known start, end, step size, and dtype.
Regarding the performance plots:
Please use logarithmic axes and a larger dynamic range for the input size. In logarithmic plots, it is easier to spot scalability, and constant speedups are still visible. Relevant Python package: https://github.com/nschloe/perfplot
Regarding the dtype optimizations:
I once had good success by porting my Python code to Cython, and then back to numba. The Cython compiler can show you useful information for each line of code about the inferred data types: https://cython.readthedocs.io/en/latest/src/quickstart/build.html#using-the-jupyter-notebook If the Cython compiler can infer the dtypes and compile the code to native code, then the chances are high that the numba compiler is also able to generate nice output.
It seems that numba also has some annotation capability: https://numba.readthedocs.io/en/stable/user/cli.html?highlight=annotate
Hi @alvii147 Just checking in here. Any updates?
Hey @seanlaw sorry I've been caught up with school, no new updates as of now. I'll start digging into this again in the coming weeks.
@NimaSarajpoor I tried to follow your changes here but didn't really see that big of an improvement.
Here's what I tried:
for diag_idx in range(diags_start_idx, diags_stop_idx):
k = diags[diag_idx]
if k >= 0:
iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
else:
iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))
for i in iter_range:
ii = np.uint64(i)
ik = np.uint64(i + k)
ikm = np.uint64(i + k + m)
im = np.uint64(i + m)
if ii == 0 or (k < 0 and ii == -k):
cov = np.dot((T_B[ik:ikm] - M_T[ik]), (T_A[ii:im] - μ_Q[ii])) * m_inverse
Here are the results:
Time series length | Window size | `stump` mean execution time (s) | `stump_uint` mean execution time (s) | Percentage improvement | Results accurate |
---|---|---|---|---|---|
1000 | 10 | 0.00512 | 0.00533 | -4.081% | True |
2500 | 10 | 0.00972 | 0.00945 | 2.7% | True |
5000 | 10 | 0.02288 | 0.02252 | 1.561% | True |
10000 | 100 | 0.06956 | 0.07159 | -2.924% | True |
100000 | 100 | 6.73673 | 7.08620 | -5.187% | True |
100000 | 5000 | 7.25523 | 7.19030 | 0.895% | True |
200000 | 100 | 32.85791 | 32.80870 | 0.15% | True |
200000 | 5000 | 32.97569 | 33.48455 | -1.543% | True |
250000 | 100 | 55.83575 | 55.31517 | 0.932% | True |
Any ideas what we can try from here?
@alvii147 I tried to make the changes again and I got the following results on my PC:
# n: length of time series
# m: window size
n | m | main (s) | uint (s) | Improvement (%) |
---|---|---|---|---|
10k | 50 | 0.21 | 0.15 | 28.57142857 |
20k | 50 | 0.77 | 0.51 | 33.76623377 |
25k | 1000 | 1.15 | 0.81 | 29.56521739 |
50k | 50 | 4.82 | 3.44 | 28.63070539 |
50k | 5000 | 6.4 | 4.85 | 24.21875 |
I have provided my code below with some comments:
# in stumpy/stump.py
def _compute_diagonal(...):
n_A = T_A.shape[0]
n_B = T_B.shape[0]
m_inverse = 1.0 / m
constant = (m - 1) * m_inverse * m_inverse # (m - 1)/(m * m)
m = np.uint64(m) # JUST PUT IT OUTSIDE OF FOR-LOOP
for diag_idx in range(diags_start_idx, diags_stop_idx):
k = diags[diag_idx]
if k >= 0:
iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
else:
iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))
for i in iter_range:
# JUST calling np.uint64 twice here.
ii = np.uint64(i)
ik = np.uint64(i + k)
if ii == 0 or ik == 0: # CONDITIONS SIMPLIFIED (WE DO NOT NEED `k < 0` )
# ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
# FOR INSTANCE: `i + k` is replaced with `ik`
cov = (
np.dot(
(T_B[ik : ik + m] - M_T[ik]), (T_A[ii : ii + m] - μ_Q[ii])
)
* m_inverse
)
else:
# ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
cov = cov + constant * (
cov_a[ik] * cov_b[ii] - cov_c[ik] * cov_d[ii]
)
# ALSO: PAY ATTENTION TO OTHER SMALL CHANGES BELOW....
if T_B_subseq_isfinite[ik] and T_A_subseq_isfinite[ii]:
# Neither subsequence contains NaNs
if T_B_subseq_isconstant[ik] or T_A_subseq_isconstant[ii]:
pearson = 0.5
else:
pearson = cov * Σ_T_inverse[ik] * σ_Q_inverse[ii]
if T_B_subseq_isconstant[ik] and T_A_subseq_isconstant[ii]:
pearson = 1.0
if pearson > ρ[thread_idx, ii, 0]:
ρ[thread_idx, ii, 0] = pearson
I[thread_idx, ii, 0] = ik
if ignore_trivial: # self-joins only
if pearson > ρ[thread_idx, ik, 0]:
ρ[thread_idx, ik, 0] = pearson
I[thread_idx, ik, 0] = ii
if ii < ik:
# left pearson correlation and left matrix profile index
if pearson > ρ[thread_idx, ik, 1]:
ρ[thread_idx, ik, 1] = pearson
I[thread_idx, ik, 1] = ii
# right pearson correlation and right matrix profile index
if pearson > ρ[thread_idx, ii, 2]:
ρ[thread_idx, ii, 2] = pearson
I[thread_idx, ii, 2] = ik
return
I haven't played with it yet...so I do not know what particular change has the most impact here.
Also, the performance code:
import time
import numpy as np
import stumpy

# set `n` (length of time series) and `m` (window size) before running
seed = 0
np.random.seed(seed)
T = np.random.rand(n)
stumpy.stump(T, m)  # warm-up run to trigger the numba JIT compile
n_iter = 20
lst = []
for itr in range(n_iter):
    tic = time.time()
    stumpy.stump(T, m)
    toc = time.time()
    lst.append(toc - tic)
a = np.array(lst)
np.median(a)
# WHY MEDIAN? BECAUSE I THINK IT IS MORE ROBUST!
Also, small things might affect the computing time, for example:
- whether your laptop is plugged in or unplugged (it can change the power available to the CPU, and hence its performance)
- whether you run lots of iterations back to back for both versions of `stump` and `stump_uint` (in some cases, the computing time in later iterations may become higher). So, it is safer to run the performance check for each version separately, and then take the median (I say median because, in a few cases, the computing time suddenly gets high and then goes back to normal, so I think `median` is better than `mean`)
I tried to be careful about the two aforementioned items (I was consistent) when I was checking the performance of the two versions.
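As a made-up illustration of the point about the median (the numbers below are hypothetical, not real timings), a single background-load spike barely moves the median but drags the mean far from the typical run time:

```python
import numpy as np

# Hypothetical timing samples (seconds): one run hit a background-load spike.
timings = np.array([1.02, 0.98, 1.01, 0.99, 4.73])

# The single outlier pulls the mean well above the typical run time ...
mean_t = timings.mean()        # ~1.746
# ... while the median stays at the typical value.
median_t = np.median(timings)  # 1.01

print(f"mean={mean_t:.3f}s median={median_t:.3f}s")
```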
Oops, my bad, looks like I screwed up my code for performance evaluation. It's looking better now:
Time series length | Window size | `stump` mean execution time (s) | `stump_uint` mean execution time (s) | Percentage improvement | Results accurate |
---|---|---|---|---|---|
1000 | 10 | 0.00502 | 0.00476 | 5.126% | True |
2500 | 10 | 0.00977 | 0.00797 | 18.373% | True |
5000 | 10 | 0.02324 | 0.01986 | 14.543% | True |
10000 | 100 | 0.07620 | 0.05531 | 27.415% | True |
100000 | 100 | 7.21882 | 5.64659 | 21.78% | True |
100000 | 5000 | 7.79295 | 6.33472 | 18.712% | True |
200000 | 100 | 36.84522 | 33.80169 | 8.26% | True |
200000 | 5000 | 36.13178 | 34.10573 | 5.607% | True |
250000 | 100 | 60.89817 | 58.27914 | 4.301% | True |
It looks like the performance gap closes as we use larger time series. I can somewhat make sense of that: the performance difference between `int` and `uint` may become more insignificant as the overall matrix profile computation gets heavier.
@seanlaw what do you think?
Also, what's up with this? Am I missing a dependency?
It looks like the performance gap closes as we use larger time series. I can somewhat make sense of that, the performance difference between int and uint may be more insignificant as the overall matrix profile computation gets heavier. @seanlaw what do you think?
That's much better. However, can we simply see/understand how much of this improvement is coming from the use of `range` by itself? I really don't like seeing:
ii = np.uint64(i)
ik = np.uint64(i + k)
as I find it really ugly 😢
Also, what's up with this? Am I missing a dependency?
There seems to be an issue with the `distributed` package. I filed an issue here and hopefully a fix is coming soon.
There seems to be an issue with the distributed package. I filed an issue (https://github.com/dask/distributed/issues/7227) and hopefully a fix is coming soon
This has been resolved upstream in Dask. My apologies for the issue. The previous release has been yanked from PyPI and a new release has been published. It is on PyPI now and should be on conda-forge shortly.
@seanlaw
However, can we simply see/understand how much of this improvement is coming from the use of `range` by itself?
I think what @alvii147 is trying to do here is optimizing the existing version of `stump`, which already uses `range`.
I really don't like seeing:
ii = np.uint64(i)
ik = np.uint64(i + k)
as I find it really ugly
Based on what I did, we have two main changes here:
- removing the condition `k < 0` (regardless of using `np.uint`)
- using `ii = np.uint64(i)` and `ik = np.uint64(i + k)`
NOTE: Better names might be `i_uint` and `ik_sum` (but they are still not pretty)
I think what @alvii147 is trying to do here is optimizing the existing version of stump, which already uses range.
Exactly, @seanlaw, `range` by itself is what `stump` currently uses. The difference is the usage of `uint64`.
I really don't like seeing:
ii = np.uint64(i)
ik = np.uint64(i + k)
as I find it really ugly 😢
I agree this looks sketchy, the naming was just for testing. We can change this to something like `i_uint` and `ik_uint`?
removing the condition k < 0 (regardless of using np.uint)
@NimaSarajpoor this was an interesting optimization, my bad, I forgot to include this in my last commit.
Now that I think about this change, do you think this should be a separate issue? That is, an issue that suggests using:
if i == 0 or i == -k:
instead of
if i == 0 or (k < 0 and i == -k)
I don't think there's much difference in performance with this, but what you suggested looks cleaner. And correct me if I'm wrong @seanlaw, but if `k > 0` then we will never satisfy `i == -k` anyway, as `i` should be positive.
I don't think there's much difference in performance with this, but what you suggested looks cleaner. And correct me if I'm wrong @seanlaw, but if k > 0 then we will never satisfy i == -k anyway, as i should be positive.
@alvii147 Frankly, I don't recall the reasoning for `if i == 0 or (k < 0 and i == -k)` as I did this a long time ago. I just remember that it was a necessary check.
I think what @alvii147 is trying to do here is optimizing the existing version of stump, which already uses range.
Thanks for the reminder. I had to go and refresh my memory!
- Can you ensure that `diags` returns unsigned ints so that `k` is an unsigned int (and can then be added to an unsigned `i`) (see below)
- Instead of converting single values into uint, can you cast the entire `iter_range` to uint in one shot? If not, then I propose doing something like:

for signed_i in iter_range:
    i = np.uint64(signed_i)

I actually like seeing `i + k` everywhere...
@alvii147 Frankly, I don't recall the reasoning for if i == 0 or (k < 0 and i == -k) as I did this a long time ago. I just remember that it was a necessary check.
I'm nearly certain the `k < 0` is not necessary, whether you look at it with or without context.
Without context, `i` is always nonnegative. Hence if `i == -k`, `k` will always be negative. The exception is when `i = k = 0`, but the `i == 0` case is already covered by the first part of the statement.
With context, `i == -k` is just checking whether or not we are in the left-most column of the distance matrix, and if so, we'd have to use the dot product to compute the distance, as opposed to using the diagonal property. When this is the case, `k` should always be negative anyway, as we would be dealing with a negative diagonal index.
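For what it's worth, the equivalence can be checked by brute force over the nonnegative `i` values the diagonal loop can actually produce; a quick sketch:

```python
# Brute-force check that, for nonnegative i, dropping the `k < 0` guard
# never changes which branch is taken.
for k in range(-50, 51):
    for i in range(0, 51):
        original = (i == 0 or (k < 0 and i == -k))
        simplified = (i == 0 or i == -k)
        assert original == simplified, (i, k)
print("equivalent for all i >= 0")
```

The only way `i == -k` can hold with `i > 0` is when `k` is already negative, so the extra guard is redundant.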
Can you ensure that diags returns unsigned ints so that k is an unsigned int (and can then be added to an unsigned i (see below)
Unfortunately that can't happen, as `diags` can be negative, hence `k` can be negative. `i + k`, on the other hand, should and will always be nonnegative.
Instead of converting single values into uint, can you cast the entire iter_range to uint in one shot?
I was unable to find a good way to do this. If anything, we'd be going back to using `np.arange`, which we now know is really inefficient for this use case.
If not, then I propose doing something like:
for signed_i in iter_range:
    i = np.uint64(signed_i)
This looks cleaner! I can do this.
I actually like seeing i + k everywhere...
Agreed, `i + k` looks better and is more explanatory than `ik`. I have a counter-proposal: how about `j = i + k`?
The reason I propose this is that `i` represents the current row of the distance matrix. Because `k` is the current diagonal index, `i + k` is essentially the current column of the distance matrix. If `i` is the row, it makes intuitive sense to name the column `j`. What do you think @seanlaw?
I made the code a bit cleaner, used `signed_i` as you suggested, and used `j` for `np.uint64(i + k)`. In terms of performance, the results were roughly the same. Here's what it looks like, let me know what you think.
def _compute_diagonal(...):
n_A = T_A.shape[0]
n_B = T_B.shape[0]
m_inverse = 1.0 / m
constant = (m - 1) * m_inverse * m_inverse # (m - 1)/(m * m)
for diag_idx in range(diags_start_idx, diags_stop_idx):
k = diags[diag_idx]
if k >= 0:
iter_range = range(0, min(n_A - m + 1, n_B - m + 1 - k))
else:
iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))
for signed_i in iter_range:
i = np.uint64(signed_i)
j = np.uint64(i + k)
if i == 0 or (k < 0 and i == -k):
cov = (
np.dot(
(T_B[j : j + np.uint64(m)] - M_T[j]),
(T_A[i : i + np.uint64(m)] - μ_Q[i]),
)
* m_inverse
)
else:
# The next lines are equivalent and left for reference
# cov = cov + constant * (
# (T_B[i + k + m - 1] - M_T_m_1[i + k])
# * (T_A[i + m - 1] - μ_Q_m_1[i])
# - (T_B[i + k - 1] - M_T_m_1[i + k]) * (T_A[i - 1] - μ_Q_m_1[i])
# )
cov = cov + constant * (cov_a[j] * cov_b[i] - cov_c[j] * cov_d[i])
if T_B_subseq_isfinite[j] and T_A_subseq_isfinite[i]:
# Neither subsequence contains NaNs
if T_B_subseq_isconstant[j] or T_A_subseq_isconstant[i]:
pearson = 0.5
else:
pearson = cov * Σ_T_inverse[j] * σ_Q_inverse[i]
if T_B_subseq_isconstant[j] and T_A_subseq_isconstant[i]:
pearson = 1.0
if pearson > ρ[thread_idx, i, 0]:
ρ[thread_idx, i, 0] = pearson
I[thread_idx, i, 0] = j
if ignore_trivial: # self-joins only
if pearson > ρ[thread_idx, j, 0]:
ρ[thread_idx, j, 0] = pearson
I[thread_idx, j, 0] = i
if i < j:
# left pearson correlation and left matrix profile index
if pearson > ρ[thread_idx, j, 1]:
ρ[thread_idx, j, 1] = pearson
I[thread_idx, j, 1] = i
# right pearson correlation and right matrix profile index
if pearson > ρ[thread_idx, i, 2]:
ρ[thread_idx, i, 2] = pearson
I[thread_idx, i, 2] = j
return
Time series length | Window size | `stump` mean execution time (s) | `stump_uint` mean execution time (s) | Percentage improvement | Results accurate |
---|---|---|---|---|---|
1000 | 10 | 0.00514 | 0.00579 | -12.7% | True |
2500 | 10 | 0.01070 | 0.00911 | 14.811% | True |
5000 | 10 | 0.02291 | 0.02104 | 8.16% | True |
10000 | 100 | 0.07196 | 0.05902 | 17.985% | True |
100000 | 100 | 6.71042 | 5.61217 | 16.366% | True |
100000 | 5000 | 7.37240 | 5.81313 | 21.15% | True |
200000 | 100 | 32.59256 | 31.21777 | 4.218% | True |
200000 | 5000 | 33.01471 | 31.84025 | 3.557% | True |
250000 | 100 | 55.54036 | 54.86022 | 1.225% | True |
@alvii147 There's something that doesn't seem right to me, and I'm guessing that your timing comparisons aren't capturing this because you are performing a self-join only and you aren't checking the AB-join accuracy. So, for an AB-join, we'd end up doing `iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))`, which means that `signed_i` in `for signed_i in iter_range:` can be negative, and yet you then force it to be unsigned in the next line with `i = np.uint64(signed_i)` when, in fact, it must be negative. Am I missing something in the logic here? Does it make sense to allow `i = np.uint64(signed_i)` when `signed_i` is negative?
So, for an AB-join, we'd end up doing `iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))`, which means that `signed_i` in `for signed_i in iter_range:` can be negative
I don't think so, because `iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k))` is only executed if `k < 0`.
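A quick sanity-check sketch of this (using hypothetical series lengths `n_A`, `n_B` and window size `m`, not values from the actual tests): in the `k < 0` branch, `range(-k, ...)` starts at a positive value, so both `signed_i` and `signed_i + k` are always nonnegative before the `np.uint64` casts.

```python
# In the k < 0 branch, range(-k, stop) starts at -k > 0, so both i and
# i + k stay nonnegative and the uint64 casts are safe.
n_A = n_B = 100   # hypothetical series lengths
m = 10            # hypothetical window size

for k in range(-(n_A - m), 0):  # negative diagonals only
    for i in range(-k, min(n_A - m + 1, n_B - m + 1 - k)):
        assert i >= 0 and i + k >= 0, (i, k)
print("all indices nonnegative")
```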
@alvii147 There's something that doesn't seem right to me and I'm guessing that your timing comparisons aren't capturing this because you are performing a self-join only and you aren't checking the AB-join accuracy.
Good point! Here are the performance results with AB joins instead of self joins:
Time series length | Window size | stump mean execution time |
stump_uint mean execution time |
Percentage improvement | Results accurate |
---|---|---|---|---|---|
1000 | 10 | 0.00496 | 0.00485 | 2.305% | True |
2500 | 10 | 0.01112 | 0.00962 | 13.453% | True |
5000 | 10 | 0.02812 | 0.01997 | 28.995% | True |
10000 | 100 | 0.07364 | 0.05559 | 24.518% | True |
100000 | 100 | 6.82463 | 5.73498 | 15.966% | True |
100000 | 5000 | 7.58997 | 6.07498 | 19.96% | True |
200000 | 100 | 33.59375 | 31.84067 | 5.218% | True |
200000 | 5000 | 35.25538 | 33.29318 | 5.566% | True |
250000 | 100 | 56.13076 | 54.11020 | 3.6% | True |
In terms of accuracy, if we're in a bad spot, `tests/test_stump_uint.py` will let us know. :smile:
I don't think so, because iter_range = range(-k, min(n_A - m + 1, n_B - m + 1 - k)) is only executed if k < 0.
Ah gotcha! Thanks for the clarification.
No worries! Any other thoughts before this PR is ready?