stumpy icon indicating copy to clipboard operation
stumpy copied to clipboard

Fix precision

Open NimaSarajpoor opened this issue 3 years ago • 35 comments

This PR is a replacement for PR #657, which tackles issue #610.

Regarding PR #657: There was a discrepancy between the files in the local repo and the remote repo, but I couldn't figure that out. That is why I am submitting a new PR.

NimaSarajpoor avatar Sep 07 '22 00:09 NimaSarajpoor

@seanlaw This is strange! Now, when I run test, it says "83 files would be left unchanged". Now, this is the same as what we see here in Github Actions. And, I get no error when I run test in my local repo.

$ ./test.sh custom 1 tests/test_stump.py
Cleaning Up
Checking Black Code Formatting
All done! \u2728 \U0001f370 \u2728
83 files would be left unchanged.
Checking Flake8 Style Guide Enforcement
Executing Custom User-Defined Tests Only
Custom Test: 1 / 1
============================= test session starts =============================
platform win32 -- Python 3.9.13, pytest-7.1.2, pluggy-1.0.0
rootdir: C:\Users\AS17983\Desktop\NIMA\stump\stumpy, configfile: pytest.ini
plugins: anyio-3.6.1, cython-0.2.0
collected 27 items

tests\test_stump.py ...........................                          [100%]

============================= 27 passed in 19.44s =============================

NimaSarajpoor avatar Sep 07 '22 00:09 NimaSarajpoor

@NimaSarajpoor Instead of running test.sh, can you simply run this in a Jupyter notebook:

m = 3
zone = int(np.ceil(m / 4))
seed = 1
np.random.seed(seed)
T = np.random.rand(64)
scale = np.random.choice(np.array([0.001, 0, 1000]), len(T), replace=True)
T[:] = T * scale

ref_mp = naive.stump(T, m, exclusion_zone=zone, row_wise=True)
comp_mp = stump(T, m, ignore_trivial=True)

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    if np.abs(ref - cmp) > 1e-7:
        print(i, ref, cmp)

I get:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

If I do:

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    print(i, ref, cmp)

Then I get the following:

0 1.5700924586837752e-16 0.0
1 1.5700924586837752e-16 0.0
2 1.658550663133749e-07 1.6323404237781945e-07
3 2.4257150105478956e-07 2.421152116249788e-07
4 6.454152836022342e-07 6.452392069879462e-07
5 0.22155833569278333 0.2215583356927852
6 0.09204964741053696 0.09204964741054301
7 1.9109645596859633e-07 1.8966081829902314e-07
8 1.5700924586837752e-16 0.0
9 2.220446049250313e-16 0.0
10 0.012944862817693177 0.012944867345093258
11 0.49025821651994234 0.4902582165199456
12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
14 3.3457789477422945e-05 3.3457703025089785e-05
15 0.26636729814247995 0.26636729659228936
16 3.656951842685141e-08 0.0
17 0.46944298605172013 0.4694429860517184
18 0.03103461052726588 0.031034609854201676
19 6.434999199613844e-07 0.0
20 5.535501007487065e-08 0.0
21 8.19261056164571e-10 0.0
22 5.364470225196175e-07 0.0
23 8.030629626643827e-07 8.021753331240875e-07
24 1.9109645596859633e-07 1.8966081829902314e-07
25 1.5700924586837752e-16 0.0
26 0.0 0.0
27 2.53526511563009e-07 2.0157917229039495e-07
28 8.553525901666268e-08 0.0
29 4.6884218468212213e-07 4.6814316726937914e-07
30 0.10353933637363609 0.10353933656145922
31 0.1051316158248435 0.10513161293237559
32 0.48220646477596524 0.48220646544989193
33 8.523796584076095e-07 0.0
34 1.7325318446717188e-06 1.7323200730701582e-06
35 1.5414353887031432e-06 0.0
36 0.04048989128516726 0.040489912232822924
37 3.656951842685141e-08 0.0
38 3.656951842685141e-08 0.0
39 0.26636729814247995 0.26636729659228936
40 1.658550663133749e-07 0.0
41 4.6884218468212213e-07 4.6814316726937914e-07
42 6.460441508916396e-06 6.458583395855192e-06
43 0.03103461052726588 0.031034609854201676
44 0.10353933637363609 0.10353933656145922
45 0.09204964741053696 0.09204964741054301
46 8.19261056164571e-10 0.0
47 8.523796584076095e-07 8.218635459624336e-07
48 0.1051316158248435 0.10513161293237559
49 0.07416506824467556 0.07416506824466002
50 0.012944862817693177 0.012944867345093258
51 0.48220646477596524 0.48220646544989193
52 0.42052087653681697 0.42052087653681747
53 5.535501007487065e-08 0.0
54 2.7194799110210365e-16 0.0
55 0.0 0.0
56 2.760744472197097e-06 2.7529265174893867e-06
57 0.07416506824467556 0.07416506824466002
58 2.3357370178318246e-07 2.264780425067345e-07
59 0.45226821973180753 0.4522682194826431
60 2.486220486843771e-07 0.0
61 1.5700924586837752e-16 0.0

seanlaw avatar Sep 07 '22 01:09 seanlaw

I get:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

I got a different result:

33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
41 4.6884218468212213e-07 0.0
42 6.460441508916396e-06 0.0
56 2.760744472197097e-06 0.0
60 2.486220486843771e-07 0.0

Btw, I compared the my local branch and the remote branch in origin (for both main and fix_precision) to make sure everything is the same. I used the instructions provided in this stackoverflow post. Everything is okay.

NimaSarajpoor avatar Sep 07 '22 11:09 NimaSarajpoor

I got a different result:

I have a few observations:

  1. For testing stump, should we really be setting row_wise=True in naive.stump? Since stump is diagonal-wise, shouldn't we set row_wise=False? This doesn't solve the problem but it is inconsistent
  2. 12 2.950287948697726e-05 0.0 means that naive.stump is returning a non-zero distance for idx = 12 in Github Actions (and in my local branch of your repo). Can you verify what the pairwise distances are for idx = 12? Here are my pairwise distances generated from naive.stump:
0 3.0000184482883
1 3.689691694550767e-05
2 2.999981187305404
3 3.0000188123406852
4 3.807877925156031e-05
5 0.9716146011928827
6 3.365385804112601
7 3.00001930941092
8 3.0000184482883
9 3.689691694550766e-05
10 2.7130038423554645
11 1.293751201679764
12 0.0
13 2.9999872380440378
14 3.0000368085323212
15 1.9609012370025143
16 2.999981551371355
17 2.644560505532764
18 0.4142995718310496
19 3.5963320995608045e-05
20 2.9999819288490794
21 3.0000186910572095
22 3.743336396799641e-05
23 2.9999807857662883
24 3.0000192138645323
25 3.0000184482883
26 3.689691694550767e-05
27 3.660682091553413e-05
28 2.9999816124248677
29 3.0000181726311808
30 0.6555850486723159
31 2.8129002184921874
32 3.373545663525392
33 3.9431652231321966e-05
34 2.9999799194833465
35 3.0000200801134347
36 2.979568418597036
37 3.686034742708289e-05
38 2.999981569656451
39 1.735512606848563
40 2.999981270234494
41 3.000017938214321
42 2.950287948697726e-05
43 0.3834710886174519
44 0.7569603930026928
45 3.342379253523891
46 3.000018690647586
47 4.028403188967316e-05
48 2.8729632528630487
49 0.27224456911612394
50 2.7210340045887635
51 3.2311659747454886
52 0.7619746520574434
53 2.9999819565270847
54 3.0000184482882997
55 3.689691694550767e-05
56 2.9999785390824387
57 0.3461178433786073
58 2.9999820733160356
59 2.316949684170337
60 2.9999817367381696
61 3.0000184482883

I generated this by adding the following three lines to naive.stump:

def stump(T_A, m, T_B=None, exclusion_zone=None, row_wise=False):
    if T_B is None:  # self-join:
        ignore_trivial = True
        distance_matrix = np.array(
            [distance_profile(Q, T_A, m) for Q in core.rolling_window(T_A, m)]
        )
        T_B = T_A.copy()
    else:
        ignore_trivial = False
        distance_matrix = np.array(
            [distance_profile(Q, T_B, m) for Q in core.rolling_window(T_A, m)]
        )

    distance_matrix[np.isnan(distance_matrix)] = np.inf

    # Added these three lines
    if distance_matrix.shape[0] >= 12:
        for i in range(distance_matrix.shape[1]):
            print(i, distance_matrix[12, i])

According to this, subsequence with idx = 42 corresponds to the one-nearest neighbor to idx = 12

seanlaw avatar Sep 07 '22 14:09 seanlaw

  1. For testing stump, should we really be setting row_wise=True in naive.stump? Since stump is diagonal-wise, shouldn't we set row_wise=False? This doesn't solve the problem but it is inconsistent

I think we should. btw, I tried row_wise=False before and then I realized we can set it to row_wise=True IF what we care about is just the matrix profile (and not the matrix profile indices). I mean...in such case, row_wise=True should be okay. However, as you said, we should better set it to row_wise=False for the sake of consistency.

PLEASE IGNORE THE FOLLOWING COMMENT

2. Here are my pairwise distances generated from naive.stump:

from naive.stump or stumpy.stump? Because, the value you provided for idx=12 is 0.0, which is not the same as what you provided earlier: 12 2.950287948697726e-05 0.0 (I think the last value here, i.e. 0.0, is from stumpy.stump)

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.


UPDATE: Okay, I think you set row_wise=False now. Right?

NimaSarajpoor avatar Sep 07 '22 14:09 NimaSarajpoor

from naive.stump or stumpy.stump? Because, the value you provided for idx=12 is 0.0, which is not the same as what you provided earlier: 12 2.950287948697726e-05 0.0 (I think the last value here, i.e. 0.0, is from stumpy.stump)

From naive.stump. If you look at my last output, you should NOT be looking at 12 0.0, which is the self-match/trivial match. Instead, you should be looking at 42 2.950287948697726e-05

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.

This suggests that stumpy.stump is different and not returning 0.0 for some reason

seanlaw avatar Sep 07 '22 14:09 seanlaw

From naive.stump. If you look at my last output, you should NOT be looking at 12 0.0, which is the self-match/trivial match. Instead, you should be looking at 42 2.950287948697726e-05

Just got that! You are right!

In my local repo, I got: 12 2.950287948697726e-05 2.950306866348452e-05.

This suggests that stumpy.stump is different and not returning 0.0 for some reason

That is probably correct. I am going to do what you suggested... in the meantime, could you please remove @njit decorator from stump and provide me with the results of:

for i, (ref, cmp) in enumerate(zip(ref_mp[:, 0], comp_mp[:, 0])):
    if np.abs(ref - cmp) > 1e-7:
        print(i, ref, cmp)

?

Maybe numba performs differently on different platforms... btw, the code in this branch -in my local repo- still fails for seed=26 !!!


Do you feel we are putting too much energy on this? I mean....how do you feel about it? :)

NimaSarajpoor avatar Sep 07 '22 14:09 NimaSarajpoor

2. an you verify what the pairwise distances are for idx = 12? Here are my pairwise distances generated from naive.stump:

0 3.0000184482883
1 3.689691694550767e-05
2 2.999981187305404
3 3.0000188123406852
4 3.807877925156031e-05
5 0.9716146011928827
6 3.365385804112601
7 3.00001930941092
8 3.0000184482883
9 3.689691694550766e-05
10 2.7130038423554645
11 1.293751201679764
12 0.0
13 2.9999872380440378
14 3.0000368085323212
15 1.9609012370025143
16 2.999981551371355
17 2.644560505532764
18 0.4142995718310496
19 3.5963320995608045e-05
20 2.9999819288490794
21 3.0000186910572095
22 3.743336396799641e-05
23 2.9999807857662883
24 3.0000192138645323
25 3.0000184482883
26 3.689691694550767e-05
27 3.660682091553413e-05
28 2.9999816124248677
29 3.0000181726311808
30 0.6555850486723159
31 2.8129002184921874
32 3.373545663525392
33 3.9431652231321966e-05
34 2.9999799194833465
35 3.0000200801134347
36 2.979568418597036
37 3.686034742708289e-05
38 2.999981569656451
39 1.735512606848563
40 2.999981270234494
41 3.000017938214321
42 2.950287948697726e-05
43 0.3834710886174519
44 0.7569603930026928
45 3.342379253523891
46 3.000018690647586
47 4.028403188967316e-05
48 2.8729632528630487
49 0.27224456911612394
50 2.7210340045887635
51 3.2311659747454886
52 0.7619746520574434
53 2.9999819565270847
54 3.0000184482882997
55 3.689691694550767e-05
56 2.9999785390824387
57 0.3461178433786073
58 2.9999820733160356
59 2.316949684170337
60 2.9999817367381696
61 3.0000184482883

NimaSarajpoor avatar Sep 07 '22 14:09 NimaSarajpoor

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

NimaSarajpoor avatar Sep 07 '22 14:09 NimaSarajpoor

in the meantime, could you please remove @njit decorator from stump and provide me with the results of:

After commenting out @njit for _stump and _compute_diagonal, I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

Do you feel we are putting too much energy on this? I mean....how do you feel about it? :)

I do not feel this way at all! This shouldn't be happening and it would be good to track it down

seanlaw avatar Sep 07 '22 14:09 seanlaw

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

The good thing is that we've isolated to the idx=12 and idx=42 pair. Somehow, your stump is computing 2.950306866348452e-05, which is very close to the same answer as naive.stump. Note that we are seeing the same test failures in Github Actions but all of them are only failing (first) for Ubuntu and Mac OS. Can you check if it fails for Windows in Github Actions (i.e., does it get past `test_stump.py for Windows)?

seanlaw avatar Sep 07 '22 15:09 seanlaw

I think they are the same. So, as you suggested, this matter is related to stumpy.stump

The good thing is that we've isolated to the idx=12 and idx=42 pair. Somehow, your stump is computing 2.950306866348452e-05, which is very close to the same answer as naive.stump. Note that we are seeing the same test failures in Github Actions but all of them are only failing (first) for Ubuntu and Mac OS.

Can you check if it fails for Windows in Github Actions (i.e., does it get past `test_stump.py for Windows)?

I think it passes

============================= test session starts =============================
platform win32 -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0
rootdir: D:\a\stumpy\stumpy, configfile: pytest.ini
collected 87 items

tests\test_stump.py ...........................                          [ 31%]
tests\test_mstump.py ............................................        [ 81%]
tests\test_stumpi.py ................                                    [100%]

============================= 87 passed in 29.12s =============================

The operation got cancelled for other versions of python.

NimaSarajpoor avatar Sep 07 '22 15:09 NimaSarajpoor

in the meantime, could you please remove @njit decorator from stump and provide me with the results of:

After commenting out @njit for _stump and _compute_diagonal, I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
60 2.486220486843771e-07 0.0

And I got:

12 2.950287948697726e-05 0.0
13 1.0329297510446037e-05 0.0
19 6.434999199613844e-07 0.0
22 5.364470225196175e-07 0.0
33 8.523796584076095e-07 0.0
35 1.5414353887031432e-06 0.0
40 1.658550663133749e-07 0.0
47 8.523796584076095e-07 0.0
56 2.760744472197097e-06 0.0
60 2.486220486843771e-07 0.0

Now, without Numba, the values that correspond to idx=12 are the same as what you got. (Note that we STILL use numba to get mean and std. So, MAYBE that is the reason behind the discrepancy between your result and mine) But, this can show that something is happening in numba. Right?

NimaSarajpoor avatar Sep 07 '22 15:09 NimaSarajpoor

@seanlaw It might be a good idea to create a new branch and just have a directory tests that only contains test_stump with only the test function of our interest. I think this can help us detect the problem more quickly and come up with a MWE if needed.

NimaSarajpoor avatar Sep 07 '22 16:09 NimaSarajpoor

Codecov Report

Base: 99.89% // Head: 99.89% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (49651ff) compared to base (cd9c3ff). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #668   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11453    11538   +85     
=======================================
+ Hits        11441    11526   +85     
  Misses         12       12           
Impacted Files Coverage Δ
stumpy/scrump.py 100.00% <ø> (ø)
stumpy/config.py 100.00% <100.00%> (ø)
stumpy/core.py 100.00% <100.00%> (ø)
stumpy/stump.py 100.00% <100.00%> (ø)
stumpy/stumped.py 100.00% <100.00%> (ø)
tests/naive.py 100.00% <100.00%> (ø)
tests/test_core.py 100.00% <100.00%> (ø)
tests/test_stump.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov-commenter avatar Sep 07 '22 16:09 codecov-commenter

I get:

12 2.950287948697726e-05 0.0

Could you please let me know what is the nn of 12 in your stumpy.stump? (I think you are using Mac. right?)

Explanation In naive, the nn is 42. In my local repo, I got 42 as well when I used stumpy.stump. So, I used the subseqs at indices 12 and 42 to create a new time series T that is used in the new test function that I pushed. However, it seems all tests are passing.

So, maybe the T that I used is not a good input to reproduce that error. Or, maybe 42 is NOT the nn of 12 in your end! That is why I asked if you could let me know the nn index of subseq at index 12.

NimaSarajpoor avatar Sep 07 '22 17:09 NimaSarajpoor

Could you please let me know what is the nn of 12 in your stumpy.stump? (I think you are using Mac. right?)

When I do:

comp_mp[12]

I get:

array([0.0, 9, 9, 42], dtype=object)

So, stumpy.stump (with @njit ON) produces a left and global nearest neighbor of 9 while the right nearest neighbor is 42

When I print their z-norm subsequences, I get:

# idx = 9
stumpy.core.z_norm(T[9:9+m])
array([-0.70710678, -0.70710678,  1.41421356])

# idx = 12
stumpy.core.z_norm(T[12:12+m])
array([-0.70713287, -0.70708069,  1.41421356])

# idx = 42
stumpy.core.z_norm(T[42:42+m])
array([-0.70711201, -0.70710155,  1.41421356])

You can see that they are somewhat close. The last (3rd) element is identical but the first two elements are only slightly different. stumpy.stump is interpreting idx = 9 as being essentially identical to idx = 12

seanlaw avatar Sep 07 '22 18:09 seanlaw

@seanlaw

We just got an error caused by slight imprecision in identical case in a about-to-be-merged PR. (see error)

I was wondering if we should partially (?) solve this as this error appears from time to time in the merging process. I have a new idea but have not implemented it yet. The concept goes as follows:

if pearson > pearson_threshold:
     # S_A: z-norm of subseq in T_A
     # S_B: z-norm of subseq in T_B
     if np.all(S_A == S_B):
            pearson = 1.0

Do you think the np.all(...) can be time consuming?

NimaSarajpoor avatar Sep 13 '22 16:09 NimaSarajpoor

Do you think the np.all(...) can be time consuming?

Instead of np.all(S_A == S_B), I think you'd rather use np.allclose (see docs). The S_A == S_B can be expensive because you are forced to always compare EVERY element in S_A with S_B. I'd be willing to bet that np.allclose short circuits and stops checking if the first elements are not equal and is therefore faster.

Regarding the timing, I don't know how time consuming it would be and we should simply check with some different inputs. First:

T_random = np.random.rand(100_000)

then:

t = np.linspace(0, 50*(2*np.pi), 100_000)
T_sine = np.sin(t)  # Essentially, a perfect sine wave repeated 50 times so the check above is very costly

Finally, a mix of both:

percent = 0.1
T_random_sine = np.where(T_random < percent, T_random, T_sine)  # Essentially, insert random values only a `percent` of the time

Compare their performance with the check.

seanlaw avatar Sep 13 '22 16:09 seanlaw

I'd be willing to bet that np.allclose short circuits and stops checking if the first elements are not equal and is therefore faster.

Hmm, according to this, np.allclose uses all(A == B) behind the scenes. I wonder what numba has implemented behind the scenes. Otherwise, we might want to implement the function ourselves in core.py but that knows to short circuit at the sign of the first mismatch.

seanlaw avatar Sep 13 '22 16:09 seanlaw

@seanlaw Thanks for sharing the links. I took a look at them.

Hmm, according to this, np.allclose uses all(A == B) behind the scenes

I think you are right. I took a look at the source code and I think they use np.all. I will update you after I do some work on this :)

NimaSarajpoor avatar Sep 13 '22 17:09 NimaSarajpoor

side note: It might be worthwhile to note that even small error in $\rho$ can result in large error in z-normalized Euclidean distance.

Example:

m = 3 # size of window
ρ_exact = 0.99999999 --> dist_exact = 0.0002
ρ_with_imprecision = 0.99999998 --> dist_with_imprecision  = 0.0003

It might help us later with understanding the imprecisions.

NimaSarajpoor avatar Sep 17 '22 05:09 NimaSarajpoor

It might be worthwhile to note that even small error in can result in large error in z-normalized Euclidean distance.

Hmm, should we have a config.STUMPY_RHO_THRESHOLD instead of the config.STUMPY_CORRELATION_THRESHOLD? Both?

seanlaw avatar Sep 17 '22 12:09 seanlaw

should we have a config.STUMPY_RHO_THRESHOLD instead of the config.STUMPY_CORRELATION_THRESHOLD?

Did you mean config.STUMPY_DISTANCE_THRESHOLD? (I mean.... aren't RHO and CORRELATION the same thing?)


Also, note that the following numbers

Example:

m = 3 # size of window
ρ_exact = 0.99999999 --> dist_exact = 0.0002
ρ_with_imprecision = 0.99999998 --> dist_with_imprecision  = 0.0003

depend on m. You can see that if we change m from 3 to 100, we will have:

> m = 100 # size of window
> ρ_exact = 0.99999999 --> dist_exact = 0.00144
> ρ_with_imprecision = 0.99999998 --> dist_with_imprecision  = 0.00199

Now, it is worse!

So, MAYBE we should set a threshold for pearson, then calculate its corresponding threshold for distance and use it in naive version (to set the distance value below threshold to 0) just for the sake of unit testing.

NimaSarajpoor avatar Sep 18 '22 04:09 NimaSarajpoor

Did you mean config.STUMPY_DISTANCE_THRESHOLD? (I mean.... aren't RHO and CORRELATION the same thing?)

Oops, you're right!

So, MAYBE we should set a threshold for pearson, then calculate its corresponding threshold for distance and use it in naive version (to set the distance value below threshold to 0) just for the sake of unit testing.

Doesn't this mean that config.STUMPY_CORRELATION_THRESHOLD is too small? Maybe we set it to 1 - 1e-06 instead?

Can you please clarify if the issue is in the naive or performant version?

seanlaw avatar Sep 18 '22 13:09 seanlaw

Doesn't this mean that config.STUMPY_CORRELATION_THRESHOLD is too small? Maybe we set it to 1 - 1e-06 instead?

Can you please clarify if the issue is in the naive or performant version?

I think I tried 1 - 1e-06 once but did not help. I will try again. I will also investigate to find the main cause of discrepancy between naive and performant versions.

NimaSarajpoor avatar Sep 19 '22 01:09 NimaSarajpoor

Did you mean config.STUMPY_DISTANCE_THRESHOLD? (I mean.... aren't RHO and CORRELATION the same thing?)

Oops, you're right

I am thinking about adding something like config.STUMPY_DISTANCE_THRESHOLD=1e-3. I will push commits when I figure things out and then you can see the changes.

NimaSarajpoor avatar Sep 19 '22 01:09 NimaSarajpoor

config.STUMPY_DISTANCE_THRESHOLD=1e-3

What is this for? What will it be used to control or compare against? Where would you use it in the codebase?

seanlaw avatar Sep 19 '22 01:09 seanlaw

config.STUMPY_DISTANCE_THRESHOLD=1e-3

What is this for? What will it be used to control or compare against? Where would you use it in the codebase?

So, this is what I think: Maybe, we should just set pearson to 1.0 when it exceeds a threshold, instead of re-calculating it. Then, we create a new config variable config.STUMPY_DISTANCE_THRESHOLD which is a function of m and the config.STUMPY_PEARSON_THRESHOLD. For instance, if I set config.STUMPY_PEARSON_THRESHOLD to 1-1e-6 and m is 3, then the corresponding config.STUMPY_DISTANCE_THRESHOLD becomes 0.077459. So, we can use it to modify the distance values obtained by naive version. If it becomes below 0.077459, we set it to 0.0.

Then, we compare mp of naive and mp of performant.

(I have not implemented this yet)

NimaSarajpoor avatar Sep 19 '22 02:09 NimaSarajpoor

Maybe, we should just set pearson to 1.0 when it exceeds a threshold, instead of re-calculating it. Then, we create a new config variable config.STUMPY_DISTANCE_THRESHOLD which is a function of m and the config.STUMPY_PEARSON_THRESHOLD. For instance, if I set config.STUMPY_PEARSON_THRESHOLD to 1-1e-6 and m is 3, then the corresponding config.STUMPY_DISTANCE_THRESHOLD becomes 0.077459. So, we can use it to modify the distance values obtained by naive version. If it becomes below 0.077459, we set it to 0.0.

I don't know. This config.STUMPY_DISTANCE_THRESHOLD feels wrong. Additionally, I'm not seeing how you are arriving at 0.077459 from m = 3 and config.STUMPY_PEARSON_THRESHOLD = 1-1e-6.

What I'm observing is that even when config.STUMPY_PEARSON_THRESHOLD = 1-1e-8 (and m = 3), the z-norm distance is still not close to zero.

import math 

p = 1-1e-8
m = 3
math.sqrt(2 * m * (1 - p))  # 0.00024494897489372265

0.00024494897489372265 is definitely not zero

seanlaw avatar Sep 19 '22 13:09 seanlaw