
Fix 677 fix bug in snippet module

Open NimaSarajpoor opened this issue 3 years ago • 16 comments

This PR addresses issue #677. It resolves the bug.

After solving the issue, I realized there exists another problem: in some cases, there is a (small?) difference between the outputs of the performant and naive versions. This might be related to slight imprecision in the calculation.

I added a new test function to tests/test_snippets.py with a fixed random seed so that the error is reproduced when I push the commits here.

NimaSarajpoor avatar Sep 16 '22 09:09 NimaSarajpoor

Codecov Report

Base: 99.89% // Head: 99.89% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (63caaab) compared to base (cd9c3ff). Patch coverage: 100.00% of modified lines in pull request are covered.

:exclamation: Current head 63caaab differs from pull request most recent head d3de095. Consider uploading reports for the commit d3de095 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #678   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11453    11479   +26     
=======================================
+ Hits        11441    11467   +26     
  Misses         12       12           
Impacted Files Coverage Δ
stumpy/aampdist_snippets.py 100.00% <100.00%> (ø)
stumpy/snippets.py 100.00% <100.00%> (ø)
tests/naive.py 100.00% <100.00%> (ø)
tests/test_aampdist_snippets.py 100.00% <100.00%> (ø)
tests/test_snippets.py 100.00% <100.00%> (ø)



codecov-commenter avatar Sep 16 '22 15:09 codecov-commenter

Do we need to update aamp_snippets.py as well?

I think we are good in the non-normalized version.

NimaSarajpoor avatar Sep 16 '22 17:09 NimaSarajpoor

I think we are good in the non-normalized version.

Why wouldn't we need to also do:

M_T[start : start + m],

In the non-normalized version?

seanlaw avatar Sep 16 '22 17:09 seanlaw

I think we are good in the non-normalized version.

Why wouldn't we need to also do:

M_T[start : start + m],

In the non-normalized version?

I did not go through the code thoroughly. But, if I remember correctly, I did not find M_T in the non-normalized version, and then I remembered that the non-normalized version does not have M_T (or maybe it exists in the non-normalized snippet module and I missed it).

NimaSarajpoor avatar Sep 16 '22 17:09 NimaSarajpoor

@seanlaw btw, for k=2 we get an error (PROBABLY due to imprecision). I changed it to k=1 since my goal was to fix this issue for now.

I think you may want to check the newly-added test functions with k=2 and see the error (I think I got the error for seed=15, though I am not sure). And, if you think it is related to imprecision, we can just close this PR with k=1 for the new test functions.

NimaSarajpoor avatar Sep 16 '22 17:09 NimaSarajpoor

I think we also need to revise this line of code in both the normalized and non-normalized versions:

# last for-loop in function snippets / aampdist_snippets
total_min = total_min - mask.astype(np.float64)

I think the basic idea here is to make total_min negative at the claimed positions so that they cannot be claimed again in the next iteration of the for-loop, when we do mask = snippets_profiles[i] <= total_min.

I think a better, more robust approach is to do:

total_min[mask] = np.NINF

I think total_min = total_min - mask.astype(np.float64) is effective ONLY IF total_min is less than 1. However, this is not guaranteed. So, np.NINF is better.
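As a toy sketch of what the proposed replacement guarantees (the values here are made up; -np.inf is the constant that np.NINF refers to):

```python
import numpy as np

# Hypothetical total_min values over four positions; suppose positions
# 0 and 3 were claimed by the current snippet in this iteration.
total_min = np.array([2.5, 0.2, 1.6, 0.9])
mask = np.array([True, False, False, True])

# Proposed: once a position is claimed, it can never satisfy
# `snippets_profiles[i] <= total_min` again, regardless of how large
# or small the remaining profile values are, because nothing is <= -inf.
total_min[mask] = -np.inf  # i.e., np.NINF
```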

NimaSarajpoor avatar Sep 16 '22 19:09 NimaSarajpoor

I think total_min = total_min - mask.astype(np.float64) is effective ONLY IF total_min is less than 1. However, this is not gauranteed. So, np.NINF is better.

Honestly, I don't remember this part and how it works. I'll trust your judgement but please double check

seanlaw avatar Sep 16 '22 20:09 seanlaw

@seanlaw Please feel free to check out the last test function I added to the snippet unit test. As you can see, the test fails.

NimaSarajpoor avatar Sep 16 '22 21:09 NimaSarajpoor

Please feel free to check out the last test function I added to the snippet unit test. As you can see, the test fails.

The difference in value seems far too big. Have you been able to confirm the root cause for the difference?

seanlaw avatar Sep 17 '22 00:09 seanlaw

@seanlaw

Have you been able to confirm the root cause for the difference?

So, after some investigation, I noticed that both the naive and performant versions have close values in snippets_profiles (i.e., npt.assert_almost_equal(...) does not fail for this output). However, the naive and performant versions produce different results for snippets_fractions. Note that snippets_fractions is calculated as follows:

# total_min = np.min(snippets_profiles, axis=0)

mask = snippets_profiles[i] <= total_min
snippets_fractions[i] = np.sum(mask) / total_min.shape[0]

Therefore, a small imprecision in the float values of snippets_profiles can flip a boolean value in mask, which means the numerator np.sum(mask) can change by 1 (a big jump). So, if you remove the constant denominator total_min.shape[0] and check the values of snippets_fractions again (so basically just np.sum(mask)), you can see the difference is exactly 1 unit! I do not have the values right now, but if I remember correctly, the ref (naive) counts were 28, 19, 10 and the comp (performant) counts were 27, 20, 10. So, there is 1 unit of difference for the k=3 snippets.
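To illustrate with hypothetical numbers how a ~1e-07 error shifts the count by a full unit:

```python
import numpy as np

# One snippet's distance profile over 4 positions: the naive version
# computes an exact 0.0 at index 2, the performant version does not.
total_min = np.array([0.5, 1.2, 0.0, 0.8])
ref_profile = np.array([0.5, 1.3, 0.0, 0.9])      # naive
cmp_profile = np.array([0.5, 1.3, 1.4e-07, 0.9])  # performant (imprecise)

ref_count = np.sum(ref_profile <= total_min)  # 2 (indices 0 and 2)
cmp_count = np.sum(cmp_profile <= total_min)  # 1 (index 0 only)

# One flipped boolean changes np.sum(mask) by 1, and therefore changes
# snippets_fractions by 1 / total_min.shape[0].
```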

I may investigate further to find the exact element of snippets_profiles that yields a different boolean value between the two (naive and performant) versions when we do snippets_profiles[i] <= total_min.


Update And, this is what I found after further investigation:

>>> ref_profiles[:, 35] 
array([0.       , 0.       , 0.0361013])

>>> cmp_profiles[:, 35]
array([1.94138742e-06, 1.38988779e-07, 3.61012982e-02])

Note that in line 275 of stumpy/snippets.py (of this branch), we can see total_min = np.min(snippets_profiles, axis=0). Hence:

ref_profiles --> total_min[35] is 0
cmp_profiles --> total_min[35] is 1.38988779e-07

And thus, in the first iteration of the for-loop in lines 277-282, mask[35] is True in the naive version but False in the performant version (consistent with the 28 vs 27 counts above).
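This can be verified directly from the two quoted columns:

```python
import numpy as np

ref_col = np.array([0.0, 0.0, 0.0361013])        # ref_profiles[:, 35]
cmp_col = np.array(
    [1.94138742e-06, 1.38988779e-07, 3.61012982e-02]
)                                                # cmp_profiles[:, 35]

# total_min = np.min(snippets_profiles, axis=0), evaluated at column 35:
ref_total_min = np.min(ref_col)  # 0.0
cmp_total_min = np.min(cmp_col)  # 1.38988779e-07

# mask[35] in the first iteration (i = 0) of the for-loop:
ref_mask_35 = ref_col[0] <= ref_total_min  # 0.0 <= 0.0           -> True
cmp_mask_35 = cmp_col[0] <= cmp_total_min  # 1.94e-06 <= 1.39e-07 -> False
```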

So, to wrap up: imprecision

NimaSarajpoor avatar Sep 17 '22 01:09 NimaSarajpoor

So, to wrap up: imprecision

Technically speaking, which one is actually "off"? Is it the naive version or the performant version?

seanlaw avatar Sep 17 '22 12:09 seanlaw

which one is actually "off"? Is it the naive version or the performant version?

It is the performant version. To be more precise, the distance profile between the query T[40:43] and T must be 0 at index 40 (because they are the same!!). However, this value, obtained through core._mass(...), is about 1.9e-06 instead. I then investigated further and realized that the pearson calculated in stumpy/core.py::_calculate_squared_distance between T[40:43] and T[40:43] becomes 0.9999999999994555 (it should be exactly 1.0).

So, what can we do?

  • Add a threshold for pearson in stumpy/core.py::_calculate_squared_distance.
  • ~Replace the numba fastmath=True flag with a less restrictive set of flags, e.g. avoid arcp and reassoc (but then we would need to compare the computation time of the new and old versions)~ [DOES NOT HELP]

side note In stumpy/core.py::_calculate_squared_distance, we have the following line:

D_squared = np.abs(2 * m * (1.0 - (QT - m * μ_Q * M_T) / denom))

However, I think we should better do this:

pearson = (QT - m * μ_Q * M_T) / denom
if pearson > 1.0:
    pearson = 1.0
D_squared = np.abs(2 * m * (1.0 - pearson))

Because, now, if pearson becomes bigger than 1.0, say 1.00001, we reset it to 1.0, which makes D_squared exactly zero. However, the previous code gives a non-zero value for D_squared.
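A minimal sketch of the difference (the function names here are illustrative, not stumpy's API):

```python
import numpy as np

m = 3  # window size

def d_squared_old(pearson):
    # Current behaviour: np.abs keeps an overshoot positive but nonzero.
    return np.abs(2 * m * (1.0 - pearson))

def d_squared_capped(pearson):
    # Proposed behaviour: cap pearson at 1.0 before converting to a distance.
    if pearson > 1.0:
        pearson = 1.0
    return np.abs(2 * m * (1.0 - pearson))

# A self-match whose correlation overshoots 1.0 due to floating point:
d_old = d_squared_old(1.00001)     # ~6e-05, a spurious nonzero distance
d_new = d_squared_capped(1.00001)  # exactly 0.0
```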

NimaSarajpoor avatar Sep 18 '22 04:09 NimaSarajpoor

@seanlaw

which one is actually "off"? Is it the naive version or the performant version?

It is the performant version. To be more precise, the distance profile between query T[40:43] and T must be 0 at index 40 (because they are the same!!)

The query T[40:43] is Q = np.array([-840.66778922, -886.33759718, -843.33818273]). Now, if we use the formula:

denom = m * σ_Q * Σ_T
pearson = (QT - m * μ_Q * M_T) / denom  # in this case, QT can be calculated as follows np.dot(Q, Q) 

to calculate pearson, we can see that it is not exactly 1.0. Hence, the pearson computation itself introduces imprecision into the pearson-distance relationship.
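A quick sketch of that cancellation (μ and σ are computed directly with NumPy here, rather than with stumpy's rolling helpers):

```python
import numpy as np

# The query from the discussion, compared against itself.
Q = np.array([-840.66778922, -886.33759718, -843.33818273])
m = Q.shape[0]

mu = Q.mean()
sigma = Q.std()
QT = np.dot(Q, Q)

# For a self-comparison, μ_Q == M_T and σ_Q == Σ_T:
denom = m * sigma * sigma
pearson = (QT - m * mu * mu) / denom

# QT and m * mu * mu are both ~2.2e6 and nearly equal, so the subtraction
# cancels most of the significant digits; mathematically pearson is exactly
# 1.0, but in float64 it can land slightly off 1.0.
```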

NimaSarajpoor avatar Sep 18 '22 07:09 NimaSarajpoor

Because, now, if pearson becomes bigger than 1.0, say 1.00001, we reset it to 1.0, which makes D_squared zero. However, previous code gives non-zero value for D_squared.

Awesome. I like that! Please add it. It should be this, right:

if pearson > config.STUMPY_CORRELATION_THRESHOLD:
    pearson = 1.0

Btw, should there be an equivalent lower bound? Could we use np.clip? Should this go in PR #668?

seanlaw avatar Sep 18 '22 13:09 seanlaw

It should be this, right:

if pearson > config.STUMPY_CORRELATION_THRESHOLD:
    pearson = 1.0

Correct. It can now resolve the following issue as well:

and I realized the calculated pearson, in stumpy/core.py::_calculate_squared_distance, between T[40:43] and T[40:43] becomes 0.9999999999994555 (should have been 1.0).


Btw, should there be an equivalent lower bound? Could we use np.clip?

Technically yes. But, practically speaking, we should be okay if we do not use it. For instance, let's assume the subsequence S_i and its 1NN have a pearson correlation of -1. And, let's say there is another subsequence, S_j, whose (imprecisely computed) pearson correlation with its 1NN is -1.002. If we do not replace -1.002 with -1.0, we get a distance that is bigger than the true maximum, and we may choose S_j as the first discord before considering S_i.

If you still think we should consider lower bound, it should be okay to just do:

if pearson < -1.0:
    pearson = -1.0
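For reference, np.clip applies both bounds in one call (the values here are illustrative), though inside numba-compiled scalar code the explicit if comparisons above may be the simpler choice:

```python
import numpy as np

# An overshoot, a normal value, and an undershoot:
pearson = np.array([1.00001, 0.5, -1.002])
clipped = np.clip(pearson, -1.0, 1.0)  # [1.0, 0.5, -1.0]
```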

Should this go in PR https://github.com/TDAmeritrade/stumpy/pull/668?

The problem in this PR is in mass and the problem in PR #668 is in _stump. However, they are related because the root cause in both cases is the pearson correlation. So, I can modify both functions, _calculate_squared_distance and _stump, in PR #668 (and then add the last test function in test_snippets.py in PR #668 as well).

What do you think?

NimaSarajpoor avatar Sep 19 '22 00:09 NimaSarajpoor

The problem in this PR is in mass and the problem in PR https://github.com/TDAmeritrade/stumpy/pull/668 is in _stump. However, they are somehow related because the main cause is the pearson correlation in both cases. so, I can modify both functions _calculate_squared_distance and _stump in PR https://github.com/TDAmeritrade/stumpy/pull/668. (and then add the last test function in test_snippets.py in the PR https://github.com/TDAmeritrade/stumpy/pull/668 .

I'd say please modify both functions _calculate_squared_distance and _stump in PR #668, but add the last test function here in #677.

seanlaw avatar Sep 19 '22 01:09 seanlaw

side note In stumpy/core.py::_calculate_squared_distance, we have the following line:

D_squared = np.abs(2 * m * (1.0 - (QT - m * μ_Q * M_T) / denom))

However, I think we should better do this:

pearson = (QT - m * μ_Q * M_T) / denom
if pearson > 1.0:
    pearson = 1.0
D_squared = np.abs(2 * m * (1.0 - pearson))

Because, now, if pearson becomes bigger than 1.0, say 1.00001, we reset it to 1.0, which makes D_squared exactly zero. However, the previous code gives a non-zero value for D_squared.

I think we addressed it.

https://github.com/TDAmeritrade/stumpy/blob/1556d7e392ee87fe924e25059a8c6607dab047eb/stumpy/core.py#L997-L998

@seanlaw

which one is actually "off"? Is it the naive version or the performant version?

It is the performant version. To be more precise, the distance profile between query T[40:43] and T must be 0 at index 40 (because they are the same!!)

The query T[40:43] is Q = np.array([-840.66778922, -886.33759718, -843.33818273]). Now, if we use the formula:

denom = m * σ_Q * Σ_T
pearson = (QT - m * μ_Q * M_T) / denom  # in this case, QT can be calculated as np.dot(Q, Q)

to calculate pearson, we can see that it is not 1.0. Hence, the pearson-distance relationship itself causes imprecision.

We addressed it recently for the snippet module, where we reset the distance of a subsequence to itself to 0.


I think you can close this PR.

NimaSarajpoor avatar Apr 19 '23 16:04 NimaSarajpoor