
[WIP] Add Top-K Nearest Neighbors Matrix Profile #592

Open NimaSarajpoor opened this issue 3 years ago • 264 comments

This PR addresses issue #592. In this PR, we want to extend the function stump and the related functions so that they return the top-k nearest neighbors matrix profile (i.e., the k smallest values of each distance profile and their corresponding indices).

What I have done so far:

  • Wrote a naïve implementation of stump_topk
  • Provided a new unit test for function stump

Notes: (1) np.searchsorted() is used in naive.stump_topk. Another naive approach is to traverse the distance matrix row by row and use np.argpartition() followed by np.argsort().

(2) I think keeping the top-k values in the first k columns of the output of stump_topk is cleaner, and it also makes it easier for the user to access those top-k values.

I can/will think about a clean way to restructure the output just before returning it so that the first four columns become the same for all k. However, I think that would make the output harder to use later in other modules.
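For reference, a minimal sketch of the np.argpartition/np.argsort route mentioned above (the helper name topk_from_row is made up for illustration):

```python
import numpy as np

def topk_from_row(D_row, k):
    # Hypothetical helper: indices of the k smallest distances in one
    # row of the distance matrix, returned in ascending order of distance.
    part = np.argpartition(D_row, k - 1)[:k]  # k smallest, unordered
    return part[np.argsort(D_row[part])]      # order them ascending

D_row = np.array([5.0, 1.0, 4.0, 0.5, 3.0])
idx = topk_from_row(D_row, 3)
# idx -> [3, 1, 4]  (distances 0.5, 1.0, 3.0)
```

np.argpartition does only the O(n) partial work needed to isolate the k smallest values; the final O(k log k) sort is then applied to just those k entries.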


To-Do List

  • [x] Add parameter k to stump
  • [x] Add parameter k to stumped
  • [x] Add parameter k to gpu_stump
  • [x] Add parameter k to scrump
  • [x] Add parameter k to stumpi
  • [ ] Run published tutorials to see if they still work with no problem

Side notes:

  • Add parameter k to non-normalized versions.
  • Check docstrings for functions that require matrix profile P as input. For instance, in stumpy.motifs, we may say something like... "P is (top-1) 1D array matrix profile"

NimaSarajpoor avatar Apr 28 '22 19:04 NimaSarajpoor

@seanlaw I addressed the comments and changed the files accordingly.

NimaSarajpoor avatar Apr 30 '22 05:04 NimaSarajpoor

@seanlaw Also, I did an investigation and, as you said, it is possible that we have different indices when there is a tie. Now, the question is: "Should we really care about it?" Because, at the end of the day, what matters is the values of matrix profile P. Right?

For the unit test, we can make sure the P values corresponding to those indices are close to each other (e.g., within 1e-6). What do you think?

NimaSarajpoor avatar Apr 30 '22 05:04 NimaSarajpoor

Also, I did an investigation and, as you said, it is possible that we have different indices when there is a tie. Now, the question is: "Should we really care about it?" Because, at the end of the day, what matters is the values of matrix profile P. Right?

If possible, we should try to account for the indices in the naive implementation. For naive.stump, I wonder if we can use np.lexsort to sort on multiple columns after we compute the distance profile? What order do we expect the indices to be returned in?

For unit test, we can make sure the P values corresponding to those indices are close to each other (e.g. less than 1e-6). What do you think?

No, we should avoid this for this PR. This should only be a last resort as it will hide other issues. We will keep it in mind but 99% of the time, this is not needed. Btw, since the input arrays in the unit tests are random, I will usually run them 1,000+ times on my machine for new features/tests (especially if there is a big change in the test) before I push the commit. So I don't just run it once and assume that it's fine.

seanlaw avatar Apr 30 '22 12:04 seanlaw

I wonder if we can use np.lexsort to sort on multiple columns after we compute the distance profile? What order do we expect the indices to be returned in?

For naive.stump, with row-wise traversal of the distance matrix (i.e., the implementation that I pushed), we can do np.argsort(..., kind="stable"). I also checked the documentation of np.lexsort and it should work as well.

What order do we expect the indices to be returned in?

Now that we use np.argsort(..., kind="stable"), or alternatively np.lexsort, we can expect that the original order of indices is preserved throughout sorting.
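To illustrate with a small made-up array: kind="stable" keeps tied values in their original relative order, and np.lexsort with the index as a secondary key agrees:

```python
import numpy as np

d = np.array([0.5, 0.2, 0.5, 0.2])

stable = np.argsort(d, kind="stable")     # ties keep their original order
lex = np.lexsort((np.arange(len(d)), d))  # primary key d, ties broken by index

assert np.array_equal(stable, lex)
# both are [1, 3, 0, 2]
```

Note that np.lexsort treats its LAST key as the primary sort key, which is why the index array comes first in the tuple.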

No, we should avoid this for this PR. This should only be a last resort as it will hide other issues. We will keep it in mind but 99% of the time, this is not needed.

Correct! If there is no tie, the option kind="stable" may not be necessary. That is why I avoided it. And, if there is a tie, we will # ignore indices like you did in test_stump.py for cases with some identical subsequences.


Please correct me if I am wrong... I think I am getting your point. You are trying to make sure that naive.stump works correctly with no hidden flaws. Then, later, we can trust it and do # ignore indices when testing stump with identical subsequences.

NimaSarajpoor avatar Apr 30 '22 15:04 NimaSarajpoor

Now that we use np.argsort(..., kind= "stable"), or alternatively np.lexsort, we can expect that the original order of indices are preserved throughout sorting.

Note that kind="stable" only implies that, given the same input, the output is deterministic. However, it does not imply that it will necessarily match STUMPY's deterministic order. For example, you can have a sorting algorithm that is still considered "stable" if it always returns the second (or even last) index when there is a tie. To be consistent with other parts of STUMPY, please use kind="mergesort" instead.

I think I am getting your point. You are trying to make sure that naive.stump works correctly with no hidden flaws. And, then, later, we can trust it and then do # ignore indices in case of testing stump with identical subsequences.

No, not quite. Currently, I believe that naive.stump (not your version) is guaranteed to return the same indices as stumpy.stump even if there are multiple identical nearest neighbor matches. We want to maintain this and continue asserting that the indices are the same between the naive version and the fast version. When possible, we should never ignore the indices!

seanlaw avatar Apr 30 '22 17:04 seanlaw

Note that kind="stable" only implies that, given the same input, the output is deterministic. However, it does not imply that it will necessarily match STUMPY's deterministic order. For example, you can have a sorting algorithm that is still considered "stable" if it always returns the second (or even last) index when there is a tie.

I did not know that! Interesting.... Thanks for the info.

Currently, I believe that naive.stump (not your version) is guaranteed to return the same indices as stumpy.stump even if there are multiple identical nearest neighbor matches

It is supposed to be guaranteed (I think, because they have similar traversal methods), but it is not:

import numpy as np
import stumpy

seed = 0
np.random.seed(seed)
T = np.random.uniform(-100.0, 100.0, 20)
m = 3

seed = 1
np.random.seed(seed)
identical = np.random.uniform(-100.0, 100.0, m)

T[0:0+m] = identical
T[10:10+m] = identical
T[17:17+m] = identical

mp = stumpy.stump(T, m)
I = mp[:, 1]
# I[10] is 0

naive_mp = original_naive_stump(T, m)  # using naive.stump()
naive_I = naive_mp[:, 1]
# naive_I[10] is 17

But, the indices of stumpy.stump and naive.stump should have been exactly the same. Right? Maybe I made a mistake somewhere. It would be great if you could try this example out on your end.

NimaSarajpoor avatar Apr 30 '22 17:04 NimaSarajpoor

Our naive approach does NOT need to follow STUMPY's diagonal traversal.

Also, note that the way we traverse matters...

[figure: the distance matrix, annotated with row-wise vs. diagonal traversal order]

If we go row by row, the 5th distance profile is: A-B-C-D-E. However, if we go diagonally, we have C-D-B-E-A. So, I think we had better stay with diagonal traversal in the naive version. (Or, we could use an if-condition to check for ties and, if there is a tie, choose the one with the lower index while traversing diagonally.)


As you said, naive.stump (not my version) should provide the same indices because it uses diagonal traversal, similar to stumpy.stump. However, we just saw that the indices sometimes differ (see my previous comment). Maybe something goes wrong when stumpy.stump distributes the diagonals among threads, or maybe it is due to small numerical errors (i.e., precision)? I have not investigated it yet.

NimaSarajpoor avatar Apr 30 '22 17:04 NimaSarajpoor

Or, we may use if-condition to check ties and if there is a tie, choose the one with lower index while traversing diagonally

No, traversing diagonally should solve all of your problems. If you store a distance and later find another distance that is the same, you do not replace the first index no matter what that index is. You only replace the index if there is a smaller distance. The order in which the distances are encountered is key. This should still hold true for any k if done correctly.
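The tie rule above can be sketched as follows (names hypothetical; the real stump update is more involved):

```python
import numpy as np

def update(P, I, i, j, d):
    # Only a strictly smaller distance replaces the stored value, so the
    # first-encountered index (in traversal order) wins on ties.
    if d < P[i]:
        P[i] = d
        I[i] = j

P, I = np.array([np.inf]), np.array([-1])
update(P, I, 0, 5, 1.0)
update(P, I, 0, 9, 1.0)   # tie: index stays 5
assert I[0] == 5
update(P, I, 0, 2, 0.5)   # strictly smaller: replaced
assert P[0] == 0.5 and I[0] == 2
```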

seanlaw avatar Apr 30 '22 22:04 seanlaw

I am trying to clarify things and catch up with you. So, if you do not mind, please allow me to do a quick recap to make sure I am on the same page as you.

(1) We would like to believe that the existing stumpy.stump and naive.stump give the same P and I EVEN IF there are identical subsequences. NOTE: I found some counter examples.

(2) For now, let us assume (1) is correct and everything is 100% OK. If we change naive.stump to traverse the matrix row by row (which is different from stumpy.stump, which traverses diagonally), we would like to believe that they give the same P and I regardless of traversal method. NOTE: after investigating, I realized I was wrong: different traversal methods give different matrix profile indices.


Therefore:

You only replace the index if there is a smaller distance. The order in which the distances are encountered is key

Exactly! I have the same thought. If we plan to traverse the distance matrix diagonally in the function stumpy.stump, we need to traverse diagonally in the test function as well (and not row by row) so that we can expect them to give us the same matrix profile indices.

EVEN IF we traverse diagonally in both stumpy.stump and naive.stump, we may still end up with two different matrix profile indices when there are ties.

(Btw, sorry for making you tired on this.)

NimaSarajpoor avatar May 01 '22 01:05 NimaSarajpoor

(1) We would like to believe that existing stumpy.stump and naive.stump give the same P and I EVEN IF there are identical subsequences. NOTE: I found some counter examples

Yes, it's less a belief and more an expectation. What are the counter examples?

(2) For now, let us assume (1) is correct and everything is 100% OK. If we change the naive.stump to traverse matrix row by row. (which is different than stumpy.stump that traverses diagonally), we would like to believe that they give the same P and I regardless of their traversal method. NOTE: after the investigation, I realized that I was wrong and different traversal methods give different matrix profile indices.

So, I don't expect row-wise traversal to give the same result as diagonal-wise traversal. In fact, IIRC, I updated naive.stump from row-wise traversal to diagonal-wise traversal a while back for this reason.

EVEN IF we traverse diagonally in both stumpy.stump and naive.stump, we may still end up with two different matrix profile indices when there are ties.

Why? How can this be possible?

seanlaw avatar May 01 '22 09:05 seanlaw

Btw, you may want to add a naive.searchsorted something like:

# tests/naive.py
import numpy as np

def searchsorted(a, v):
    # index of the first element in `a` that is greater than `v`
    indices = np.flatnonzero(v < a)
    if len(indices):
        return indices.min()
    else:
        return len(a)
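For what it's worth, on a sorted 1-D input this naive version appears to behave like np.searchsorted(..., side='right'):

```python
import numpy as np

def searchsorted(a, v):
    # same sketch as above, repeated here for a quick self-contained check
    indices = np.flatnonzero(v < a)
    if len(indices):
        return indices.min()
    return len(a)

a = np.array([0.1, 0.2, 0.2, 0.7])
for v in [0.0, 0.2, 0.5, 1.0]:
    assert searchsorted(a, v) == np.searchsorted(a, v, side="right")
```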

seanlaw avatar May 01 '22 09:05 seanlaw

Btw, note that stumpy.gpu_stump uses row-wise traversal. So, we may consider adding a diagonal_traversal=False flag to naive.stump in the future, which does row-wise traversal rather than diagonal-wise traversal.

seanlaw avatar May 01 '22 09:05 seanlaw

What are the counter examples?

EVEN IF we traverse diagonally in both stumpy.stump and naive.stump, we may still end up with two different matrix profile indices when there are ties.

Why? How can this be possible?

I am providing one example below.

After further investigation, I think I found the cause of the difference. It seems the discrepancy between stumpy.stump and naive.stump has its root in the numerical error of the Pearson correlation in stumpy.stump (e.g., 1.0000000004 versus 1.0000000007).

import numpy as np
import stumpy

seed = 0
np.random.seed(seed)
T = np.random.uniform(-100.0, 100.0, 1000)
m = 50

seed = 1
np.random.seed(seed)
identical = np.random.uniform(-100.0, 100.0, m)

identical_start_idx = [0, 150, 300, 450, 600, 750, 900]
for idx in identical_start_idx:
    T[idx:idx + m] = identical

mp = stumpy.stump(T, m)
I = mp[:, 1]

naive_mp = original_naive_stump(T, m)  # using naive.stump()
naive_I = naive_mp[:, 1]

# discrepancy
[(idx, naive_I[idx], I[idx]) for idx in np.flatnonzero(naive_I - I != 0)]
>>>
[(0, 150, 300),
 (300, 150, 900),
 (450, 300, 900),
 (600, 450, 0),
 (750, 600, 300),
 (900, 750, 300)]

What do you think? Should this (small / rare?) issue be taken care of? As you correctly said, "it's less a belief and more an expectation." Now I understand that we should expect the outputs to be exactly the same since they follow a similar process (the only difference being that stumpy.stump uses Pearson correlation and converts it to a distance at the end).
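For intuition, assuming the standard conversion between Pearson correlation and z-normalized distance, d = sqrt(2m(1 - rho)), a sub-1e-9 error in rho near 1.0 becomes a much larger gap after conversion, which is enough to break a tie differently:

```python
import numpy as np

m = 50
# Two correlations that should both be exactly 1.0, differing only by a
# tiny floating-point error (hypothetical values, per the example above):
rho1 = 1.0 - 4e-10
rho2 = 1.0 - 7e-10

# Convert each correlation to a z-normalized distance
d1 = np.sqrt(2 * m * (1 - rho1))
d2 = np.sqrt(2 * m * (1 - rho2))
# The ~3e-10 gap in rho becomes a gap on the order of 1e-4 in distance,
# so a "tie" can resolve to different nearest-neighbor indices.
```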


So, I don't expect row-wise traversal would give the same result as diagonal-wise traversal. In fact, IIRC, I updated naive.stump from row-wise traversal to diagonal-wise traversal a while back because of this reason.

Got it. Thanks for the clarification.


you may want to add a naive.searchsorted something like:

Cool! Thanks for the suggestion! So, if I understand correctly, for naive row-wise traversal, I use np.argsort, and for naive diagonal traversal, I can use naive.searchsorted.

Btw, note that stumpy.gpu_stump is row-wise traversal

I did not take a look at that. Thanks for letting me know about this.

NimaSarajpoor avatar May 01 '22 14:05 NimaSarajpoor

What do you think? Should this (small / rare ?) issue be taken care of?

No, I am less concerned about this example

So, if I understand correctly, for naive row-wise traversal, I use np.argsort, and for naive diagonal traversal, I can use naive.searchsorted.

That sounds reasonable.

seanlaw avatar May 02 '22 00:05 seanlaw

@seanlaw If you think I am missing something, please let me know.

  • I used diagonal traversal to get top-k matrix profile in naive.stump
  • I added a note for row-wise traversal so that I do not forget about it later

If I understand correctly, first I should take care of stumpy.stump and make sure that it passes the new unit tests for top-k matrix profile.

Then, I need to work on the naive version of stumpy.stamp to return the top-k matrix profile via row-wise traversal, and then work on stumpy.stamp (and gpu_stump) to make sure they pass their new test functions.

NimaSarajpoor avatar May 02 '22 19:05 NimaSarajpoor

If I understand correctly, first I should take care of stumpy.stump and make sure that it passes the new unit tests for top-k matrix profile.

Yes, this is correct. Not only should it pass the unit test for top-k but it should also pass all other unit tests without modifying those unit tests.

Then, I need to work on naive version of stumpy.stamp to return top-k matrix profile via row-wise traversal, and then I need to work on stumpy.stamp (and gpu_stump) to make sure they passes their new test functions.

Hmmm, so stamp should never come into consideration here. I think we need to update all of our unit tests to use naive.stump instead. Top-K should only be relevant to stump functions. All stamp functions are for legacy/reference and should never be used. In fact, we need to update all of our unit tests to call naive.stump instead of naive.stamp (this was likely missed over time as the code base grew). Does that make sense?

seanlaw avatar May 03 '22 14:05 seanlaw

Yes, this is correct. Not only should it pass the unit test for top-k but it should also pass all other unit tests without modifying those unit tests.

Correct! All existing unit tests and the new one(s) for top-k.

Hmmm, so stamp should never come into consideration here. I think we need to update all of our unit tests to use naive.stump instead. Top-K should only be relevant to stump functions.

So, our goal is adding parameter k to only stumpy.stump.

All stamp functions are for legacy/reference and should never be used. In fact, we need to update all of our unit tests to call naive.stump instead of naive.stamp (this was likely missed over time as the code base grew). Does that make sense?

Did you mean ALL test functions in tests/test_stump.py? Because, in that case, it makes sense. But not ALL tests, correct? My concern is stumpy.gpu_stump, which uses row-wise traversal (?). I can see that tests/test_gpu_stump.py uses naive.stamp, which does row-wise traversal as well. So, we should keep using naive.stamp in those cases. Did I get your point?

NimaSarajpoor avatar May 03 '22 18:05 NimaSarajpoor

So, our goal is adding parameter k to only stumpy.stump.

Yes! And, eventually, to stumpy.stumped and stumpy.gpu_stump. Of course, we'll need the same thing for aamp, aamped, and gpu_aamp as well but I'm assuming that it would be trivial once we figure out the normalized versions.

Did you mean ALL test functions in tests/test_stump.py? Because, in that case, it makes sense. But, not ALL tests? correct? My concern is stumpy.gpu_stump that uses row-wise traversal (?). I can see that tests/test_gpu_stump.py uses naive.stamp which does row-traversal as well. So, we should keep using naive.stamp in those cases. Did I get your point?

So, I was thinking that we would update naive.stump to have a row_wise=False input parameter (in addition to k=1) which controls whether it is traversed row-wise or diagonal-wise. Of course, we would update the contents of the naive.stump function itself to also process row-wise. Then, we can update all of the tests accordingly by replacing naive.stamp with naive.stump(..., row_wise=True). I don't mind if naive.stump is a little bit harder to read as long as we add proper comments to it.

Does that make sense? If not, please feel free to ask more clarifying questions! Your questions are helping me think through things as well as it has been a while since I've had to ponder about this.
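The row_wise idea could be sketched roughly like this. This is a toy top-1 version only (the real tests/naive.py handles much more; the exclusion-zone width and all names here are assumptions for illustration):

```python
import numpy as np

def z_dist(Q, S):
    # z-normalized Euclidean distance between two equal-length subsequences
    Q = (Q - Q.mean()) / Q.std()
    S = (S - S.mean()) / S.std()
    return np.linalg.norm(Q - S)

def toy_naive_stump(T, m, row_wise=False):
    n = len(T) - m + 1
    excl = int(np.ceil(m / 4))  # assumed exclusion-zone width
    D = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(n):
            if abs(i - j) > excl:
                D[i, j] = z_dist(T[i : i + m], T[j : j + m])

    P = np.full(n, np.inf)
    I = np.full(n, -1)
    if row_wise:  # like gpu_stump: per-row argmin (first index wins ties)
        for i in range(n):
            j = int(np.argmin(D[i]))
            P[i], I[i] = D[i, j], j
    else:  # like stump: diagonal order, strictly-smaller updates only
        for g in range(excl + 1, n):
            for i in range(n - g):
                j = i + g
                if D[i, j] < P[i]:
                    P[i], I[i] = D[i, j], j
                if D[i, j] < P[j]:
                    P[j], I[j] = D[i, j], i
    return P, I

T = np.random.default_rng(0).uniform(-100.0, 100.0, 40)
P_row, _ = toy_naive_stump(T, 5, row_wise=True)
P_diag, _ = toy_naive_stump(T, 5, row_wise=False)
# Profile values agree; only tie-broken indices can differ
assert np.allclose(P_row, P_diag)
```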

seanlaw avatar May 04 '22 11:05 seanlaw

@seanlaw

Of course, we would update the contents of the naive.stump function itself to also process row-wise. Then, we can update all of the tests accordingly by replacing naive.stamp with naive.stump(..., row_wise=True)

Got it. We do not touch stumpy.stump regarding this matter. We just add the parameter row_wise=False to the naive version, so that we can perform unit tests for stumpy.stump and related functions with just one naive function, naive.stump.

Then, we can update all of the tests accordingly by replacing naive.stamp with naive.stump(..., row_wise=True). I don't mind if naive.stump is a little bit harder to read as long as we add proper comments to it. Does that make sense?

It totally makes sense. Should we take care of this in parallel with this PR? like...at the end before finalizing everything?

NimaSarajpoor avatar May 04 '22 15:05 NimaSarajpoor

@seanlaw I think we have discussed the comments. Should I go ahead and apply them? I will revise naive.stump so that it can support row-wise traversal as well.

NimaSarajpoor avatar May 05 '22 02:05 NimaSarajpoor

Yes, please proceed!

seanlaw avatar May 05 '22 04:05 seanlaw

@seanlaw I am about to revise stumpy/stump.py as well to return top-k matrix profile. I will let you know when I am done so you can do the review. Thanks.

NimaSarajpoor avatar May 09 '22 19:05 NimaSarajpoor

Should I review the recent commits or shall I wait for the stump.py changes?

seanlaw avatar May 09 '22 19:05 seanlaw

Should I review the recent commits or shall I wait for the stump.py changes?

Please wait for the stump.py changes.

NimaSarajpoor avatar May 09 '22 19:05 NimaSarajpoor

@seanlaw Just wanted to inform you of this behavior in np.searchsorted:

import numpy as np
a = np.array([0, 1, 2, 3, 4, 2, 0], dtype=np.float64)
v = 3.9
idx = np.searchsorted(a, v, side='right') # or side='left', does not matter here!
# idx = 7, because np.searchsorted assumes `a` is fully sorted

So, if P stores the top-k AND the left & right matrix profile values (i.e., the last two columns), we had better do:

idx = np.searchsorted(a[:k], v, side='right') 

I also did a quick search and it seems basic indexing creates a view rather than a copy, and it can be done in constant time. So, it should be fine. What do you think?
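A quick check of the view behavior (basic slicing shares memory with the original array, so slicing off the first k columns costs essentially nothing):

```python
import numpy as np

a = np.arange(10.0)
b = a[:5]                        # basic slicing returns a view, not a copy
assert np.shares_memory(a, b)
b[0] = -1.0
assert a[0] == -1.0              # writes through to the original array
```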

NimaSarajpoor avatar May 10 '22 02:05 NimaSarajpoor

Just wanted to inform you of this behavior...

Because of this behavior, and also because the top-k rho values for each subsequence are in fact the top-k LARGEST rho values, I did the following:

# rho: the new Pearson correlation value
# a: array of rho values with `k+2` columns; the first k columns hold the
# top-k values (sorted in descending order) and the last two columns are
# for the left/right matrix profiles

# update top-k
if rho > a[..., k-1]:
    idx = len(a) - np.searchsorted(a[..., :k][::-1], rho)

# insert the new value `rho` into the array `a` at index `idx`

Alternatively, we can store the top-k smallest -rho values (note the negative sign). But I think keeping track of +rho is cleaner.


The modified stump.py, which accepts the parameter k, passes the new (top-k matrix profile) unit test as well as the other tests in test_stump.py. Please let me know what you think.

NimaSarajpoor avatar May 10 '22 04:05 NimaSarajpoor

@seanlaw Just wanted to inform you of this behavior in np.searchsorted:

import numpy as np
a = np.array([0, 1, 2, 3, 4, 2, 0], dtype=np.float64)
v = 3.9
idx = np.searchsorted(a, v, side='right') # or side='left', does not matter here!
# idx = 7

So, if P stores top-k AND left & right matrix profile values (i.e. two last columns), we might better do:

idx = np.searchsorted(a[:k], v, side='right') 

I also did a quick search and it seems basic indexing creates a view rather than a copy, and it can be done in constant time. So, it should be fine. What do you think?

Computationally, it is "fine". However, I think it might make sense to split the left/right matrix profile distances from the global matrix profile distances to avoid confusion. Because, not only do we have to find the insertion point, but we also need to make sure that we don't shift the left/right distances off of the vector. So we have P and I for the global matrix profiles (including top-k) and then, separately, PL, PR and IL, IR, which track the left and right distances and indices.

Finally, once everything is computed, we just push these data structures back into out where appropriate

seanlaw avatar May 10 '22 12:05 seanlaw

if rho > a[..., k-1]: idx = len(a) - np.searchsorted(a[..., :k][::-1], rho)

This is a bit confusing. If the past distances are sorted from smallest (index 0) to largest (index k-1), which is expected by np.searchsorted, then shouldn't we be checking if rho > a[0]? Also, if we maintain smallest-to-largest ordering, I believe that we can avoid needing to reverse the vector and subtracting the length in the second line. Currently, this is too confusing to keep in my head. I was thinking maybe something like:

if rho > a[0]:
    idx = np.searchsorted(a[..., :k], rho)

But instead of shifting things to the right during insertion, we shift things towards the left:

a[..., :idx] = a[..., 1:idx+1]
a[..., idx] = rho 

I'm sure that I'm making some mistakes here but I hope you get my point. Please feel free to ask for clarification
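For what it's worth, the shift-left idea seems to work with one index adjustment. This is a sketch only, assuming `a` holds the top-k rho values in ascending order (so a[0] is the smallest kept value):

```python
import numpy as np

def insert_topk_ascending(a, rho):
    # a: top-k rho values sorted ascending; a[0] is the smallest kept value
    if rho > a[0]:
        idx = np.searchsorted(a, rho)  # insertion point in ascending order
        a[: idx - 1] = a[1:idx]        # shift smaller values left, dropping a[0]
        a[idx - 1] = rho
    return a

a = np.array([0.1, 0.3, 0.5])
insert_topk_ascending(a, 0.4)
# a is now [0.3, 0.4, 0.5]
```

Since rho > a[0] guarantees idx >= 1, the slice a[:idx-1] is always valid; ascending order also avoids reversing the vector on every insertion.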

seanlaw avatar May 10 '22 12:05 seanlaw

No, I haven't starting reviewing yet so please go ahead.

seanlaw avatar May 10 '22 13:05 seanlaw

Previously, we compared D with 4 (the last element), but now we compare it with 0.6 (its equivalent rho value). So, everything is the same: if the new Pearson value is bigger than 0.6, we would like to insert it. Note that the array of top-k rho values is sorted in descending order. That is why we need to reverse it every time before feeding it to np.searchsorted :)

That's what I'm saying. We should store rho from smallest to largest (left to right in ascending order) instead of in reverse order if possible

seanlaw avatar May 10 '22 13:05 seanlaw