stumpy
                        [WIP] Add Top-K Nearest Neighbors Matrix Profile #592
This PR addresses issue #592. In this PR, we want to extend the function stump and related functions so that they return the top-k nearest neighbors matrix profile (i.e., the k smallest values of each distance profile and their corresponding indices).
What I have done so far:
- Wrote a naïve implementation of stump_topk
- Provided a new unit test for function stump
Notes:
(1) np.searchsorted() is used in naive.stump_topk. Another naive approach is to traverse the distance matrix row by row and use np.argpartition() followed by np.argsort().
(2) I think keeping the top-k values in the first k columns of the output of stump_topk is cleaner, and it also makes it easier for the user to access those top-k values.
I can/will think about a clean way to restructure the output just before returning it so that the first four columns are the same for all k. However, I think that would make the output harder to use later in other modules.
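A rough sketch of the row-wise alternative mentioned in note (1), assuming a precomputed distance matrix; `topk_row_wise` is a hypothetical helper for illustration, not part of STUMPY:

```python
import numpy as np

def topk_row_wise(distance_matrix, k):
    # For each distance profile (one row of the distance matrix), take the
    # k smallest values with np.argpartition (unordered), then order those
    # k candidates with np.argsort.
    n = distance_matrix.shape[0]
    P = np.empty((n, k))
    I = np.empty((n, k), dtype=np.int64)
    for i, row in enumerate(distance_matrix):
        part = np.argpartition(row, k)[:k]  # indices of the k smallest, unordered
        order = np.argsort(row[part])       # order those k ascending by distance
        I[i] = part[order]
        P[i] = row[I[i]]
    return P, I

# toy distance matrix, k = 2
D = np.array([[3.0, 1.0, 2.0, 5.0],
              [0.0, 4.0, 2.0, 1.0]])
P, I = topk_row_wise(D, 2)
# P -> [[1., 2.], [0., 1.]], I -> [[1, 2], [0, 3]]
```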
To-Do List:
- [x] Add parameter `k` to `stump`
- [x] Add parameter `k` to `stumped`
- [x] Add parameter `k` to `gpu_stump`
- [x] Add parameter `k` to `scrump`
- [x] Add parameter `k` to `stumpi`
- [ ] Run published tutorials to see if they still work with no problem
Side notes:
- Add parameter `k` to the non-normalized versions.
- Check docstrings of functions that require the matrix profile `P` as input. For instance, in `stumpy.motifs`, we may say something like... "`P` is the (top-1) 1D array matrix profile"
@seanlaw I addressed the comments and changed the files accordingly.
@seanlaw
Also, I did an investigation and, as you said, it is possible that we have different indices when there is a tie. Now, the question is: "Should we really care about it?" Because, at the end of the day, what matters is the values of matrix profile P. Right?
For unit test, we can make sure the P values corresponding to those indices are close to each other (e.g. less than 1e-6). What do you think?
Also, I did an investigation and, as you said, it is possible that we have different indices when there is a tie. Now, the question is: "Should we really care about it?" Because, at the end of the day, what matters is the values of matrix profile P. Right?
If possible, we should try to account for the indices in the naive implementation. For naive.stump, I wonder if we can use np.lexsort to sort on multiple columns after we compute the distance profile? What order do we expect the indices to be returned in?
For unit test, we can make sure the P values corresponding to those indices are close to each other (e.g. less than 1e-6). What do you think?
No, we should avoid this for this PR. This should only be a last resort as it will hide other issues. We will keep it in mind but 99% of the time, this is not needed. Btw, since the input arrays in the unit tests are random, I will usually run them 1,000+ times on my machine for new features/tests (especially if there is a big change in the test) before I push the commit. So I don't just run it once and assume that it's fine.
I wonder if we can use np.lexsort to sort on multiple columns after we compute the distance profile? What order do we expect the indices to be returned in?
For naive.stump with row-wise traversal of the distance matrix (i.e., the implementation that I pushed), we can do np.argsort(..., kind="stable"). I also checked the documentation of np.lexsort, and it should work as well.
What order do we expect the indices to be returned in?
Now that we use np.argsort(..., kind="stable"), or alternatively np.lexsort, we can expect that the original order of indices is preserved throughout sorting.
No, we should avoid this for this PR. This should only be a last resort as it will hide other issues. We will keep it in mind but 99% of the time, this is not needed.
Correct! If there is no tie, the option kind="stable" may not be necessary. That is why I avoided it. And, if there is a tie, we will `# ignore indices` like you did in test_stump.py for cases with some identical subsequences.
Please correct me if I am wrong...
I think I am getting your point. You are trying to make sure that naive.stump works correctly with no hidden flaws. And, then, later, we can trust it and then do # ignore indices in case of testing stump with identical subsequences.
Now that we use np.argsort(..., kind="stable"), or alternatively np.lexsort, we can expect that the original order of indices is preserved throughout sorting.
Note that kind="stable" only implies that, given the same input, the output is deterministic. However, it does not imply that it will necessarily match STUMPY's deterministic order. For example, you can have a sorting algorithm that is still considered "stable" if it always returns the second (or even last) index when there is a tie. To be consistent with other parts of STUMPY, please use kind="mergesort" instead.
I think I am getting your point. You are trying to make sure that naive.stump works correctly with no hidden flaws. And, then, later, we can trust it and then do # ignore indices in case of testing stump with identical subsequences.
No, not quite. Currently, I believe that naive.stump (not your version) is guaranteed to return the same indices as stumpy.stump even if there are multiple identical nearest neighbor matches. We want to maintain this and continue asserting that the indices are the same between the naive version and the fast version. When possible, we should never ignore the indices!
Note that kind="stable" only implies that, given the same input, the output is deterministic. However, it does not imply that it will necessarily match STUMPY's deterministic order. For example, you can have a sorting algorithm that is still considered "stable" if it always returns the second (or even last) index when there is a tie.
I did not know that! Interesting.... Thanks for the info.
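For what it's worth, here is a toy example (my own, not from this PR) of how NumPy itself orders ties with the two kinds discussed above; in NumPy, both keep equal elements in their original order:

```python
import numpy as np

# Ties at indices 1 and 2: a stable sort in NumPy keeps the
# original relative order of equal elements, so index 1 comes
# before index 2 in the result.
d = np.array([2.0, 1.0, 1.0, 3.0])
print(np.argsort(d, kind="stable").tolist())     # [1, 2, 0, 3]
print(np.argsort(d, kind="mergesort").tolist())  # [1, 2, 0, 3]
```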
Currently, I believe that naive.stump (not your version) is guaranteed to return the same indices as stumpy.stump even if there are multiple identical nearest neighbor matches
It is supposed to be guaranteed (I think because they use a similar traversal method), but it is not:
```python
import numpy as np
import stumpy

seed = 0
np.random.seed(seed)
T = np.random.uniform(-100.0, 100.0, 20)
m = 3

seed = 1
np.random.seed(seed)
identical = np.random.uniform(-100.0, 100.0, m)
T[0:0 + m] = identical
T[10:10 + m] = identical
T[17:17 + m] = identical

mp = stumpy.stump(T, m)
I = mp[:, 1]
# I[10] is 0

naive_mp = original_naive_stump(T, m)  # using naive.stump()
naive_I = naive_mp[:, 1]
# naive_I[10] is 17
```
But, the indices of stumpy.stump and naive.stump should have been exactly the same. Right? Maybe I made a mistake somewhere. It would be great if you could try this example out on your end.
Our naive approach does NOT need to follow STUMPY's diagonal traversal.
Also, note that the way we traverse matters...

If we go row by row, the 5th distance profile is visited in the order A-B-C-D-E. However, if we go diagonally, we have C-D-B-E-A. So, I think we had better stay with diagonal traversal in the naive version. (Or, we may use an if-condition to check for ties and, if there is a tie, choose the one with the lower index while traversing diagonally.)
As you said, naive.stump (not my version) should provide the same indices because it uses diagonal traversal, similar to stumpy.stump. However, we just saw that the indices sometimes differ (in my previous comment). Maybe something goes wrong when stumpy.stump distributes diags among threads, or maybe there are some small numerical errors (i.e., precision)? I have not investigated it yet.
Or, we may use if-condition to check ties and if there is a tie, choose the one with lower index while traversing diagonally
No, traversing diagonally should solve all of your problems. If you store a distance and later find another distance that is the same, you do not replace the first index, no matter what that index is. You only replace the index if there is a smaller distance. The order in which the distances are encountered is key. This should still hold true for any k if done correctly.
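A minimal sketch (my own wording, not STUMPY's actual code) of the replacement rule described above; `P`, `I`, and `update` are illustrative names:

```python
import numpy as np

def update(P, I, i, j, d):
    # Replace only on a strictly smaller distance; on a tie, the index
    # encountered first (in traversal order) is kept.
    if d < P[i]:
        P[i] = d
        I[i] = j

P = np.full(3, np.inf)
I = np.full(3, -1)
update(P, I, 0, 5, 2.0)  # stored: P[0] = 2.0, I[0] = 5
update(P, I, 0, 7, 2.0)  # tie: not replaced, I[0] stays 5
update(P, I, 0, 9, 1.0)  # strictly smaller: replaced, I[0] becomes 9
```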
I am trying to clarify things and catch up with you. So, if you do not mind, please allow me to do a quick recap to make sure I am on the same page as you are.
(1) We would like to believe that the existing stumpy.stump and naive.stump give the same P and I EVEN IF there are identical subsequences. NOTE: I found some counter examples.
(2) For now, let us assume (1) is correct and everything is 100% OK. If we change naive.stump to traverse the matrix row by row (which is different from stumpy.stump, which traverses diagonally), we would like to believe that they give the same P and I regardless of their traversal method. NOTE: after the investigation, I realized that I was wrong, and different traversal methods give different matrix profile indices.
Therefore:
You only replace the index if there is a smaller distance. The order in which the distances are encountered is key
Exactly! I have the same thought. If we have diagonal traversal of the distance matrix in stumpy.stump, we need to traverse diagonally in the test function as well (and not row by row) so that we can expect them to give us the same matrix profile indices.
EVEN IF we traverse diagonally in both  stumpy.stump and naive.stump, we may still end up with two different matrix profile indices when there are ties.
(Btw, sorry for making you tired on this.)
(1) We would like to believe that existing stumpy.stump and naive.stump give the same P and I EVEN IF there are identical subsequences. NOTE: I found some counter examples
Yes, it's less a belief and more an expectation. What are the counter examples?
(2) For now, let us assume (1) is correct and everything is 100% OK. If we change naive.stump to traverse the matrix row by row (which is different from stumpy.stump, which traverses diagonally), we would like to believe that they give the same P and I regardless of their traversal method. NOTE: after the investigation, I realized that I was wrong, and different traversal methods give different matrix profile indices.
So, I don't expect row-wise traversal to give the same result as diagonal-wise traversal. In fact, IIRC, I updated naive.stump from row-wise traversal to diagonal-wise traversal a while back for this reason.
EVEN IF we traverse diagonally in both stumpy.stump and naive.stump, we may still end up with two different matrix profile indices when there are ties.
Why? How can this be possible?
Btw, you may want to add a naive.searchsorted something like:
```python
# tests/naive.py
def searchsorted(a, v):
    indices = np.flatnonzero(v < a)
    if len(indices):
        return indices.min()
    else:
        return len(a)
```
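For an ascending array, this helper should agree with np.searchsorted(..., side="right"); a quick self-contained check (my own, restating the helper so it runs on its own):

```python
import numpy as np

def searchsorted(a, v):
    # index of the first element of `a` that is strictly greater than `v`
    indices = np.flatnonzero(v < a)
    if len(indices):
        return indices.min()
    else:
        return len(a)

a = np.array([1.0, 2.0, 3.0, 4.0])
for v in (0.5, 2.0, 4.5):
    assert searchsorted(a, v) == np.searchsorted(a, v, side="right")
```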
Btw, note that stumpy.gpu_stump uses row-wise traversal. So, we may consider adding a diagonal_traversal=False flag to naive.stump in the future, which would do row-wise traversal rather than diagonal-wise traversal.
What are the counter examples?
EVEN IF we traverse diagonally in both stumpy.stump and naive.stump, we may still end up with two different matrix profile indices when there are ties.
Why? How can this be possible?
I am providing one example below.
After further investigation, I think I found the cause of the difference. It seems the discrepancy between stumpy.stump and naive.stump has its root in the numerical error of the Pearson correlation in stumpy.stump (e.g., 1.0000000004 vs. 1.0000000007).
```python
import numpy as np
import stumpy

seed = 0
np.random.seed(seed)
T = np.random.uniform(-100.0, 100.0, 1000)
m = 50

seed = 1
np.random.seed(seed)
identical = np.random.uniform(-100.0, 100.0, m)
identical_start_idx = [0, 150, 300, 450, 600, 750, 900]
for idx in identical_start_idx:
    T[idx:idx + m] = identical

mp = stumpy.stump(T, m)
I = mp[:, 1]

naive_mp = original_naive_stump(T, m)  # using naive.stump()
naive_I = naive_mp[:, 1]

# discrepancy
[(idx, naive_I[idx], I[idx]) for idx in np.flatnonzero(naive_I - I != 0)]
# >>>
# [(0, 150, 300),
#  (300, 150, 900),
#  (450, 300, 900),
#  (600, 450, 0),
#  (750, 600, 300),
#  (900, 750, 300)]
```
What do you think? Should this (small / rare?) issue be taken care of? As you correctly said, "it's less a belief and more an expectation." Now, I understand that we should expect the outputs to be exactly the same since they follow a similar process (the only difference is that stumpy.stump uses the Pearson correlation and converts it to a distance at the end).
So, I don't expect row-wise traversal to give the same result as diagonal-wise traversal. In fact, IIRC, I updated naive.stump from row-wise traversal to diagonal-wise traversal a while back for this reason.
Got it. Thanks for the clarification.
you may want to add a naive.searchsorted something like:
Cool! Thanks for the suggestion! So, if I understand correctly, for naive row-wise traversal, I use np.argsort, and for naive diagonal traversal, I can use naive.searchsorted.
Btw, note that stumpy.gpu_stump is row-wise traversal
I did not take a look at that. Thanks for letting me know about this.
What do you think? Should this (small / rare ?) issue be taken care of?
No, I am less concerned about this example
So, if I understand correctly, for naive row-wise traversal, I use np.argsort, and for naive diagonal traversal, I can use naive.searchsorted.
That sounds reasonable.
@seanlaw If you think I am missing something, please let me know.
- I used diagonal traversal to get top-k matrix profile in naive.stump
- I added a note for row-wise traversal so that I do not forget about it later
If I understand correctly, first I should take care of stumpy.stump and make sure that it passes the new unit tests for top-k matrix profile.
Then, I need to work on the naive version of stumpy.stamp to return the top-k matrix profile via row-wise traversal, and then I need to work on stumpy.stamp (and gpu_stump) to make sure they pass their new test functions.
If I understand correctly, first I should take care of stumpy.stump and make sure that it passes the new unit tests for top-k matrix profile.
Yes, this is correct. Not only should it pass the unit test for top-k but it should also pass all other unit tests without modifying those unit tests.
Then, I need to work on the naive version of stumpy.stamp to return the top-k matrix profile via row-wise traversal, and then I need to work on stumpy.stamp (and gpu_stump) to make sure they pass their new test functions.
Hmmm, so stamp should never come into consideration here. I think we need to update all of our unit tests to use naive.stump instead. Top-K should only be relevant to stump functions. All stamp functions are for legacy/reference and should never be used. In fact, we need to update all of our unit tests to call naive.stump instead of naive.stamp (this was likely missed over time as the code base grew). Does that make sense?
Yes, this is correct. Not only should it pass the unit test for top-k but it should also pass all other unit tests without modifying those unit tests.
Correct! All existing unit tests and the new one(s) for top-k.
Hmmm, so stamp should never come into consideration here. I think we need to update all of our unit tests to use naive.stump instead. Top-K should only be relevant to stump functions.
So, our goal is adding parameter k to only stumpy.stump.
All stamp functions are for legacy/reference and should never be used. In fact, we need to update all of our unit tests to call naive.stump instead of naive.stamp (this was likely missed over time as the code base grew). Does that make sense?
Did you mean ALL test functions in tests/test_stump.py? Because, in that case, it makes sense. But not ALL tests, correct? My concern is stumpy.gpu_stump, which uses row-wise traversal (?). I can see that tests/test_gpu_stump.py uses naive.stamp, which does row-wise traversal as well. So, we should keep using naive.stamp in those cases. Did I get your point?
So, our goal is adding parameter k to only stumpy.stump.
Yes! And, eventually, to stumpy.stumped and stumpy.gpu_stump. Of course, we'll need the same thing for aamp, aamped, and gpu_aamp as well but I'm assuming that it would be trivial once we figure out the normalized versions.
Did you mean ALL test functions in tests/test_stump.py? Because, in that case, it makes sense. But not ALL tests, correct? My concern is stumpy.gpu_stump, which uses row-wise traversal (?). I can see that tests/test_gpu_stump.py uses naive.stamp, which does row-wise traversal as well. So, we should keep using naive.stamp in those cases. Did I get your point?
So, I was thinking that we would update naive.stump to have a row_wise=False input parameter (in addition to k=1) which controls whether it is traversed row-wise or diagonal-wise. Of course, we would update the contents of the naive.stump function itself to also process row-wise. Then, we can update all of the tests accordingly by replacing naive.stamp with naive.stump(..., row_wise=True). I don't mind if naive.stump is a little bit harder to read as long as we add proper comments to it.
Does that make sense? If not, please feel free to ask more clarifying questions! Your questions are helping me think through things as well as it has been a while since I've had to ponder about this.
@seanlaw
Of course, we would update the contents of the naive.stump function itself to also process row-wise. Then, we can update all of the tests accordingly by replacing naive.stamp with naive.stump(..., row_wise=True).
Got it. We do not touch stumpy.stump regarding this matter. We just add the parameter row_wise=False to the naive version, so that we can perform unit tests for stumpy.stump and related functions with just one naive function, naive.stump.
Then, we can update all of the tests accordingly by replacing naive.stamp with naive.stump(..., row_wise=True). I don't mind if naive.stump is a little bit harder to read as long as we add proper comments to it. Does that make sense?
It totally makes sense. Should we take care of this in parallel with this PR? Like... at the end, before finalizing everything?
@seanlaw
I think we have discussed the comments. Should I go ahead and apply them? I will revise naive.stump so that it can support row-wise traversal as well.
Yes, please proceed!
@seanlaw
I am about to revise stumpy/stump.py as well to return top-k matrix profile. I will let you know when I am done so you can do the review. Thanks.
Should I review the recent commits or shall I wait for the stump.py changes?
Should I review the recent commits or shall I wait for the stump.py changes?
Please wait for the stump.py changes.
@seanlaw
Just wanted to inform you of this behavior in np.searchsorted:
```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 2, 0], dtype=np.float64)
v = 3.9
idx = np.searchsorted(a, v, side='right')  # or side='left', does not matter here!
# idx = 7, even though a[5] = 2 and a[6] = 0 are smaller than v,
# because np.searchsorted assumes `a` is sorted
```
So, if P stores the top-k AND the left & right matrix profile values (i.e., the last two columns), we might better do:

```python
idx = np.searchsorted(a[:k], v, side='right')
```
I also did a quick search and it seems basic indexing creates a view rather than a copy, and it can be done in constant time. So, it should be fine. What do you think?
Just wanted to inform you of this behavior...
Because of such behavior, and also because the top-k rho values for each subsequence are in fact the top-k LARGEST rho values, I did as follows:

```python
# rho: Pearson correlation value
# a: array of rho values with k + 2 columns; the first k columns hold the
#    top-k values, and the last two columns are for the left/right matrix profile

# update top-k
if rho > a[..., k - 1]:
    idx = len(a) - np.searchsorted(a[..., :k][::-1], rho)
    # insert the new value `rho` into the array `a` at index `idx`
```

Alternatively, we can store the top-k smallest -rho values (please note the negative sign). But I think keeping track of +rho is cleaner.
The modified stump.py, which accepts the parameter k, passes the new (top-k matrix profile) unit test as well as the other tests in test_stump.py. Please let me know what you think.
@seanlaw Just wanted to inform you of this behavior in np.searchsorted: [...] So, if P stores the top-k AND the left & right matrix profile values (i.e., the last two columns), we might better do idx = np.searchsorted(a[:k], v, side='right'). I also did a quick search and it seems basic indexing creates a view rather than a copy, and it can be done in constant time. So, it should be fine. What do you think?
Computationally, it is "fine". However, I think that it might make sense to split up our left/right matrix profile distances from the global matrix profile distances to avoid confusion. Because, not only do we have to find the insertion point, but we also need to make sure that we don't shift the left/right distances off of the vector. So we have P and I for the global matrix profiles (including top-k) and then, separately, have PL, PR and IL, IR, which track the left and right distances and indices.
Finally, once everything is computed, we just push these data structures back into `out` where appropriate.
```python
if rho > a[..., k-1]:
    idx = len(a) - np.searchsorted(a[..., :k][::-1], rho)
```
This is a bit confusing. If the past distances are sorted from smallest (index 0) to largest (index k-1), which is expected by np.searchsorted, then shouldn't we be checking rho > a[0]? Also, if we maintain smallest-to-largest ordering, I believe that we can avoid reversing the vector and subtracting from the length in the second line. Currently, this is too confusing to keep in my head. I was thinking maybe something like:
```python
if rho > a[0]:
    idx = np.searchsorted(a[..., :k], rho)
```
But instead of shifting things to the right during insertion, we shift things towards the left:
```python
a[..., :idx] = a[..., 1:idx+1]
a[..., idx] = rho
```
I'm sure that I'm making some mistakes here but I hope you get my point. Please feel free to ask for clarification
No, I haven't started reviewing yet, so please go ahead.
Previously, we compared D with 4 (the last element), but now we compare it with 0.6 (its equivalent rho value). So, everything is the same. If the new Pearson value is bigger than 0.6, we would like to insert it. Note that the array of top-k rho values is sorted in descending order. That is why we need to reverse it every time before feeding it to np.searchsorted :)
That's what I'm saying. We should store rho from smallest to largest (left to right, in ascending order) instead of in reverse order, if possible.
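If it helps, here is one way the ascending-order bookkeeping sketched above might look. This is only my reading of the suggestion; the helper name `update_topk` and the side="right" / minus-one choice are my assumptions, not a settled API:

```python
import numpy as np

def update_topk(a, rho):
    # `a` holds the current top-k LARGEST rho values, sorted ascending,
    # so a[0] is the smallest retained value (the replacement candidate).
    if rho > a[0]:
        # insertion point in the ascending array, minus one because the
        # current smallest value is about to be dropped
        idx = np.searchsorted(a, rho, side="right") - 1
        # shift left to drop the current smallest, then insert
        a[:idx] = a[1:idx + 1]
        a[idx] = rho
    return a

a = np.array([0.1, 0.4, 0.9])
update_topk(a, 0.6)   # a becomes [0.4, 0.6, 0.9]
update_topk(a, 0.05)  # no change: 0.05 is not among the top-3 largest
```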