stumpy icon indicating copy to clipboard operation
stumpy copied to clipboard

[WIP] Fixed #606: DAMP Notebook Reproducer / Tutorial

Open NimaSarajpoor opened this issue 2 years ago • 40 comments

I have tried a couple of times before to implement DAMP (see #606) but I noticed a couple of issues. Recently, I decided to give myself another chance and implement DAMP.

Tasks

  • [x] Implemented naive function for testing

  • [x] Implemented Table 1, 2, 3 of DAMP paper

  • [x] Reproduce a couple of figures [Data_to_Reproduce_Fig2_Fig3.zip]

  • [x] Enhance performance (partially done! Let's look for more opportunities and see MATLAB code as well)

  • [ ] Benchmarking

  • [ ] Implement the remaining tables / different versions of DAMP


Data The DAMP's supporting webapge provides that data used in the paper: see: https://sites.google.com/view/discord-aware-matrix-profile/dataset?authuser=0

The data used to reproduce Fig.2 and Fig.3 of the paper is provided here for convenience: Data_to_Reproduce_Fig2_Fig3.zip


Some notes from MATLAB / powerpoint slides

  • In most cases, the subsequence length you should use is between about 50 to 90% of a typical period (as rule of thumb, one may consider 70%.)
  • A good initial value of lookahead is about 2^nearest_power_of_two(16 * m); Note that this is different than the initial version where m is used instead of 16*m (Personal opinion: better to avoid 16 unless we find a good justifcation)
  • (Almost) constant subsequence should be handled with care. (Personal opinion: We may add the argument ignore_constant to API.)
  • When the subsequence with start index i is pruned: Left_MP(i) = Left_MP(i-1)-0.00001; Note that this is different than the initial version where we had Left_MP(i) = Left_MP(i-1) (The reason behind substracting a small value: To prevent a pruned subsequence from having the same score as the real best-so-far discord)
  • lookahead can be set as parameter. Whenever it is 0, then DAMP becomes a pure online algorithm with no pruning
  • Find top-k discords: If top-1 discord is at index idx, then the the second discord is the index of the global maximum in PL[:idx], where PL is the approximate left matrix profile obtained using DAMP algorithm. (Personal opinion: I think, for a given approx. left matrix profile computed by DAMP, the maximum value for k in top-k discord can be computed after the computation of approx. left matrix profile. Also, exclusion zone should be applied as well. So, I am not sure how user can set k and expect to see exactly k discords in the output. We should take another look at top-k DAMP MATLAB code)

NimaSarajpoor avatar Jun 11 '23 14:06 NimaSarajpoor

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Codecov Report

All modified lines are covered by tests :white_check_mark:

Comparison is base (857063c) 98.93% compared to head (df6e201) 98.93%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #872   +/-   ##
=======================================
  Coverage   98.93%   98.93%           
=======================================
  Files          84       84           
  Lines       14292    14292           
=======================================
  Hits        14140    14140           
  Misses        152      152           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jun 11 '23 15:06 codecov-commenter

[update]

  • I made some minor improvements to the code such as enhancing docstrings, texts, etc.

  • Fig. 2 and Fig. 3 of the paper are reproduced.

  • To reproduce the figures and check the answers, we can use link1 and/or link2

  • DAMP's MATLAB code (in particular, the "BackProcessing" part) is a bit different compared to what presented in the paper. As mentioned in one of the previous comments in this PR, the "BackProcessing" algorithm provided in the table is not optimized as it recomputes some of previously-analyzed subsequences. The MATLAB code, however, considers this: see this MATLAB code

  • DAMP's performance needs to be improved. We can check if the MATLAB code considers any other logic to improve performance as it seems it is different than what presented in the paper. The current notebook in this PR is based on the algorithm provided in the paper. We may also apply this method on LONG time series, say 500k, and compare its performance against the naive version.

NimaSarajpoor avatar Jun 11 '23 23:06 NimaSarajpoor

[update]

I improved the performance by:

  • using core._mass intead of core.mass
  • avoiding recomputing distances for the already-analyzed subsequences (this is different than the paper but somewhat similar to the code provided in MATLAB)

Also, I simplified the code by removing if/else and applying necessary changes to the code

I was able to find top-1 discord in a time series with length ~250k, in less than three minutes.

NimaSarajpoor avatar Jun 12 '23 03:06 NimaSarajpoor

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:50Z ----------------------------------------------------------------

"A subsequence is A DISCORD"

"left nearest neighbors so AS NOT TO miss, i.e., two rare subsequences that are also similar to each other"

I am purposely trying to avoid using "anomaly" where possible and to try and be precise


NimaSarajpoor commented on 2023-06-12T16:13:21Z ----------------------------------------------------------------

"left nearest neighbors so AS NOT TO miss, i.e., two rare subsequences that are also similar to each other"

just one question here: Did you intentionally remove "catching twin-freak cases" ? If yes, should we remove "i.e." as well?

NimaSarajpoor commented on 2023-06-12T16:14:04Z ----------------------------------------------------------------

I am purposely trying to avoid using "anomaly" where possible and to try and be precise

I see! Thanks for the input!

seanlaw commented on 2023-06-12T17:24:22Z ----------------------------------------------------------------

just one question here: Did you intentionally remove "catching twin-freak cases" ? If yes, should we remove "i.e." as well?

Oops! How about:

"left nearest neighbors so AS NOT TO miss the case where you have two rare subsequences that happen to also be similar to each other (i.e., a "twin freak")"

NimaSarajpoor commented on 2023-06-12T20:30:42Z ----------------------------------------------------------------

Sounds good. I will revise the code accordingly.

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:51Z ----------------------------------------------------------------

"interested in finding A DISCORD in an offline setting"

My definition for a "discord" is a "potential anomaly" but one needs to verify whether it is an actual anomaly


NimaSarajpoor commented on 2023-06-12T16:17:48Z ----------------------------------------------------------------

My definition for a "discord" is a "potential anomaly" but one needs to verify whether it is an actual anomaly

Cool!

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:52Z ----------------------------------------------------------------

"Getting Started" instead of "Import Libraries"


NimaSarajpoor commented on 2023-06-12T16:18:57Z ----------------------------------------------------------------

Noted. Will revise it accordingly.

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:53Z ----------------------------------------------------------------

"can compute THE (left) matrix profile"

"that is the start index of the DISCORD"


NimaSarajpoor commented on 2023-06-12T16:20:32Z ----------------------------------------------------------------

Again, the missing article THE :) Shame on me :D Thanks for catching that.

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:54Z ----------------------------------------------------------------

Line #27.        stumpy.config.STUMPY_EXCL_ZONE_DENOM = 1

I think this should be left OUTSIDE of naive_DAMP and, instead, in main, you'd do:

excl_zone = stumpy.config.STUMPY_EXCL_ZONE_DENOM

stumpy.config.STUMPY_EXCL_ZONE_DENOM = 1 naive_DAMP(T, m, split_idx) stumpy.config.STUMPY_EXCL_ZONE_DENOM = excl_zone


NimaSarajpoor commented on 2023-06-12T16:30:18Z ----------------------------------------------------------------

I will revise the code accordinly.

Side note:

I think I designed the code in a way that should work for any exclusion zone. I need to test it. (The algorithms in the paper are written in way that only consider excl_zone == m. For now, I am trying to be faithful to the paper as much as possible. That is why I am setting the excl_zone denom to 1 to reflect the paper's implementation)

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:55Z ----------------------------------------------------------------

Line #39.        PL_modified = np.where(PL==np.inf, np.NINF, PL)

It feels like we should make this optional rather than only allowing discords to be finite values? Maybe add a parameter of keep_infinite=False.


NimaSarajpoor commented on 2023-06-12T16:36:17Z ----------------------------------------------------------------

I think the only way one can get a discord with infinte value (score) is to have at least one non-finite value in the time series. Wouldn't it be better to not add the proposed argument to the API and instead just let user know regarding this matter?

I mean they can just do: core.rolling_finite(T[split_index:], m) .

What do you think?

seanlaw commented on 2023-06-12T17:29:11Z ----------------------------------------------------------------

I agree. Let's just put a clear note in the description for PL (i.e., all infinite distances are ignored)

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:55Z ----------------------------------------------------------------

Line #9.        discord_score, 

Instead of discord_score, just use bsf so it follows the paper


NimaSarajpoor commented on 2023-06-12T16:36:53Z ----------------------------------------------------------------

Will do so.

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:57Z ----------------------------------------------------------------

Line #10.        is_subseq_pruned,

Just call this pruned


View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:58Z ----------------------------------------------------------------

Line #36.        query_idx : int

Why have separate query_idx and start ? Aren't they ALWAYS the same value?


NimaSarajpoor commented on 2023-06-12T20:59:06Z ----------------------------------------------------------------

You are right! In _foreward_processing they are always the same value. I will remove the intermediate variablestart.

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:51:58Z ----------------------------------------------------------------

Line #52.        lookahead = np.power(2, int(np.ceil(np.log(m) / np.log(2))))

I recommend creating a helper function:

import math

def next_pow2(val):     return int(math.pow(2, math.ceil(log2(val))))


View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:52:00Z ----------------------------------------------------------------

Line #54.        start = query_idx

In the DAMP paper, why do they have start = i + m? But yours is different?


NimaSarajpoor commented on 2023-06-12T21:11:49Z ----------------------------------------------------------------

The paper considers a hardcoded exclusion zone of size m. However, I am trying to be flexible here and consider excl_zone as a paramater /variable.

So, the paper computes the distance between Q = T[i : i +m] and T[i+m : i+m+lookahead] . This makes sense as we do not want to compare Q with its trivial neighbors. (note that excl zone is m in the paper). As a side note, I think, in STUMPY, we should do add +1 to the range of indices, i.e. T[i+m+1 : i+m+1+lookahead]

So, I could do something like: T[i+excl_zone+1, i+excl_zone+1+lookahead ].But I thought it might be better to just consider T[i : i + lookahead] (so the Q is basically the first subsequence of this slice) and then, after computing the distance between Qand this slice, I can do:

core.apply_exclusion_zone(distance_profile, 0, ...)

My approach is not optimal, particularly if m is large. Because, in that case, excl_zone will also be large and we are computing distance between Q and a lot of its neighbors that are in excl_zone . So, I think we may just go with this:

T[i+excl_zone+1, i+excl_zone+1+lookahead]

What do you think?

seanlaw commented on 2023-06-13T00:32:53Z ----------------------------------------------------------------

I think that sounds reasonable

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:52:00Z ----------------------------------------------------------------

Line #58.            dist_profile = core._mass(

Just use D inplace of dist_profile


View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:52:01Z ----------------------------------------------------------------

Line #72.            IDX = start + np.flatnonzero(dist_profile < discord_score)

Replace IDX with mask

mask = np.argwhere(D < bsf) + start

pruned[mask] = True


NimaSarajpoor commented on 2023-06-12T21:14:12Z ----------------------------------------------------------------

Will revise the code accordingly.

View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:52:02Z ----------------------------------------------------------------

Line #121.        left_nn_distance : float

just call it distance 


View / edit / reply to this conversation on ReviewNB

seanlaw commented on 2023-06-12T14:52:03Z ----------------------------------------------------------------

Line #124.        discord_score : float

Just call it bsf 


@NimaSarajpoor This looks great. I've left some initial comments for you to consider

seanlaw avatar Jun 12 '23 14:06 seanlaw

"left nearest neighbors so AS NOT TO miss, i.e., two rare subsequences that are also similar to each other"

just one question here: Did you intentionally remove "catching twin-freak cases" ? If yes, should we remove "i.e." as well?


View entire conversation on ReviewNB

NimaSarajpoor avatar Jun 12 '23 16:06 NimaSarajpoor

I am purposely trying to avoid using "anomaly" where possible and to try and be precise

I see! Thanks for the input!


View entire conversation on ReviewNB

NimaSarajpoor avatar Jun 12 '23 16:06 NimaSarajpoor

My definition for a "discord" is a "potential anomaly" but one needs to verify whether it is an actual anomaly

Cool!


View entire conversation on ReviewNB

NimaSarajpoor avatar Jun 12 '23 16:06 NimaSarajpoor

Noted. Will revise it accordingly.


View entire conversation on ReviewNB

NimaSarajpoor avatar Jun 12 '23 16:06 NimaSarajpoor

Again, the missing article THE :) Shame on me :D Thanks for catching that.


View entire conversation on ReviewNB

NimaSarajpoor avatar Jun 12 '23 16:06 NimaSarajpoor

I will revise the code accordinly.

Side note:

I think I designed the code in a way that should work for any exclusion zone. I need to test it. (The algorithms in the paper are written in way that only consider excl_zone == m. For now, I am trying to be faithful to the paper as much as possible. That is why I am setting the excl_zone denom to 1 to reflect the paper's implementation)


View entire conversation on ReviewNB

NimaSarajpoor avatar Jun 12 '23 16:06 NimaSarajpoor

I think the only way one can get a discord with infinte value (score) is to have at least one non-finite value in the time series. Wouldn't it be better to not add the proposed argument to the API and instead just let user know regarding this matter?

I mean they can just do: core.rolling_finite(T[split_index:], m) .

What do you think?


View entire conversation on ReviewNB

NimaSarajpoor avatar Jun 12 '23 16:06 NimaSarajpoor

Will do so.


View entire conversation on ReviewNB

NimaSarajpoor avatar Jun 12 '23 16:06 NimaSarajpoor

just one question here: Did you intentionally remove "catching twin-freak cases" ? If yes, should we remove "i.e." as well?

Oops! How about:

"left nearest neighbors so AS NOT TO miss the case where you have two rare subsequences that happen to also be similar to each other (i.e., a "twin freak")"


View entire conversation on ReviewNB

seanlaw avatar Jun 12 '23 17:06 seanlaw

I agree. Let's just put a clear note in the description for PL (i.e., all infinite distances are ignored)


View entire conversation on ReviewNB

seanlaw avatar Jun 12 '23 17:06 seanlaw