
Matrix Profile Top Ten draft

Open ken-maeda opened this pull request 3 years ago • 67 comments

Discovering motifs under uniform scaling

Following #85, I have added a "Discovering motifs under uniform scaling" tutorial based on Section 3.1 of The Matrix Profile Top Ten paper. This adds docs/Tutorial_Matrix_Profile_Top_Ten.ipynb.

The plots, markdown, and code still need some modification, but please have a look through and let me know if you find any problems.

Pull Request Checklist

Below is a simple checklist but please do not hesitate to ask for assistance!

  • [x] Fork, clone, and checkout the newest version of the code
  • [x] Create a new branch
  • [x] Make necessary code changes
  • [ ] Install black (i.e., python -m pip install black or conda install -c conda-forge black)
  • [ ] Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
  • [ ] Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
  • [ ] Run black . in the root stumpy directory
  • [ ] Run flake8 . in the root stumpy directory
  • [ ] Run ./setup.sh && ./test.sh in the root stumpy directory
  • [ ] Reference a Github issue (and create one if one doesn't already exist)

ken-maeda avatar Sep 05 '22 15:09 ken-maeda

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



Thank you @ken-maeda. Please allow me some time to review

seanlaw avatar Sep 05 '22 17:09 seanlaw

Codecov Report

Base: 99.89% // Head: 99.90% // Increases project coverage by +0.00% :tada:

Coverage data is based on head (164b645) compared to base (208b796). Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #667    +/-   ##
========================================
  Coverage   99.89%   99.90%            
========================================
  Files          80       80            
  Lines       11453    12289   +836     
========================================
+ Hits        11441    12277   +836     
  Misses         12       12            
Impacted Files | Coverage Δ
--- | ---
stumpy/aamp.py | 100.00% <0.00%> (ø)
stumpy/core.py | 100.00% <0.00%> (ø)
tests/naive.py | 100.00% <0.00%> (ø)
stumpy/aampi.py | 100.00% <0.00%> (ø)
stumpy/stimp.py | 100.00% <0.00%> (ø)
stumpy/stump.py | 100.00% <0.00%> (ø)
stumpy/aamped.py | 100.00% <0.00%> (ø)
stumpy/motifs.py | 100.00% <0.00%> (ø)
stumpy/scrump.py | 100.00% <0.00%> (ø)
stumpy/stumpi.py | 100.00% <0.00%> (ø)
... and 19 more


:umbrella: View full report at Codecov.

codecov-commenter avatar Sep 05 '22 18:09 codecov-commenter

Cool. This Swiss Army knife would be a great addition.

I was wondering if it would be okay for me to provide my feedback on this PR. I think this can help me improve my code-reviewing skills. I will try to avoid giving feedback that I am not sure of.

NimaSarajpoor avatar Sep 05 '22 18:09 NimaSarajpoor

@NimaSarajpoor Please go ahead and provide your feedback and I can follow!

seanlaw avatar Sep 06 '22 12:09 seanlaw

View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:03Z ----------------------------------------------------------------

I think "charpter" is typo. What about those three chapters? (Do we have datasets for them?), or did you mean "chapter 3's dataset"?

Are we going to have 10 chapters here in total? If yes, then you may want to briefly mention that. So, maybe something like this: "We have 10 chapters in total. The datasets for seven chapters are generated/simulated to reflect the corresponding figures in the original paper. The remaining three chapters ..."


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:04Z ----------------------------------------------------------------

Could you please add "(see Section 3.1)" to the hyperlink? That way, I can quickly realize that you are talking about Section 3.1 of the paper, and not Section 3.1 of this notebook.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:05Z ----------------------------------------------------------------

Here,we show that the existing approaches for motif detection are limited to discovering pattern occurrences of the same length, failing to capture similarities when the occurrences are uniformly scaled along the time axis

Please add "space" after "Here,". So, it becomes: "Here,[space] we show that"

Also, would it be possible to break this sentence down into two sentences? I think it is a little bit long.

Here is a time series of length 600. What is the best motif of length 120?

(Side note: you may want to briefly explain the data itself. What is this data?).

I think here is a good place to show the time series (without coloring any pattern). Readers can see the time series and think about the question themselves. Just a simple block of code like this should do the work:

import pandas as pd
import matplotlib.pyplot as plt

T = pd.read_pickle(...)  # dataset path goes here
plt.plot(T)
plt.title('TITLE')
plt.show()

What do you think?

Then, we can continue with what you said:

"If we are asked to point out the best repeated pattern of length 120, the answer appears trivial: there is an obvious repeated sine wave of approximate length 120 in two locations.(A and B)"

two locations.(A and B)

two locations, A and B.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:06Z ----------------------------------------------------------------

Line #1.    arr_yankov1 = pd.read_pickle("dataset/31_1_yankov1.pkl")

I think it is cleaner to use the variable T instead of arr_yankov1.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:06Z ----------------------------------------------------------------

Line #2.    x_ = np.array(range(len(arr_yankov1)))

I think it might be better/cleaner to just do:

x = np.arange(len(T))

View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:07Z ----------------------------------------------------------------

Line #3.    window_ = 120

I think window_ is the length of the pattern. Is that correct? If so, then, just for the sake of consistency with STUMPY, you may want to replace it with m.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:08Z ----------------------------------------------------------------

It looks A and B are almost identical

I think it might be a good idea to continue this sentence as follows:

Let's calculate the pairwise Euclidean distance between the three patterns A, B, and C.

Then, after the calculation, readers can see that the distance between A and B is greater than the distance between A and C. Then, you can explain why: "but the cumulative error ..."
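Something like this minimal sketch should do it (idx_b and idx_c are placeholder names here; T, idx_a, and m are assumed from the earlier cells):

import numpy as np

# extract the three patterns of length m
S_a = T[idx_a : idx_a + m]
S_b = T[idx_b : idx_b + m]
S_c = T[idx_c : idx_c + m]

# pairwise Euclidean distances
print("d(A, B) =", np.linalg.norm(S_a - S_b))
print("d(A, C) =", np.linalg.norm(S_a - S_c))
print("d(B, C) =", np.linalg.norm(S_b - S_c))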


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:09Z ----------------------------------------------------------------

Line #1.    arr_a = arr_yankov1[idx_a:idx_a+window_]

I think S_a = T[idx_a : idx_a + m] is better. We usually prefer to use S when we want to talk about a subsequence of a time series.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:10Z ----------------------------------------------------------------

Line #5.    def distance(x, y):

Please move this function to the top of the code block. Hence:

def distance(x, y):
    ...
    return ...

S_a = T[idx_a : idx_a + m]
S_b = ...
S_c = ...

S = np.vstack([S_a, S_b, S_c])
...

Now, you can easily follow the code.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:11Z ----------------------------------------------------------------

Line #9.    D = linkage(t_, method='single', metric=distance)

Okay.. this is just my personal opinion. Please feel free to let me know what you think.

I think hierarchical clustering (HC) is nice, but it adds some sort of noise to this notebook. I think it may not be necessary here because we only have three time series; the distance values between (A, B) and (A, C) should do the work. But, if you think HC is important, or if it is somehow related to other parts of this notebook, then please ignore this suggestion.


ken-maeda commented on 2022-09-09T03:35:06Z ----------------------------------------------------------------

This HC is from the original paper. To me, A looks more similar to B than to C, even in the A/B plot. But it might be too roundabout a way to explain these similarities.

NimaSarajpoor commented on 2022-09-09T15:24:02Z ----------------------------------------------------------------

Exactly. A looks more similar to B than to C. And I think the purpose of HC is to show that this is not true and that, in fact, A is closer to C than to B. Right? What I am saying here is that we can show this by just calculating the distances. So, if we show d(A, C) < d(A, B), then we should be fine. Right? (Maybe keep HC here for this long tutorial, and remove it from the shorter version.) I will leave it to you :)

View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:11Z ----------------------------------------------------------------

You may add a third item:

"3. Why not use the z-normalized Euclidean distance?"

(Are you familiar with z-normalized ED? This is basically the Euclidean distance between the subsequences after they have been z-normalized, i.e., transformed to have mean=0 and std=1.) It might be nice to calculate this distance for the pair A, B and the pair A, C.
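Just to sketch what I mean (a minimal, self-contained version; S_a, S_b, and S_c are assumed from the earlier code block):

import numpy as np

def z_norm_dist(x, y):
    # z-normalize each subsequence (mean=0, std=1), then compute the Euclidean distance
    x = (x - np.mean(x)) / np.std(x)
    y = (y - np.mean(y)) / np.std(y)
    return np.linalg.norm(x - y)

print("z-norm d(A, B) =", z_norm_dist(S_a, S_b))
print("z-norm d(A, C) =", z_norm_dist(S_a, S_c))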


ken-maeda commented on 2022-09-12T08:39:23Z ----------------------------------------------------------------

I think this paper is focusing on flexibility along the "time axis" by stretching the data.

And DTW should also bring some flexibility along the time axis. In my understanding, z-normalized ED is generally about magnitude.

I don't know, maybe I only know a part of it...

View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:12Z ----------------------------------------------------------------

Would you please explain what s is here?

Okay, after reading the text provided after the next code block, I realized s is the length of a prefix of C, which is greater than or equal to the length of Q, but less than or equal to the length of C. Am I right? Could you provide that information here? Because I, as a reader, would prefer to know what s is before moving forward.
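For reference, here is a rough sketch of how I understand the role of s (a hypothetical helper, not the notebook's code; Q and C are assumed to be 1-D numpy arrays with len(Q) <= len(C)):

import numpy as np

def uniform_scaling_dist(Q, C):
    # for each prefix length s (len(Q) <= s <= len(C)), uniformly rescale
    # the prefix C[:s] to the length of Q and keep the smallest distance
    n = len(Q)
    best = np.inf
    for s in range(n, len(C) + 1):
        idx = np.floor(np.arange(n) * s / n).astype(int)  # uniform re-sampling
        best = min(best, np.linalg.norm(Q - C[idx]))
    return best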


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:13Z ----------------------------------------------------------------

This is a very nice plot as it clearly shows the impact of scaling! :)


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:14Z ----------------------------------------------------------------

The math equation does not render properly here. I had this problem before. My suggestion is to put the math equation in a separate cell of the notebook. We usually do not need \begin{equation} / \end{equation} since the align environment should do the work. However, I do not know exactly how Jupyter reads the equations. So, if you notice you still have a problem, then you may want to nest an aligned environment inside equation:

\begin{equation}
    \begin{aligned}
        % my code
    \end{aligned}
\end{equation}

I think you did a very good job in this section. I read the Top Ten paper before and it was very hard for me. I think what you have provided here will help a lot of people understand that paper :)


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:14Z ----------------------------------------------------------------

Okay, after going through half of this section, I think it might be a good idea to separate the two cases: (1) without stretch and (2) with stretch.

I mean... we can first show the data (without stretching). We can calculate the top motif and show it. Then, we can stretch the data and calculate the top motif in this new case. (You can still show the original top motif in this new, with-stretch case.) Then, you can say:

"With the second half linearly stretched by 6%, the top motif changes to snippets of random walk." (Note that this sentence is a sort of conclusion; we should usually provide such a statement after showing the result.)


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:15Z ----------------------------------------------------------------

Line #1.    data = pd.read_pickle("dataset/31_1_MALLAT.pkl")

Can we use the same name T here? I mean, are we going to use the dataset from the previous section? If not, then you may reuse the name T here.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:16Z ----------------------------------------------------------------

Line #4.    w_ = 1024

Is this the window size? If yes, then please use m.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:17Z ----------------------------------------------------------------

Line #6.    data_sepA = data[:-strech_div]

Can we use left and right here? What do you think?

T_left = T[:-strech_div]

T_right = T[-strech_div:]


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:17Z ----------------------------------------------------------------

Line #9.    x_B = np.arange(len(data_sepB))

It seems we are using len(...) a lot. So, maybe do this first:

n_T = len(T)
n_left = len(T_left)
n_right = len(T_right)

And then reuse the variables n_T, n_left, and n_right.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:18Z ----------------------------------------------------------------

Line #10.    fitted_curve = interpolate.PchipInterpolator(x_B, data_sepB)

Is there any particular reason that we use PchipInterpolator here? I am not familiar with it, btw :) I am just trying to get some idea of the rationale behind choosing PchipInterpolator.
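(Looking at the SciPy docs, PchipInterpolator is a shape-preserving piecewise-cubic interpolant, so it avoids the overshoot that regular cubic splines can introduce. Perhaps the idea is something like this rough sketch, assuming T_right and the 6% stretch factor from the notebook:)

import numpy as np
from scipy import interpolate

# fit a shape-preserving piecewise-cubic interpolant to the segment
x_right = np.arange(len(T_right))
fitted_curve = interpolate.PchipInterpolator(x_right, T_right)

# re-sample the fitted curve on a 6%-longer grid to uniformly stretch the segment
n_stretched = int(len(T_right) * 1.06)
x_stretched = np.linspace(0, len(T_right) - 1, n_stretched)
T_right_stretched = fitted_curve(x_stretched)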


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:19Z ----------------------------------------------------------------

I think it might be a good idea to add the description of the plot after the plot itself. When I read this, I tried to take a look at the previous figure, and then realized you are talking about the next figure. So, maybe we should explain here what we are going to plot, then plot the figure, and then describe it.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:20Z ----------------------------------------------------------------

Line #1.    x_raw = np.arange(len(data))

I think it should be okay to remove _raw. So:

x = np.arange(n_T)  

Please consider applying a similar change to the next lines. For example:

mp = stumpy.stump(T, m)

View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:21Z ----------------------------------------------------------------

I think here is where you should provide that interpolation code. You did the interpolation and stretched the time series, but you haven't used that data until now (am I right?). So, to keep the flow, please provide the interpolation code here, plot the stretched data, and then calculate the top motif.
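Something along these lines (a rough sketch; T_stretched and T_right_stretched are placeholder names built from the earlier suggestions):

import numpy as np
import stumpy

# concatenate the unstretched left part with the stretched right part
T_stretched = np.concatenate([T_left, T_right_stretched])

# the top motif pair corresponds to the smallest matrix profile value
mp = stumpy.stump(T_stretched, m)
motif_idx = np.argsort(mp[:, 0])[0]
nn_idx = mp[motif_idx, 1]
print("Top motif at indices", motif_idx, "and", nn_idx)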


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:21Z ----------------------------------------------------------------

Cool! If possible, please illustrate the original top motif (one of its occurrences should be stretched now, right?). Now the reader can see what is going on!

So far so good :) You did a great job of putting the pieces together! Btw, you mentioned the challenge, but you didn't provide the solution. I think that is what you are doing in the sections below. Right? If yes, then it might be a good idea to mention that here:

"In the following sections, we show how we can resolve this issue by using the concept of uniform scaling," or something like that.


View / edit / reply to this conversation on ReviewNB

NimaSarajpoor commented on 2022-09-08T16:10:22Z ----------------------------------------------------------------

Line #2.    leaf_class4 = leaf.loc[leaf.iloc[:, 0] == 4, :]

Optional: maybe do this:

mask_class4 = leaf.iloc[:, 0] == 4

leaf_class4 = leaf.loc[mask_class4, :]

(Since mask_class4 is a boolean Series, .loc is the appropriate indexer here.)