
[ADD] Added first draft notebook for the multidimensional motif and matches discovery tutorial

Open SaVoAMP opened this issue 3 years ago • 33 comments

Pull Request Checklist

Below is a simple checklist but please do not hesitate to ask for assistance!

  • [x] Fork, clone, and checkout the newest version of the code
  • [x] Create a new branch
  • [x] Make necessary code changes
  • [x] Install black (i.e., python -m pip install black or conda install -c conda-forge black)
  • [x] Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
  • [x] Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
  • [ ] Run black . in the root stumpy directory
  • [ ] Run flake8 . in the root stumpy directory
  • [x] Run ./setup.sh && ./test.sh in the root stumpy directory
  • [x] Reference a Github issue (and create one if one doesn't already exist) #518

SaVoAMP avatar Mar 02 '22 22:03 SaVoAMP

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

I was not sure whether I was supposed to run black and flake8 on the Jupyter notebooks too, so I skipped those steps.

SaVoAMP avatar Mar 02 '22 22:03 SaVoAMP

Codecov Report

Base: 99.89% // Head: 99.89% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (d2d8876) compared to base (c01c04c). Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #557   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11399    11453   +54     
=======================================
+ Hits        11387    11441   +54     
  Misses         12       12           
Impacted Files       Coverage    Δ
stumpy/aamp.py       100.00%     <0.00%> (ø)
stumpy/core.py       100.00%     <0.00%> (ø)
tests/naive.py       100.00%     <0.00%> (ø)
stumpy/stomp.py      100.00%     <0.00%> (ø)
stumpy/stump.py      100.00%     <0.00%> (ø)
stumpy/aamped.py     100.00%     <0.00%> (ø)
stumpy/motifs.py     100.00%     <0.00%> (ø)
stumpy/mpdist.py     100.00%     <0.00%> (ø)
stumpy/scrump.py     100.00%     <0.00%> (ø)
stumpy/scraamp.py    100.00%     <0.00%> (ø)
... and 8 more


☔ View full report at Codecov.

codecov-commenter avatar Mar 02 '22 23:03 codecov-commenter

I was not sure whether I was supposed to run black and flake8 on the Jupyter notebooks too, so I skipped those steps.

Nope. black/flake8 is not needed for notebooks. Thanks for checking

seanlaw avatar Mar 03 '22 02:03 seanlaw

@SaVoAMP Said:

Alright! My problem is that setting values with zero standard deviation to np.nan doesn't work, and I can't explain why. I wrote two if-conditions, and when I debug I do reach those conditions. But setting the values to np.nan still doesn't work, even though the constant_idx vector is populated. I probably need to take a closer look at the DataFrame operations here. Could this be because the left (index) column now contains timestamps rather than the data point positions 0, 1, 2, ..., n? But I think constant_idx is giving me the data point positions, since constant_idx=[239, 240, 241, ..., 4650, 4651, 4652]. Is there a simple (data science) trick? I could insert another column that counts the data points and work with that instead of the timestamps, but that can certainly be done more elegantly.
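
For illustration, a minimal pandas sketch of the suspected cause; df, constant_idx, and the column name "power" are toy stand-ins, not taken from the actual notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's objects: a timestamp-indexed DataFrame
# and a list of *positional* row numbers for the constant region.
idx = pd.date_range("2022-03-02", periods=10, freq="min")
df = pd.DataFrame({"power": np.ones(10)}, index=idx)
constant_idx = [3, 4, 5]

# Positional assignment must go through .iloc; df.loc[constant_idx, "power"]
# would try to look up the integers 3, 4, 5 as timestamp *labels* and fail.
col = df.columns.get_loc("power")
df.iloc[constant_idx, col] = np.nan
```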

I will take a look at the notebook (click on the big dark purple button at the top of the page that says "ReviewNB"). Your best bet is to make your changes and push the notebook so that I can directly see what you are saying and we can comment (roughly) inline.

seanlaw avatar Mar 06 '22 19:03 seanlaw

@SaVoAMP I came across this paper by Jessica Lin (who has worked with Eamonn Keogh in the past). If you look at Section D, I think that the PAMAP2 dataset (Overview found here - Download zip file here) may be good for our tutorial. In the paper, they focused on the "ironing" activity and used a 12.5 second (12500 data point) window size, which they claimed allowed them to identify relevant hand motions (from the Y and Z hand coordinates) that are consistent with ironing while, presumably, other coordinates (from the chest and ankles) may have been irrelevant for ironing.

The dataset is supposed to be 1.3 million data points in length, which may be too much for a tutorial, but I wonder if we could downsample the data by 10x (i.e., only look at every 10th data point, thereby analyzing only 130K data points for each dimension) and still be able to convey our point. I wanted to bring it to your attention in case it may be useful.
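
A minimal sketch of the 10x downsampling idea; the file name and the space-separated layout are assumptions about the PAMAP2 files, not verified here:

```python
import pandas as pd

# Load one PAMAP2 subject file (assumed name/format), then keep every 10th
# row: roughly 1.3M data points shrink to roughly 130K per dimension.
df = pd.read_csv("subject101.dat", sep=" ", header=None)
df_small = df.iloc[::10].reset_index(drop=True)
```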

seanlaw avatar Mar 07 '22 01:03 seanlaw

The dataset is supposed to be 1.3 million data points in length, which may be too much for a tutorial, but I wonder if we could downsample the data by 10x (i.e., only look at every 10th data point, thereby analyzing only 130K data points for each dimension) and still be able to convey our point. I wanted to bring it to your attention in case it may be useful.

I'm not sure which dataset would be better 🤔 I just added your suggestions to eliminate constant regions. Now we have much better results. But I'm not sure why more of the 'zero' regions in the Tumble Dryer and Washing Machine data aren't eliminated. Maybe we would get good results if I also eliminated regions where the power demand is zero, since these regions are not important? Or do you think that would be too much? Do you think we can still get something out of the dataset, or would you rather switch to the other dataset?
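
For reference, eliminating zero-demand regions could look like this minimal sketch, with a toy array standing in for one appliance's power series; STUMPY treats np.nan as missing data and skips subsequences that overlap it:

```python
import numpy as np

# Toy stand-in for one appliance's power series.
T = np.array([0.0, 0.0, 1.2, 3.4, 0.0, 2.1])
T[T == 0.0] = np.nan  # mark zero-demand samples as missing
```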

SaVoAMP avatar Mar 07 '22 23:03 SaVoAMP

The dataset is supposed to be 1.3 million data points in length, which may be too much for a tutorial, but I wonder if we could downsample the data by 10x (i.e., only look at every 10th data point, thereby analyzing only 130K data points for each dimension) and still be able to convey our point. I wanted to bring it to your attention in case it may be useful.

I'm not sure which dataset would be better 🤔 I just added your suggestions to eliminate constant regions. Now we have much better results. But I'm not sure why more of the 'zero' regions in the Tumble Dryer and Washing Machine data aren't eliminated. Maybe we would get good results if I also eliminated regions where the power demand is zero, since these regions are not important? Or do you think that would be too much? Do you think we can still get something out of the dataset, or would you rather switch to the other dataset?

I think we should exhaust our options fully before moving on. The key thing is that there is no way for anybody (including you or me) to know what the best option is until we try. The good thing is that there is no rush here. We should assess as we go. If you are willing, let's "try something and see what happens". Let me know if you are getting frustrated by anything or if there are any other ways that I can help. However, I see that you are a good problem solver too. Know that we are in this together!

seanlaw avatar Mar 08 '22 00:03 seanlaw

I also tried to eliminate the regions that are zero since they aren't interesting. Unfortunately, this eliminates too much, and the program therefore doesn't work anymore. I think it might be generally difficult to find more than one motif in this dataset. Preprocessing alone is a major problem here. Either you eliminate too much or too little 😞

SaVoAMP avatar Mar 09 '22 13:03 SaVoAMP

Let me take a look

seanlaw avatar Mar 09 '22 15:03 seanlaw

[MOD] Rap visualization in a fuction

I can't lie, @SaVoAMP, I was really looking forward to seeing how you combine "Rap" and "visualization" 😸

seanlaw avatar Mar 11 '22 16:03 seanlaw

I can't lie, @SaVoAMP, I was really looking forward to seeing how you combine "Rap" and "visualization" 😸

Oh god, that is very embarrassing 😆 It's supposed to mean 'wrap' 🤣

SaVoAMP avatar Mar 11 '22 16:03 SaVoAMP

Hey, it's been attempted once before in this tutorial so I'm keeping my fingers crossed for a repeat 🤣

seanlaw avatar Mar 11 '22 16:03 seanlaw

@SaVoAMP No rush but please let me know when this is ready for review.

seanlaw avatar Mar 19 '22 12:03 seanlaw

Yes, I will! Things are just quite stressful for me right now with my bachelor thesis, because I'm trying to write an evaluation function that takes all possible special cases into account. My goal is to have it ready by the middle of next week, and then I'll have a little more time for the tutorial!

If you happen to have a little more time, you could take a quick look to see whether I'm going in the right direction with the tutorial 😄 Otherwise, I would first write down the complete tutorial and then let you know once I think it covers everything that is important or interesting in my eyes, so that you can give me overall feedback.

SaVoAMP avatar Mar 19 '22 12:03 SaVoAMP

If you happen to have a little more time, you could take a quick look to see whether I'm going in the right direction with the tutorial 😄 Otherwise, I would first write down the complete tutorial and then let you know once I think it covers everything that is important or interesting in my eyes, so that you can give me overall feedback.

Everything looks pretty good so far. Maybe the only major suggestion is that for Figure 3 (Tumble Dryer), it would be good to make it 2 subplots (stacked one on top of the other), where the top plot is what you currently have and the bottom plot is an overlay of all of the motifs+matches (don't forget to z-normalize them). Like this overlay
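
A rough sketch of the suggested two-panel layout, with toy stand-ins for the Tumble Dryer series (T), the motif/match start indices (motif_matches), and the window size (m); none of these names come from the notebook:

```python
import matplotlib.pyplot as plt
import numpy as np

# Toy data standing in for the notebook's series and discovered matches.
rng = np.random.default_rng(0)
T = rng.random(500)
motif_matches = [50, 200, 350]
m = 60

def z_norm(a):
    return (a - np.mean(a)) / np.std(a)

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
axes[0].plot(T)  # top panel: the plot as it currently exists
for start in motif_matches:
    axes[1].plot(z_norm(T[start : start + m]))  # bottom: z-normalized overlay
plt.tight_layout()
plt.show()
```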

seanlaw avatar Mar 19 '22 12:03 seanlaw

All right, good suggestion! I was actually planning to do that, though only after the second plot, where only the motif pair is drawn in. But of course I can also do that for the first plot with default parameters!

SaVoAMP avatar Mar 19 '22 15:03 SaVoAMP

I think I'm ready for the first review now 😄

SaVoAMP avatar Mar 25 '22 21:03 SaVoAMP

Thank you for letting me know. I will find some time to review it

seanlaw avatar Mar 26 '22 00:03 seanlaw

@SaVoAMP Overall, I really like the contents! You've covered a lot of important points and explained a lot of the concepts well. It might be worth spending some more time elaborating on (or reiterating) some of your points and maybe on how to think about things in multiple dimensions. Are there any "gotchas" when using mstump or mmotifs that one should always keep in mind? If so, it would be good to say it often as a "friendly reminder" or even in the form of a warning (like the pink boxes in the multi-dimensional tutorial).
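
As one possible example of such a gotcha (a detail from the STUMPY documentation, illustrated here with toy data): mstump returns one matrix profile row per number of dimensions, so readers need to pick the row and subspace that match their motif rather than using all dimensions at once.

```python
import numpy as np
import stumpy

T = np.random.rand(3, 1000)  # a 3-dimensional toy time series
m = 50
P, I = stumpy.mstump(T, m)
print(P.shape)  # (3, 951): rows are the 1-D, 2-D, and 3-D matrix profiles
```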

seanlaw avatar Mar 26 '22 22:03 seanlaw

@SaVoAMP What joke am I missing/overlooking? :)

seanlaw avatar Apr 09 '22 15:04 seanlaw

Now our dataset contains the time series of the five appliances, with timestamps now sampled in minutes. Let's visualize it! [images: fridge freezer, freezer, dishwasher, Washing Machine and Tumble Dryer] Ok, now that we've marveled at those beautifully visualized household appliances, we're ready to visualize those time series as well.

The humble joke was visualizing the appliances with pictures (instead of visualizing the dataset by plotting the data). Don't worry, not all jokes in Germany are that serious, I just didn't want to go overboard with that one 😄

SaVoAMP avatar Apr 09 '22 16:04 SaVoAMP

My initial intention was to clarify the difference between a fridge-freezer and a freezer and I guess I've kind of embedded that clarification as a humble joke 🥲

SaVoAMP avatar Apr 09 '22 16:04 SaVoAMP

Gotcha! 😂

seanlaw avatar Apr 09 '22 17:04 seanlaw

Shouldn't we allow max_matches to be an array or a list of the size of max_motifs, like we have done with cutoffs? I'm asking since I'm trying to find out how well motif discovery using matrix profiles is suited to automatically annotating multidimensional time series data. I wrote a script that calculates the matrix profile for each new subsequence length and tries to find the motifs with mmotifs. I have different sublabels that tell me, for example, that I'm searching for 5 occurrences of one sublabel type and 7 of another. From these I get 2 different subsequence lengths (for example by computing the average length of each sublabel type). If the two subsequence lengths are different, I only have to compute the matrix profile and find one motif with mmotifs both times. But if they have the same length, the matrix profile should only be computed once, yet I want to find two motifs with mmotifs. Here I've got the problem that I can't tell mmotifs that I'm searching for 5 matches for the first motif and 7 matches for the second.

Wouldn't it make more sense to allow max_matches to also be an array, but with a default array that only contains the number 10, so that nothing changes for users who don't want a different number of matches for each motif? We could do exactly the same as we have done with cutoffs, and therefore, if only one value is set, this value could be applied to every dimension. 🤔
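
For concreteness, the proposal would look something like this (hypothetical; this is the *proposed* change, not the current stumpy.mmotifs signature):

```python
# Hypothetical only -- mirrors how cutoffs already accepts multiple values.
motif_distances, motif_indices, motif_subspaces, motif_mdls = stumpy.mmotifs(
    T, P, I, max_matches=[5, 7], max_motifs=2  # 5 matches for motif 1, 7 for motif 2
)
```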

@SaVoAMP I hope that I am understanding your intent correctly but what you are describing (in terms of computing multiple multi-dimensional matrix profiles for different window sizes) sounds more complex than the average use case. Instead of changing mmotifs, I find myself wondering if you are able to get what you need in a post-processing step after mmotifs is called?

Here I've got the problem that I can't tell mmotifs that I'm searching for 5 matches for the first motif and 7 matches for the second.

In this case, I would call mmotifs with max_matches=7 (or whatever the largest number needs to be) and then post-process the results afterward by removing the parts that are not needed or are superfluous. Even if you changed/modified max_matches to be variable for each motif, you'd still need to post-process the results anyway because the width of, say, motif_distances would still (possibly) be as wide as 7 matches for all motifs (i.e., the ones where you request max_matches=5 will have np.inf padded to them). I do not wish to return a jagged/ragged motif_distances as this would cause inconsistencies with motifs (1-D).
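
A hedged sketch of this post-processing approach, with toy data standing in for T and m:

```python
import numpy as np
import stumpy

T = np.random.rand(3, 2000)  # toy multi-dimensional time series
m = 50
P, I = stumpy.mstump(T, m)
motif_distances, motif_indices, motif_subspaces, motif_mdls = stumpy.mmotifs(
    T, P, I, max_matches=7, max_motifs=2  # request the largest count needed
)
# Post-process (assuming two motifs were actually found): keep only the
# first 5 matches for the first motif and all 7 for the second.
first_motif_matches = motif_indices[0][:5]
second_motif_matches = motif_indices[1]
```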

I understand that it may seem like "fewer things to compute" by providing an array/list for max_matches, but that gain is likely only very, very small. I'm not sure it is something that we'd want to optimize. Again, I'm not opposed to your idea, but I am trying to reduce/minimize the code complexity, especially when the computational gain may not be substantial. Maybe I'm missing your point, but it's starting to feel like we are on the verge of "doing too much".

By the way, today I was thinking about this again. I don't think that it would be enough to post-process the results of the mmotifs function after searching for the largest number of matches for every motif. Let's imagine again that we want to find one motif with max_matches=5 and another motif with max_matches=7. If the whole time series only contains 7+5=12 regions where the motifs/matches are present, with only little space between them, we would eliminate too much when searching for 7 matches for both motifs in the first step. If we want to find 5 matches for the first motif and 7 for the second, we would first find 7 matches of the first one (to be post-processed later down to 5) with mmotifs. Afterwards an exclusion zone is applied, which is why we can only find 5 more matches for the second motif: the last two matches can't be found since the exclusion zone also covers the two matches that were found unsuitable for the first motif. That means we have found 7 matches for the first motif (which should only contain 5) and could therefore only find 5 matches for the second motif (which should contain 7). Post-processing wouldn't help here: although we could post-process the first motif to only contain 5 matches, we still couldn't recover the two matches that should belong to the second motif.

Here is a quick sketch to aid visualization: [sketch image]

SaVoAMP avatar Apr 09 '22 18:04 SaVoAMP

By the way, today I was thinking about this again. I don't think that it would be enough to post-process the results of the mmotifs function after searching for the largest number of matches for every motif. Let's imagine again that we want to find one motif with max_matches=5 and another motif with max_matches=7. If the whole time series only contains 7+5=12 regions where the motifs/matches are present, with only little space between them, we would eliminate too much when searching for 7 matches for both motifs in the first step. If we want to find 5 matches for the first motif and 7 for the second, we would first find 7 matches of the first one (to be post-processed later down to 5) with mmotifs. Afterwards an exclusion zone is applied, which is why we can only find 5 more matches for the second motif: the last two matches can't be found since the exclusion zone also covers the two matches that were found unsuitable for the first motif. That means we have found 7 matches for the first motif (which should only contain 5) and could therefore only find 5 matches for the second motif (which should contain 7). Post-processing wouldn't help here: although we could post-process the first motif to only contain 5 matches, we still couldn't recover the two matches that should belong to the second motif.

Hmm. In this contrived example, I'd argue that the time series data is simply too short and somewhat of a "worst-case-scenario" or extreme edge case. I'm not convinced that it's as common as you are presenting it to be. Even if that were true, mmotifs should only be used to find multi-dimensional motifs in the "easiest" cases and should not be used to solve ALL cases (especially not edge cases). Instead, if a user's data/results were this limited, they should really stop using mmotifs and pivot toward writing their own motif discovery function or establishing their own custom process. For STUMPY, I want us to keep in mind that "we don't want (mmotifs) to be everything for everyone" and should avoid being worried about these extreme edge cases. They are certainly important but they are well beyond the scope of what STUMPY can/should support. Otherwise, we'd spend all of our time chasing "perfection" when really what we should be aiming for is "good enough for obvious use cases". Does that make sense?

seanlaw avatar Apr 09 '22 22:04 seanlaw

I came to the conclusion that it wouldn't work at all, since we don't know which motif will be found first. Therefore we aren't able to tell the function that the first motif should be found with 5 repetitions and the other with 7, for example. So it wouldn't make sense at all.

SaVoAMP avatar Apr 15 '22 21:04 SaVoAMP

Today I came across this paper: https://core.ac.uk/download/pdf/287941767.pdf

In the paper, the authors discuss the implications of z-normalization for the matrix profile. They claim that, when comparing subsequences that are relatively flat and noisy, the resulting distance is high despite the visual similarity of these subsequences. Therefore they derived a method to mitigate the poor performance of the z-normalized Euclidean distance when series contain flat and noisy subsequences. I thought that this could also be helpful for our problem in the tutorial 🤔 If you haven't looked at it yet, maybe I could take a closer look at the paper after I finish my bachelor thesis and find out whether it may help.

SaVoAMP avatar Apr 20 '22 13:04 SaVoAMP

Today I came across this paper: https://core.ac.uk/download/pdf/287941767.pdf

@SaVoAMP I did come across that paper a few years ago when it was published, but I have not looked at it recently. While the research may be valid, I am concerned that the tutorial would become too distracted from its focus on the mmotifs parameters. I don't want to throw too many different concepts at the reader and have it feel too overwhelming. What do you think?

I've created a separate issue #591

seanlaw avatar Apr 20 '22 13:04 seanlaw

Yes, that makes sense! So we could first finish the tutorial with an easier concept and I could dive into the other problem afterwards.

SaVoAMP avatar Apr 20 '22 14:04 SaVoAMP