Add Top-K Multi-dimensional Motif Discovery Tutorial
Demonstrate how to find the top-K motifs and top-K nearest neighbors in multi-dimensional data
In the paper, it says:
we tested our method on an electrical load measurement dataset [21]. The dataset consists of electrical load measurements for individual appliances (and an aggregated load) from households in United Kingdom. Five appliances are considered: fridge-freezer, freezer, tumble dryer, dishwasher, and washing machine. The data was collected from April 19, 2014 to May 15, 2014, where the length is 17,000. The subsequence length was set to 4 hours.
Cross-referencing the cleaned version of the REFIT dataset, it was found that only "CLEAN_House3.csv" (under "Access Dataset > REFIT Power Data - 111215") contained the aforementioned appliances. It was not possible to reproduce the paper's data preprocessing steps exactly, but this should be sufficient:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

plt.style.use('https://raw.githubusercontent.com/TDAmeritrade/stumpy/main/docs/stumpy.mplstyle')

# Date range used in the paper
start_date, end_date = '2014-04-19', '2014-05-15'

df = pd.read_csv("CLEAN_House3.csv")
df["Time"] = pd.to_datetime(df["Time"])

# Map the generic REFIT column names to the actual appliances in House 3
colnames = {
    "Appliance1": "Toaster",
    "Appliance2": "Fridge-Freezer",
    "Appliance3": "Freezer",
    "Appliance4": "Tumble Dryer",
    "Appliance5": "Dishwasher",
    "Appliance6": "Washing Machine",
    "Appliance7": "Television",
    "Appliance8": "Microwave",
    "Appliance9": "Kettle",
}

# Keep only the five appliances mentioned in the paper, restrict to the date
# range, and aggregate the measurements down to one-minute resolution
df = (df.rename(colnames, axis="columns")
        .loc[:, ["Time",
                 "Fridge-Freezer",
                 "Freezer",
                 "Tumble Dryer",
                 "Dishwasher",
                 "Washing Machine"]]
        .query('Time >= @start_date and Time <= @end_date')
        .groupby(pd.Grouper(key='Time', freq="T")).sum()
     )

for colname in df.columns:
    plt.plot(df[colname], label=colname)
plt.legend()
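For reference, here is a minimal sketch (not taken from the paper) of how the multi-dimensional matrix profile could then be computed with `stumpy.mstump` on this dataframe, using `m = 4 * 60` for the paper's 4-hour subsequence length at one-minute resolution:

```python
import numpy as np
import stumpy

m = 4 * 60  # 4-hour subsequence length at one-minute resolution

# stumpy.mstump expects a 2-D array with one row per dimension
T = df.to_numpy().T.astype(np.float64)

# Multi-dimensional matrix profile (P) and matrix profile indices (I)
P, I = stumpy.mstump(T, m)
```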
Alternatively, there is also the PAMAP2 motif example by Jessica Lin (who has worked with Eamonn Keogh in the past). If you look at Section D, I think that the PAMAP2 dataset (Overview found here - Download zip file here) may be good for our tutorial. In the paper, they focused on the "ironing" activity and used a 12.5 second (12500 data point) window size, which they claimed allowed them to identify relevant hand motions (from the Y and Z hand coordinates) that are consistent with ironing while, presumably, other coordinates (from the chest and ankles) may have been irrelevant for ironing.
The dataset is supposed to be 1.3 million data points in length, which may be too much for a tutorial, but I wonder if we could downsample the data by 10x (i.e., only look at every 10th data point and thereby analyze only 130K data points for each dimension) and still be able to convey our point. I wanted to bring it to your attention in case it may be useful.
Hey!
Since we were discussing this topic in detail here and have already generalized the multidimensional motif discovery to finding the top k motifs and top k nearest neighbors, I would like to help with this tutorial!
@SaVoAMP Sounds good! I will assign it to you
Should I just use the REFIT dataset for the tutorial?
If so, do you think it would suffice to just use the dataset locally and refer to the "download instructions" in a comment?
If not: since the download is a 489 MB archive, I can't import it "on the fly".
Sadly, I don't have any servers to host the `CLEAN_House3.csv` file on except my own GitHub account. I presume that the .csv file won't be part of the repo? Maybe Git LFS could be a solution here?
Do you have a link to the cleaned-up version of the file that I could use for the tutorial, or should I create my own separate GitHub project containing only the file? Or would GitHub be unsuitable for that purpose? I'm asking because my software engineering professor mentioned that GitHub/GitLab are not suited for file hosting and that only source code should be uploaded there. I don't know to what extent he meant that. Or would you suggest using another dataset instead?
If so, do you think it would suffice to just use the dataset locally and refer to the "download instructions" in a comment?
Usually, as with our other tutorials, I end up uploading the single CSV file needed for each dataset to the Zenodo STUMPY collection. Please do not add the CSV file to the codebase as it will add unnecessary bloat. For now, in your tutorial, please assume that we'll eventually replace the `pd.read_csv` line above with a Zenodo dataset URL, so no comment should be needed.
Should I just use the REFIT dataset for the tutorial?
Yes, I think we can use the REFIT dataset but only a single interesting CSV that can (hopefully?) get our point/concept across.
Does this answer your question?
I'm asking because my software engineering professor mentioned that GitHub/GitLab are not suited for file hosting and that only source code should be uploaded there. I don't know to what extent he meant that. Or would you suggest using another dataset instead?
They are correct and I agree with them. Git repositories should really only be for code (even Jupyter notebooks are questionable but it's been okay so far for us).
All right, thank you!
How do you want to visualize the motif pair? Should we plot it inside the (currently very colorful) graph, or should I make a zoom-in of the motif pair and only plot that? As soon as we get matches, we need to plot them inside the graph anyway. But if I visualized all appliances in the same color so that we could see the motif pair and its nearest neighbors in another color, then we could not tell which appliance belonged to which time series. I would suggest plotting 5 subplots (one for each appliance) and afterwards visualizing the motifs and matches in another color, as I have done here, for example:
This was a 30-dimensional data set and, after selecting the k dimensions in which the motif is represented, I only plotted those dimensions with all of their matches. The motif pair can be seen in red.
But the problem with our dataset might be that there are plenty of spikes in each dimension, which may look odd when plotted on top of each other 🤔
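For reference, the subplot-per-dimension approach would look roughly like this (a sketch; the motif start positions `motif_idx` and `nn_idx` are hypothetical placeholders, e.g. taken from `stumpy.mmotifs` output):

```python
import matplotlib.pyplot as plt

motif_idx, nn_idx = 1000, 5000  # hypothetical motif pair start positions
m = 4 * 60

fig, axes = plt.subplots(len(df.columns), 1, sharex=True, figsize=(12, 8))
for ax, colname in zip(axes, df.columns):
    values = df[colname].to_numpy()
    ax.plot(values, label=colname)
    # Highlight the motif and its nearest neighbor in red
    ax.plot(range(motif_idx, motif_idx + m), values[motif_idx:motif_idx + m], c="red")
    ax.plot(range(nn_idx, nn_idx + m), values[nn_idx:nn_idx + m], c="red")
    ax.legend(loc="upper right")
plt.show()
```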
For things like this, it's difficult for me to judge without seeing the possible outcomes in context. If it's not too much work, is it possible to submit a PR with a notebook that displays/annotates the pros/cons of each approach to visualizing the results?
In case it matters, we don't necessarily need to use all 9 dimensions from the `CLEAN_House3.csv` dataset. We could always choose a subset where only 2 dimensions (i.e., washer + dryer) would be identified by MDL.
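As a rough sketch of that idea (column names taken from the preprocessing above), we could restrict the dataframe to two dimensions and let `stumpy.mmotifs` report the MDL-selected subspace:

```python
import numpy as np
import stumpy

m = 4 * 60

# Keep only the two (hopefully related) appliances
T = df[["Washing Machine", "Tumble Dryer"]].to_numpy().T.astype(np.float64)

P, I = stumpy.mstump(T, m)

# motif_subspaces contains the dimensions that MDL selects for each motif
motif_distances, motif_indices, motif_subspaces, motif_mdls = stumpy.mmotifs(T, P, I)
```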
Our time series has a length of 37,440, but the authors claim that their length is 17,000. If I understand everything correctly, you were converting the data to minutes. I'm trying to plot the motif pair but struggling since I have only been working with data frames since the beginning of my thesis. Since I was totally confused by the differing timesteps, I had only been plotting "data points" on the x-axis instead of time (until now). That's why I am struggling to plot the time series in time steps together with the motif now, since my indices aren't in time steps. It will probably take some time until I've got it 😆 So far, I'm only managing to decently plot the time series that contain the motif. But as soon as I draw in the motif, everything breaks down 😄
I think we might have the problem that the flat areas are more similar than the spikes that we want to catch. I'll do a pull request so that you can see what I mean. In essence, the motif pair is located at the left of three time series. Don't be surprised that I commented out the part that draws lines at the beginning of the motif pair since I was too stupid to get it working today.
Our time series has a length of 37,440, but the authors claim that their length is 17,000. If I understand everything correctly, you were converting the data to minutes.
I think that's right. In `.groupby(pd.Grouper(key='Time', freq="T")).sum()`, I'm basically grouping the data points within each minute and then summing up all of the energy consumption during that minute in order to reduce the overall data size. With 26 days at one-minute resolution, that works out to 26 × 1,440 = 37,440 data points, which matches the length you're seeing.
I think we might have the problem that the flat areas are more similar than the spikes that we want to catch.
I see. That's a nasty problem to have. I wonder if we can cheat/overcome this by adding ONE additional dimension that is simply the aggregate energy usage (I'm hoping that this will help reduce the flat areas)?
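Something along these lines (a sketch; the column name is just a placeholder):

```python
# Add ONE extra dimension that is simply the sum of all appliance readings
df["Aggregate"] = df.sum(axis=1)
```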
I don't think so since the similarity still holds in the other dimensions. I think that they will still be chosen and the additional one ignored.
What about adding white noise to constant regions?
Puhh, difficult to tell, but we could try. If it were a one-dimensional problem, I would have tried it with annotation vectors. But I don't know if this is possible for the multidimensional case since each dimension needs to be annotated separately.
Puhh, difficult to tell, but we could try. If it were a one-dimensional problem, I would have tried it with annotation vectors. But I don't know if this is possible for the multidimensional case since each dimension needs to be annotated separately.
I don't know why I didn't think of this earlier, but we could simply replace the constant regions of our time series with `np.nan`! This ensures that these regions MUST have a distance (in the distance profile) of `np.inf` and they would not be selected. Maybe something like:
import numpy as np

from stumpy import core

constant_idx = np.flatnonzero(core.rolling_nanstd(T, m) == 0.0)  # Constant regions will have a 0.0 stddev
T[constant_idx] = np.nan
Yeah, I could try, but I think that may eliminate too much. I think it's better to only set zero values to `np.nan`. But I'm also not sure if this is too much. It depends on the length of the subsequence that we set 🤔
Yeah, I could try, but I think that may eliminate too much. I think it's better to only set zero values to `np.nan`. But I'm also not sure if this is too much. It depends on the length of the subsequence that we set
So something like this? I think it makes sense for the length of the subsequence to be related to the size of `m`.
import numpy as np

from stumpy import core

zero_idx = np.flatnonzero(np.sum(core.rolling_window(T, m), axis=1) == 0.0)  # Zero regions will sum to 0.0
constant_idx = np.flatnonzero(core.rolling_nanstd(T, m) == 0.0)  # Constant regions will have a 0.0 stddev
# Take the intersection of the two index arrays (a bitwise `&` on the positions would be incorrect)
T[np.intersect1d(constant_idx, zero_idx)] = np.nan
I'm not sure if we should eliminate regions that are merely constant. I think it may be better to only set those regions to `np.nan` that are constantly zero, since this means that there is no electrical power demand. With my other data set, there were many problems when trying to eliminate some regions. I think we may eliminate too much since it looks as if some appliances have the same electrical power demand over some time period, and I'm not sure if we want to eliminate them. I think it's pretty difficult to find meaningful motifs here 🤔 I'll try experimenting a bit over the next few days.
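A per-dimension sketch of that "only constantly-zero regions" idea might look like this (assuming `m` is the subsequence length and `df` is the minute-resampled dataframe from above):

```python
import numpy as np

from stumpy import core

for colname in df.columns:
    T = df[colname].to_numpy().astype(np.float64)
    # Windows that sum to 0.0 are constantly zero (power readings are non-negative)
    zero_start_idx = np.flatnonzero(np.sum(core.rolling_window(T, m), axis=1) == 0.0)
    # Marking the window start positions with NaN is enough for those
    # subsequences to receive an infinite distance and be skipped
    T[zero_start_idx] = np.nan
    df[colname] = T
```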
What would you think of explaining the `mmotifs` function on a simplified example first and afterwards loading the difficult data set and trying to explore further, as you have done here? I could use the data that I used for the unit tests.
What would you think of explaining the `mmotifs` function on a simplified example first and afterwards loading the difficult data set and trying to explore further, as you have done here? I could use the data that I used for the unit tests.
In general, I've avoided using simplified examples. However, as you've pointed out, there are times where it may be useful and I am not opposed to it as long as we follow it with a real-world example (a la REFIT data).
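To give an idea of what such a simplified example might look like (synthetic data rather than the actual unit-test data; all values below are made up):

```python
import numpy as np
import stumpy

# A small 3-dimensional toy series with a motif pair planted in the first two dimensions
np.random.seed(0)
T = np.random.rand(3, 500)
pattern = np.sin(np.linspace(0, 2 * np.pi, 50))
for start in (100, 350):
    T[0, start:start + 50] += pattern
    T[1, start:start + 50] += pattern

m = 50
P, I = stumpy.mstump(T, m)
motif_distances, motif_indices, motif_subspaces, motif_mdls = stumpy.mmotifs(
    T, P, I, max_motifs=2, max_matches=5
)

print(motif_indices)    # start positions of each motif and its nearest neighbors
print(motif_subspaces)  # the MDL-selected dimensions for each motif
```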
Somehow the subsequence length looks very small. But the authors say that they used a subsequence length of 4 hours. Since you have changed the timestamps to minutes, I should use `m = 4 * 60`, right?
Somehow the subsequence length looks very small. But the authors say that they used a subsequence length of 4 hours. Since you have changed the timestamps to minutes, I should use `m = 4 * 60`, right?
That sounds about right to me.
I'm not sure what I'm doing wrong but it doesn't seem as if anything is eliminated.
@SaVoAMP At this stage, given that we don't know what dataset the authors had used and what preprocessing steps were taken, I don't think we can limit ourselves to reproducing their work anymore due to the lack of information. My thought process was more like "for the most part, humans tend to dry their laundry right after they wash it" so there's probably a motif there and maybe we can pick it up with matrix profiles. Maybe not? That was the extent of it. In this case, I think that we can take certain (reasonable) liberties to make things "work" as long as it is educational.
Alright!
My problem is that setting the values with zero standard deviation to `np.nan` doesn't work and I can't explain why. I wrote two if-conditions and, when I debug, I do get into the conditions. But setting the values to `np.nan` doesn't work, although the `constant_idx` vector is populated. I probably need to take a closer look at the data frame operations here.
Could this be because the timestamps are now in the left column instead of the data points 0, 1, 2, ..., n? But I think `constant_idx` is giving me the data point positions since `constant_idx = [239, 240, 241, ..., 4650, 4651, 4652]`. Is there a simple (data science) trick? I would insert another column that counts the data points and work with that instead of the timestamps. But that can certainly be done more elegantly.
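One thing I could try is positional indexing with `.iloc` (a sketch; `"Freezer"` is just a placeholder column and `constant_idx` holds integer positions as in the example above):

```python
import numpy as np

col = "Freezer"  # placeholder column name

# The dataframe index holds timestamps, so label-based indexing won't match
# the integer positions in constant_idx; .iloc indexes by position instead
df.iloc[constant_idx, df.columns.get_loc(col)] = np.nan

# Alternatively, work on the underlying NumPy array and assign it back
values = df[col].to_numpy()
values[constant_idx] = np.nan
df[col] = values
```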
@SaVoAMP I think it would be better to continue this conversation over in your open PR so that there is continuity of the discussion. Let's continue over there if you don't mind.